#panfrost on 2022-08-18 — irc logs at oftc.irclog.whitequark.org

2022-08-14 19:45 ChanServ changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://oftc.irclog.whitequark.org/panfrost - <macc24> i have been here before it was popular

00:00 jambalaya has joined #panfrost

00:15 kenzie7 has joined #panfrost

00:17 warpme___ has quit []

00:42 kenzie7 has quit []

00:43 benpouls_ has joined #panfrost

00:48 kenzie7 has joined #panfrost

00:49 benpoulson has quit [Ping timeout: 480 seconds]

01:05 floof58 has quit [Ping timeout: 480 seconds]

01:07 floof58 has joined #panfrost

01:20 floof58 is now known as Guest408

01:20 floof58 has joined #panfrost

01:24 Guest408 has quit [Ping timeout: 480 seconds]

02:26 benpouls_ has quit [Remote host closed the connection]

02:40 floof58 has quit [Ping timeout: 480 seconds]

02:42 benpoulson has joined #panfrost

02:43 floof58 has joined #panfrost

02:59 benpoulson has quit [Remote host closed the connection]

02:59 benpoulson has joined #panfrost

03:12 davidlt has joined #panfrost

04:05 hanetzer1 has joined #panfrost

04:06 hanetzer has quit [Ping timeout: 480 seconds]

04:12 carlosstive[m] has quit [autokilled: Spambot. Mail support@oftc.net if you think this is in error. (2022-08-18 04:12:09)]

04:17 hanetzer2 has joined #panfrost

04:19 hanetzer1 has quit [Ping timeout: 480 seconds]

04:21 davidlt has quit [Ping timeout: 480 seconds]

04:58 soreau has quit [Read error: Connection reset by peer]

04:59 soreau has joined #panfrost

05:11 nlhowell has joined #panfrost

05:56 davidlt has joined #panfrost

06:20 tomeu5 has joined #panfrost

06:20 sergi8 has joined #panfrost

06:21 ndufresne5 has joined #panfrost

06:21 ndufresne has quit [Read error: Connection reset by peer]

06:21 sergi has quit [Write error: connection closed]

06:25 MajorBiscuit has joined #panfrost

06:25 tomeu has quit [Ping timeout: 480 seconds]

06:27 Major_Biscuit has quit [Ping timeout: 480 seconds]

08:18 warpme___ has joined #panfrost

08:43 indy_ has joined #panfrost

08:45 indy has quit [Ping timeout: 480 seconds]

08:59 rkanwal has joined #panfrost

09:17 atler has joined #panfrost

09:23 indy_ is now known as indy

09:24 anarsoul|2 has joined #panfrost

09:24 anarsoul has quit [Read error: No route to host]

09:51 zhxt_ has quit [Read error: Connection reset by peer]

09:55 zhxt_ has joined #panfrost

10:54 anarsoul has joined #panfrost

10:55 anarsoul|2 has quit [Read error: Connection reset by peer]

11:17 warpme___ has quit []

11:27 Danct12 has quit [Read error: Connection reset by peer]

11:27 rkanwal has quit [Read error: Connection reset by peer]

11:28 rkanwal has joined #panfrost

12:46 ndufresne5 is now known as ndufresne

13:24 nlhowell has quit [Ping timeout: 480 seconds]

14:12 soreau has quit [Read error: Connection reset by peer]

14:12 soreau has joined #panfrost

14:47 <greenjustin> Would folks here happen to know much about the performance of Mali texture pipelines? I've found this really good doc for Midgard, but I can't seem to find anything publicly available about Bifrost or Valhall, was curious if someone here had any performance numbers: https://community.arm.com/arm-community-blogs/b/graphics-gaming-and-vr-blog/posts/the-mali-gpu-an-abstract-machine-part-3---the-midgard-shader-core

15:05 <robmur01> there's a matching set of blogs about Bifrost, not sure about Valhall though

15:06 <robmur01> oh, my bad, it's actually part 4 of that same blog series

15:12 <greenjustin> oh! thank you so much!

15:12 <greenjustin> ok so it's still about 1 texel per core

15:12 <greenjustin> *per clock

15:13 <robmur01> there are various other nuggets scattered around in Arm docs like the best practices guide and GPU datasheet

15:18 <robmur01> of course subsequent GPUs iterated quite hard on the initial G71 design, so it's ended up probably being the least representative of Bifrost overall

15:18 <greenjustin> Ah, and it looks like valhall is 4 texels per clock per core

15:19 <greenjustin> https://developer.arm.com/documentation/102203/0100/Texture-unit

15:19 <greenjustin> Yeah but I think the MT8183 is G71, right? So that's probably still relevant to what I'm doing

15:43 zekra has joined #panfrost

15:47 zekra has left #panfrost [#panfrost]

15:53 stepri01 has quit [Remote host closed the connection]

16:40 Danct12 has joined #panfrost

17:09 MajorBiscuit has quit [Quit: WeeChat 3.5]

17:15 karolherbst has quit [Read error: Connection reset by peer]

17:15 karolherbst_ has joined #panfrost

17:16 karolherbst_ is now known as karolherbst

17:19 Lyude has quit [Quit: Bouncer restarting]

17:20 Lyude has joined #panfrost

18:11 davidlt has quit [Ping timeout: 480 seconds]

18:50 davidlt has joined #panfrost

18:51 rkanwal has quit [Ping timeout: 480 seconds]

18:53 rasterman has joined #panfrost

19:09 alyssa has joined #panfrost

19:10 <alyssa> greenjustin: there are some nuggets in https://developer.arm.com/documentation/102849/latest/

19:14 pch has quit [Ping timeout: 480 seconds]

19:15 pch has joined #panfrost

20:00 davidlt has quit [Ping timeout: 480 seconds]

20:00 paulk has joined #panfrost

20:11 <greenjustin> alyssa: Thank you!

20:12 <greenjustin> This might be a dumb question, but does the pixel format matter for the texel performance?

20:12 <alyssa> n

20:12 <alyssa> yeah, absolutely

20:12 <alyssa> memory bandwidth if nothing else

20:12 <greenjustin> oh interesting, like would I be able to fetch 4 texels a cycle if the format were e.g. R8?

20:13 <alyssa> ETC2 requires half the memory bandwidth of RGBA8, which requires 1/4 of RGBA32

20:13 * alyssa shrugs

20:13 <alyssa> I don't know the specific uarch details of the Mali texturing pipe

20:13 <alyssa> but in general you can expect better performance with better formats
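
A rough Python sketch of the bandwidth side of this point: it computes the raw bytes fetched per texel for a few uncompressed formats, and the memory traffic implied by one texel per clock. The clock rate is a hypothetical placeholder, not a figure from the chat or from Arm's documentation, and caching is ignored entirely.

```python
# Bits per texel for a few uncompressed formats
# (channels * bits per channel).
BITS_PER_TEXEL = {
    "R8": 8,
    "RGBA8": 32,
    "RGBA32": 128,
}

def bytes_per_second(fmt, texels_per_clock, clock_hz):
    """Raw memory traffic if every texel had to come from memory (no cache)."""
    return BITS_PER_TEXEL[fmt] // 8 * texels_per_clock * clock_hz

# Hypothetical clock rate, chosen only for illustration.
CLOCK_HZ = 800_000_000

for fmt in BITS_PER_TEXEL:
    gb = bytes_per_second(fmt, 1, CLOCK_HZ) / 1e9
    print(f"{fmt}: {gb:.1f} GB/s at 1 texel per clock")
```

At the same texel rate, R8 needs a quarter of the traffic of RGBA8, which is one reason a smaller format can help even if the texture unit itself still issues one texel per clock.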

20:16 <greenjustin> That's interesting. Yeah I figured chroma subsampling for example would greatly improve memory bandwidth, but the docs make it almost sound like the texture fetch unit itself can only process about 1 texel request or w/e per cycle for some reason

20:22 <alyssa> sure

20:22 <alyssa> "for some reason" remember that hardware is deeply pipelined

20:22 <alyssa> uarch design isn't like optimizing software

20:24 <alyssa> (e.g. doing more in one cycle can slow things down overall if the extra propagation delay forces you to lower the clock rate of the whole chip)

20:24 <alyssa> That data sheet is also greatly simplified from the real hw behaviours, of course.

20:25 MajorBiscuit has joined #panfrost

20:26 <greenjustin> Interesting. I guess I assumed it worked more like CPU uarch, where for example you could issue a whole bunch of store instructions and they'd likely run simultaneously on a superscalar arch

20:26 <greenjustin> At least up until you saturate the bus

20:26 <greenjustin> (Or write combiner, or whatever)

20:27 <greenjustin> I guess in my head I would have assumed the texel unit worked the same way. E.g. you can fetch 1 ARGB pixel, or 4 R8 pixels per clock

20:28 * alyssa shrugs

20:29 <alyssa> I'm not a hardware person

20:29 <alyssa> I will say that texturing is /very/ complicated

20:29 <alyssa> it's not just a texel fetch

20:29 <alyssa> s/texturing/sampling/

20:30 <alyssa> and cache friendliness is the single biggest indicator of performance overall...

20:30 <greenjustin> That's fair yeah. I could easily see implementing an optimization like that introducing more propagation delay than it's worth.

20:31 <alyssa> overall R8 will still be more efficient than ARGB8

20:31 <alyssa> whether that's noticeable as lower FPS or just lower power consumption will depend on piles and piles of factors

20:31 <alyssa> the "1 pixel per clock" Arm quotes in the data sheet is just talking about the texturing unit proper, I think

20:32 <alyssa> which is good to know but maybe not super helpful on its own

20:32 <alyssa> A single texture request can take 100s of cycles, if the cache is cold and you're having a real bad day

20:33 <greenjustin> Right, I assume computing actual performance characteristics is tough and depends on a bunch of factors. Some of which might not even be public information?

20:35 <greenjustin> I was just trying to upper bound the throughput for a shader we've been experimenting with. I'm fairly certain it's texture fetch bound, so I've been using "texel fetch per clock" * "number of shaders" * "max clock speed"

20:42 <alyssa> not sure what the second factor is in for

20:43 <alyssa> "texel fetch per clock" * "max clock speed" is the upper bound on the texturing unit performance

20:43 <alyssa> but that doesn't factor in the (potentially much slower) memory bus

20:43 <alyssa> or the caching system

20:44 <alyssa> it's probably best just to benchmark the 'real' behaviour for your workload rather than try to imagine best case scenarios that are impossible to actually achieve, idk

20:56 Intdtti has quit [autokilled: This host violated network policy and has been banned. Mail support@oftc.net if you think this is in error. (2022-08-18 20:56:55)]

20:57 <greenjustin> Oh? Is there not 1 texture fetch unit per shader core?

20:58 _op62 has quit [autokilled: This host violated network policy and has been banned. Mail support@oftc.net if you think this is in error. (2022-08-18 20:58:13)]

21:01 <alyssa> oh, misunderstood you

21:01 <alyssa> sorry
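
The estimate discussed above can be sketched in Python as follows. It takes greenjustin's product of texels per clock, core count and clock rate, then caps it by what the memory bus could deliver, which is the caveat alyssa raises. Every number here (core count, clock, memory bandwidth) is a hypothetical placeholder, and the model ignores caches, so treat the result as a loose ceiling only.

```python
def texel_upper_bound(texels_per_clock_per_core, num_cores, clock_hz):
    """Best case texels/second for the texture units alone."""
    return texels_per_clock_per_core * num_cores * clock_hz

def bandwidth_bound(mem_bytes_per_second, bytes_per_texel):
    """Texels/second the memory bus could supply with no cache hits."""
    return mem_bytes_per_second / bytes_per_texel

# All values below are hypothetical placeholders, not measured figures.
NUM_CORES = 3
CLOCK_HZ = 800_000_000
MEM_BYTES_PER_SEC = 10_000_000_000
BYTES_PER_TEXEL = 4  # e.g. RGBA8

tex = texel_upper_bound(1, NUM_CORES, CLOCK_HZ)
mem = bandwidth_bound(MEM_BYTES_PER_SEC, BYTES_PER_TEXEL)

print(f"texture units: {tex / 1e9:.2f} Gtexel/s")
print(f"memory bus:    {mem / 1e9:.2f} Gtexel/s")
print(f"upper bound:   {min(tex, mem) / 1e9:.2f} Gtexel/s")
```

Whichever of the two limits is smaller sets the ceiling. A real workload will land below it, which is why benchmarking is still the better guide.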

21:08 MajorBiscuit has quit [Quit: WeeChat 3.5]

21:09 hyrc has joined #panfrost

21:19 hyrc has quit []

21:52 rasterman has quit [Quit: Gettin' stinky!]

21:56 alyssa has quit [Quit: leaving]

22:02 <daniels> greenjustin: locality is definitely the most important thing, which I appreciate is difficult for your uses … we also don’t have any great insight into how the DDK schedules fetches, which is critical here as Bifrost has a fun hardware design around dependency scheduling

22:07 <greenjustin> daniels: Yeah I'm almost certain our cache miss rate is terrible and there isn't much we can do about it. What I'm starting to think though is that even with a 100% cache hit rate, the GPU is just not suited for this type of work

22:08 <daniels> greenjustin: splitting the two planes into independent passes may or may not help. the hardware does have dual-issue tex support but iirc that only kicks in when using identical normalised coords, so wouldn’t happen with texelFetch

22:09 megatradeusa[m] has quit [Excess Flood]

22:12 <daniels> (as a thought, you might even be better off doing compute rather than vert/frag and playing with the dimensions there to see what effect that has?)

22:12 <greenjustin> daniels: I think an even bigger problem than the texelFetch might be the 2 pixel *writes* per clock. By my calculations, the maximum throughput just based on that bottleneck alone is ~8.66 gigapixels per second

22:13 <greenjustin> vs our SIMD version is like 11.7 GP/s per core

22:14 <daniels> right, hence the suggestion of either doing one plane at a time rather than aggregating and writing both in a single pass - or going the other way where you can go much wider with compute to try to get maximum locality from tile reads

22:14 megatradeusa[m] has joined #panfrost

22:16 <daniels> I don’t know if doing writes out from compute is going to be better or worse than the frag store unit, but could be worth a shot …

22:17 <greenjustin> oh I think I know what you're saying

22:17 <greenjustin> like tell GL the textures are super wide and 1 pixel tall and use that to better exploit the write combiner?

22:18 <daniels> well. at the moment you’re using a GL draw call

22:18 <daniels> with texture reads and fragment writes

22:20 <daniels> if you use a compute shader (or OpenCL) with ins and outs as direct memory access, rather than executing once per output pixel for the fragment shader, you can define your execution ‘width’ which effectively gangs your invocations together

22:21 <daniels> so you can choose that based on e.g. input tile dimensions, and then play around with whether to use the texture fetch unit or do I/O as mem loads & stores through SSBOs

22:21 <daniels> it may be massively worse, or it might be quite a bit better

22:21 <daniels> but if you’re basically just doing weird blits, you could very well benefit from going to compute shaders rather than the full draw pipeline
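
A small Python sketch of what "defining your execution width" means in practice for a compute dispatch: how many workgroups cover an image for a given local size, and which pixel each invocation handles. The coordinate mapping mirrors how GLSL derives `gl_GlobalInvocationID` (workgroup ID times workgroup size plus local invocation ID). The image size and local sizes are hypothetical examples, not values from the chat.

```python
def workgroup_counts(width, height, local_x, local_y):
    """Number of workgroups needed in x and y to cover the image (rounded up)."""
    return ((width + local_x - 1) // local_x,
            (height + local_y - 1) // local_y)

def invocation_pixel(group, local_id, local_size):
    """Pixel handled by one invocation: group * local_size + local_id."""
    return (group[0] * local_size[0] + local_id[0],
            group[1] * local_size[1] + local_id[1])

# Hypothetical image size and candidate local sizes to experiment with.
WIDTH, HEIGHT = 1920, 1080
for local in [(16, 16), (64, 1), (8, 32)]:
    groups = workgroup_counts(WIDTH, HEIGHT, *local)
    print(f"local size {local}: dispatch {groups[0]} x {groups[1]} workgroups")
```

When the image size is not a multiple of the local size, the last row or column of workgroups runs past the edge, so the shader needs a bounds check before each load or store.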

22:24 <greenjustin> that's an interesting idea, I will have to look into that

22:24 <greenjustin> thank you!

22:27 <daniels> np!

23:20 bluetail has quit [Quit: Ping timeout (120 seconds)]

23:20 bluetail has joined #panfrost