ChanServ changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard + Bifrost + Valhall - Logs https://oftc.irclog.whitequark.org/panfrost - I don't know anything about WSI. That's my story and I'm sticking to it.
floof58 is now known as Guest2062
floof58 has joined #panfrost
Guest2062 has quit [Ping timeout: 480 seconds]
rasterman has quit [Quit: Gettin' stinky!]
kinkinkijkin has quit [Quit: Leaving]
floof58 is now known as Guest2069
floof58 has joined #panfrost
Guest2069 has quit [Ping timeout: 480 seconds]
Leopold__ has quit [Remote host closed the connection]
Leopold has joined #panfrost
atler is now known as Guest2071
atler has joined #panfrost
Guest2071 has quit [Ping timeout: 480 seconds]
Leopold has quit [Remote host closed the connection]
Leopold has joined #panfrost
<alyssa> hacked together support for nir_opt_preamble on v9
<alyssa> aka "pilot shaders"
<alyssa> so far it's a loss on every workload I've tried :|
<HdkR> :|
<alyssa> so either I've screwed something up terribly or this isn't viable.
<alyssa> 1x1x1 compute kernels are not good for gpus.
* alyssa wonders what heuristic Arm uses
<alyssa> some workload perf halved, not even like a 2% hit or something
<alyssa> perf counters are less useful than I would have liked
<alyssa> all of this concerns me because it has serious implications for vulkan performance
<alyssa> especially when you throw stuff like update-after-bind into the equation
<alyssa> though maybe if we just make liberal use of the CSF it'd work out ...
alarumbe has quit [Read error: Connection reset by peer]
<alyssa> on the GL side this rules out being able to do threaded context efficiently, I guess :|
alarumbe has joined #panfrost
cphealy has quit [Quit: Leaving]
<alyssa> looking at something like Dolphin (last weekend's test) ... yeah, end up a bit slower than the GPU copy shader
<alyssa> and even slower than just disabling push of UBOs
<alyssa> but much faster than trying to push UBOs on the CPU
<alyssa> wait but this isn't a release build uh
* alyssa doesn't expect substantially different results
<alyssa> hum
davidlt has joined #panfrost
Leopold has quit [Ping timeout: 480 seconds]
Leopold_ has joined #panfrost
rasterman has joined #panfrost
<rah> alyssa: OK thanks
Dr_Who has quit [Read error: Connection reset by peer]
rasterman has quit [Quit: Gettin' stinky!]
Leopold_ has quit [Ping timeout: 480 seconds]
Leopold_ has joined #panfrost
davidlt has quit [Ping timeout: 480 seconds]
kinkinkijkin has joined #panfrost
davidlt has joined #panfrost
kinkinkijkin has quit [Read error: Connection reset by peer]
pendingchaos has quit [Ping timeout: 480 seconds]
Leopold___ has joined #panfrost
Leopold_ has quit [Ping timeout: 480 seconds]
hanetzer has joined #panfrost
Leopold___ has quit [Ping timeout: 480 seconds]
Leopold_ has joined #panfrost
<alyssa> so, as I was going to bed last night I had a nagging suspicion that the GPU is doing some sort of funny compute vs tiler barriers
<alyssa> so I wrote a stupid-simple patch just now to reorder the job chain I'm emitting for preambles, promoting all the compute kernels to the front of the chain
<alyssa> so now for 3 draws that each have a preamble for their VS, it looks like
<alyssa> compute -> compute -> compute -> IDVS -> IDVS -> IDVS
<alyssa> instead of
<alyssa> compute -> IDVS -> compute -> IDVS -> compute -> IDVS
<alyssa> this more than doubles perf on draw call heavy tests
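The reordering described above can be sketched as follows. This is an illustrative toy, not the real panfrost job-chain code: instead of interleaving each draw's preamble compute job with its IDVS job, all compute jobs are hoisted to the front of the chain, so any compute-vs-tiler barrier cost is paid once rather than per draw. The job labels are invented for the example.

```python
# Hypothetical sketch of hoisting preamble compute jobs ahead of IDVS jobs.
# Real panfrost job chains are hardware descriptors, not string lists.
def reorder_preambles(chain):
    compute = [job for job in chain if job.startswith("compute")]
    idvs = [job for job in chain if not job.startswith("compute")]
    # compute -> compute -> ... -> IDVS -> IDVS -> ...
    return compute + idvs
```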
<alyssa> still a loss compared to no preambles on e.g. webgl aquarium at 5000 fish
<alyssa> but definitely getting there
<alyssa> oh but we're falling down the too many draws path of doom
<alyssa> ok, yeah, bumping the limit avoids an extra flush in there
<alyssa> so now aquarium at 5000 fish with preambles is up to ~37fps and without preambles was at ~41fps
<alyssa> (this is all at 1080p on an MT8192 chromebook in Chromium with the GLES renderer on top of panfrost/next which has some optimizations I added the past 2 weeks that haven't made their way upstream yet)
<alyssa> (aquarium will be slower on your panfrost)
<alyssa> reusing resource tables between preamble + main shader gets me an extra fps back
<alyssa> (the JS1_WAIT_READ counter lit up -- descriptor reads were a hot spot)
<alyssa> Non-fragment descriptor read cycles (js1_wait_read): 160182643
<alyssa> still lighting up as a hot spot.
kinkinkijkin has joined #panfrost
falk689_ has joined #panfrost
<alyssa> 2770, 1773, 1505, 82 --> 2719, 1316, 1499, 82
<alyssa> that's still a loss across the board :|
<alyssa> glmark2 -bideas down 25% is a big yikes
falk689 has quit [Ping timeout: 480 seconds]
<alyssa> this is despite -bideas having some nontrivial preambles (the sort you might hope would benefit)
<alyssa> hey, still progress.
<alyssa> will revisit this some other time
pendingchaos has joined #panfrost
nlhowell has joined #panfrost
nlhowell has quit [Ping timeout: 480 seconds]
<hays> cool
davidlt has quit [Ping timeout: 480 seconds]
falk689_ has quit [Remote host closed the connection]
falk689 has joined #panfrost
falk689 has quit [Remote host closed the connection]
falk689 has joined #panfrost
<robclark> webgl aquarium is really just a uniform upload benchmark
Leopold_ has quit [Ping timeout: 480 seconds]
Leopold_ has joined #panfrost
falk689 has quit [Remote host closed the connection]
falk689 has joined #panfrost
Leopold_ has quit [Remote host closed the connection]
Leopold_ has joined #panfrost
mensi has joined #panfrost
<alyssa> robclark: yep.
<alyssa> which means if I'm getting a 10% hit on it when optimizing my uniform upload code I didn't do so hot :-p
<robclark> I don't think preambles help webgl aquarium.. and also webgl aquarium is not a representative workload (aka benchmark) ;-)
<robclark> maybe it matters for tuning for threshold of when to bother w/ preamble
<robclark> ie. you shouldn't for aquarium
<alyssa> robclark: the actual motivation here is that Mali only lets us push a single contiguous range of GPU memory
Leopold_ has quit [Ping timeout: 480 seconds]
<alyssa> which means there has to be a gather operation to combine uniforms/driver sysvals/UBOs/whatever into a contiguous buffer
<alyssa> currently that happens on the CPU
<alyssa> which means perf nosedives when UBOs are used heavily because of all the readback from uncached (wc) UBO memory
<alyssa> and also complicates Vulkan because stuff like txs is currently handled as magic indexed sysvals
<alyssa> that's ugly in gles but workable, it will get a lot worse when descriptor sets are added in, and even worse for bindless
<alyssa> The siren song of preambles is offering a neat solution to all of this and more
<alyssa> Pushing UBOs without the WC readback pain will require a compute kernel on the GPU to do the gather operation (unless you only read a single contiguous range from a single UBO with no driver sysvals)
<alyssa> If you're going to need a compute kernel for the copy, might as well go all in and let nir_opt_preamble make the decisions of what to push (it will make better decisions than our dumb as rocks backend pass)
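The CPU-side gather being replaced can be sketched like this. It is a hedged illustration, not panfrost's actual code: Mali only pushes one contiguous range, so uniforms, driver sysvals, and UBO words must first be copied into a single staging buffer. Indexing into `sources` here stands in for the costly readback from write-combined UBO memory that moving the gather into a GPU compute kernel would avoid. The range/source representation is invented for the example.

```python
# Illustrative CPU gather: assemble one contiguous push buffer from
# scattered sources (uniforms, sysvals, UBO ranges). On real hardware the
# UBO reads here are the painful uncached (WC) readbacks.
def gather_push_constants(ranges, sources):
    """ranges: list of (source_name, offset, size); sources: name -> bytes."""
    out = bytearray()
    for name, offset, size in ranges:
        out += sources[name][offset:offset + size]
    return bytes(out)
```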
<mensi> hi! Is this conversation from a month ago: https://oftc.irclog.whitequark.org/panfrost/2022-12-20#31734646 still accurate and rusticl is not yet expected to work with panfrost? I managed to compile and run everything from git and got clinfo to use rusticl but it listed 0 devices
<alyssa> correct
<alyssa> and then we can lower txs to "load descriptor from memory and extract the bits" and still get the answer in a uniform so we get a single txs implementation that doesn't care what binding model is used and doesn't require the cpu to do stupid things
<alyssa> and as a bonus we *also* get uniform-on-uniform arithmetic optimized
Leopold_ has joined #panfrost
<alyssa> (the ISA assumes that there's no uniform-on-uniform arithmetic, requiring extra moves for that case, there's no scalar unit or anything like that)
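The txs lowering mentioned above ("load descriptor from memory and extract the bits") could look roughly like this. The bitfield layout is invented for illustration and is not Mali's actual texture descriptor format; the point is only that one extraction routine serves every binding model once the descriptor is readable from memory.

```python
# Hypothetical txs lowering: pull the texture size out of a descriptor word.
# The field layout (16-bit width-minus-one, 16-bit height-minus-one) is an
# assumption for the sketch, not the real hardware encoding.
def lower_txs(descriptor_word):
    width = (descriptor_word & 0xFFFF) + 1
    height = ((descriptor_word >> 16) & 0xFFFF) + 1
    return width, height
```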
<alyssa> Unfortunately the extra GPU draw overhead from all the compute kernels is significant :|
<alyssa> especially for gles2 apps where everything is on the CPU happy path and there's no WC readback
<alyssa> (but that means I can't wire up threaded_context, and I'm not sure I can afford not to wire up tc..)
<alyssa> my last ditch idea is to use nir_opt_preamble to make the decisions of WHAT to push, but then optimize out "store_preamble(load_uniform)" statements from the preamble shader (running them on the CPU with our traditional CPU gather code)
<alyssa> which eliminates the compute kernel except when the app actually does uniform-on-uniform arithmetic or reads from UBOs or does txs if we go that lowering route
<alyssa> which would probably avoid regressing gles2 apps
<alyssa> (like aquarium)
<alyssa> of course, that's no good if we flip on tc, because then EVERYTHING is a UBO
<alyssa> and the only time that would let us save the compute kernel is if the app doesn't use any GL uniforms or UBOs
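The "last ditch idea" above can be sketched as a filter over the preamble shader: strip out trivial `store_preamble(load_uniform)` statements and service them with the traditional CPU gather, keeping the GPU compute kernel only when nontrivial work (uniform arithmetic, UBO reads, txs) remains. The instruction representation here is invented; real NIR instructions are not tuples.

```python
# Hedged sketch: partition preamble instructions into CPU-servable copies
# and work that still needs a GPU compute kernel.
def split_preamble(instrs):
    """instrs: list of (op, operands) tuples standing in for NIR."""
    cpu, gpu = [], []
    for op, operands in instrs:
        # A plain uniform -> preamble copy can run on the CPU gather path.
        if op == "store_preamble" and operands[0] == "load_uniform":
            cpu.append((op, operands))
        else:
            gpu.append((op, operands))
    needs_compute_kernel = bool(gpu)
    return cpu, gpu, needs_compute_kernel
```

Under this split, a gles2-style app whose preamble is nothing but uniform copies would skip the compute kernel entirely, which is what would avoid regressing workloads like aquarium.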
<HdkR> Time to switch everything to SSBOs
<alyssa> blink
<alyssa> nir_opt_preamble can probably push readonly SSBOs ;)
<alyssa> I also suspect that the preamble overhead is a lot lower on v10
<alyssa> and if we only support v10 (and newer) for vulkan, then .. I guess we use preambles aggressively on vulkan and leave gl with the traditional path
<alyssa> that complicates the ABI but meh, trying to do preambles everywhere means writing greenfield Midgard compiler code so maybe not the end of the world :~P
falk689 has quit [Remote host closed the connection]
falk689 has joined #panfrost
<robclark> alyssa: aquarium is, iirc, roughly: for (i in 0 .. nfish) { upload vec4 color; glDraw(); }
<robclark> so only reason to look at it for opt is to make sure your heuristics don't pick a dumb path for a dumb case ;-)
<alyssa> fair enough :p
<alyssa> robclark: what's frustrating is that Arm could have avoided this whole mess with a small hw change
<alyssa> and yet
<robclark> repeat after me: glxgears^Daquarium is not a benchmark ;-)
<alyssa> heh
<alyssa> on AGX the driver gets to map as many ranges of GPU memory as it wishes to uniform registers
<alyssa> which means we can push whatever uniforms/UBOs/sysvals we want with no WC readback and no shenanigans
mensi has quit [Remote host closed the connection]
<alyssa> and apple *also* threw in hw support for preambles because they're nice like that
<robclark> from time to time whatever various $vendor gets in a huff about aquarium.. but it is *soo* unrepresentative of reasonable gl workloads
<robclark> it's basically an example of how not to do gl
<robclark> it's basically glxgears with fish
<robclark> (and I'm vegetarian so idc)
* robclark has some hate for dumb benchmarks ;-)
<alyssa> lololol
<alyssa> well, this mess started (in part) with Dolphin
<alyssa> which is a very different example of how not to do gl :~P
<HdkR> It's a whole collection of real games though :P
<alyssa> yes, that's the problem :~)