ChanServ changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard + Bifrost + Valhall - Logs https://oftc.irclog.whitequark.org/panfrost - I don't know anything about WSI. That's my story and I'm sticking to it.
floof58 is now known as Guest2062
floof58 has joined #panfrost
Guest2062 has quit [Ping timeout: 480 seconds]
rasterman has quit [Quit: Gettin' stinky!]
kinkinkijkin has quit [Quit: Leaving]
floof58 is now known as Guest2069
floof58 has joined #panfrost
Guest2069 has quit [Ping timeout: 480 seconds]
Leopold__ has quit [Remote host closed the connection]
Leopold has joined #panfrost
atler is now known as Guest2071
atler has joined #panfrost
Guest2071 has quit [Ping timeout: 480 seconds]
Leopold has quit [Remote host closed the connection]
Leopold has joined #panfrost
<alyssa> hacked together support for nir_opt_preamble on v9
<alyssa> aka "pilot shaders"
<alyssa> so far it's a loss on every workload I've tried :|
<HdkR> :|
<alyssa> so either I've screwed something up terribly or this isn't viable.
<alyssa> 1x1x1 compute kernels are not good for gpus.
* alyssa wonders what heuristic Arm uses
<alyssa> some workload perf halved, not even like a 2% hit or something
<alyssa> perf counters are less useful than I would have liked
<alyssa> all of this concerns me because it has serious implications for vulkan performance
<alyssa> especially when you throw stuff like update-after-bind into the equation
<alyssa> though maybe if we just make liberal use of the CSF it'd work out ...
alarumbe has quit [Read error: Connection reset by peer]
<alyssa> on the GL side this rules out being able to do threaded context efficiently, I guess :|
alarumbe has joined #panfrost
cphealy has quit [Quit: Leaving]
<alyssa> looking at something like Dolphin (last weekend's test) ... yeah, end up a bit slower than the GPU copy shader
<alyssa> and even slower than just disabling push of UBOs
<alyssa> but much faster than trying to push UBOs on the CPU
<alyssa> wait but this isn't a release build uh
* alyssa doesn't expect substantially different results
<alyssa> hum
davidlt has joined #panfrost
Leopold has quit [Ping timeout: 480 seconds]
Leopold_ has joined #panfrost
rasterman has joined #panfrost
<rah> alyssa: OK thanks
Dr_Who has quit [Read error: Connection reset by peer]
rasterman has quit [Quit: Gettin' stinky!]
Leopold_ has quit [Ping timeout: 480 seconds]
Leopold_ has joined #panfrost
davidlt has quit [Ping timeout: 480 seconds]
kinkinkijkin has joined #panfrost
davidlt has joined #panfrost
kinkinkijkin has quit [Read error: Connection reset by peer]
pendingchaos has quit [Ping timeout: 480 seconds]
Leopold___ has joined #panfrost
Leopold_ has quit [Ping timeout: 480 seconds]
hanetzer has joined #panfrost
Leopold___ has quit [Ping timeout: 480 seconds]
Leopold_ has joined #panfrost
<alyssa> so, as I was going to bed last night I had a nagging suspicion that the GPU is doing some sort of funny compute vs tiler barriers
<alyssa> so I wrote a stupid-simple patch just now to reorder the job chain I'm emitting for preambles, promoting all the compute kernels to the front of the chain
<alyssa> so now for 3 draws that each have a preamble for their VS, it looks like
<alyssa> compute -> compute -> compute -> IDVS -> IDVS -> IDVS
<alyssa> instead of
<alyssa> compute -> IDVS -> compute -> IDVS -> compute -> IDVS
<alyssa> this more than doubles perf on draw call heavy tests
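The reordering described above can be sketched as follows. This is an illustrative toy, not the real panfrost job-chain code: instead of interleaving each draw's preamble compute job with its IDVS job, all compute jobs are hoisted to the front of the chain, so any compute-vs-tiler barrier cost is paid once rather than per draw. The job labels are invented for the example.

```python
# Hypothetical sketch of hoisting preamble compute jobs ahead of IDVS jobs.
# Real panfrost job chains are hardware descriptors, not string lists.
def reorder_preambles(chain):
    compute = [job for job in chain if job.startswith("compute")]
    idvs = [job for job in chain if not job.startswith("compute")]
    # compute -> compute -> ... -> IDVS -> IDVS -> ...
    return compute + idvs
```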
<alyssa> still a loss compared to no preambles on e.g. webgl aquarium at 5000 fish
<alyssa> but definitely getting there
<alyssa> oh but we're falling down the too many draws path of doom
<alyssa> ok, yeah, bumping the limit avoids an extra flush in there
<alyssa> so now aquarium at 5000 fish with preambles is up to ~37fps and without preambles was at ~41fps
<alyssa> (this is all at 1080p on an MT8192 chromebook in Chromium with the GLES renderer on top of panfrost/next which has some optimizations I added the past 2 weeks that haven't made their way upstream yet)
<alyssa> (aquarium will be slower on your panfrost)
<alyssa> reusing resource tables between preamble + main shader gets me an extra fps back
<alyssa> (the JS1_WAIT_READ counter lit up -- descriptor reads were a hot spot)
<alyssa> Non-fragment descriptor read cycles (js1_wait_read): 160182643
<alyssa> still lighting up as a hot spot.
kinkinkijkin has joined #panfrost
falk689_ has joined #panfrost
<alyssa> 2770, 1773, 1505, 82 --> 2719, 1316, 1499, 82
<alyssa> that's still a loss across the board :|
<alyssa> glmark2 -bideas down 25% is a big yikes
falk689 has quit [Ping timeout: 480 seconds]
<alyssa> this is despite -bideas having some nontrivial preambles (the sort you might hope would benefit)
<alyssa> hey, still progress.
<alyssa> will revisit this some other time
pendingchaos has joined #panfrost
nlhowell has joined #panfrost
nlhowell has quit [Ping timeout: 480 seconds]
<hays> cool
davidlt has quit [Ping timeout: 480 seconds]
falk689_ has quit [Remote host closed the connection]
falk689 has joined #panfrost
falk689 has quit [Remote host closed the connection]
falk689 has joined #panfrost
<robclark> webgl aquarium is really just a uniform upload benchmark
Leopold_ has quit [Ping timeout: 480 seconds]
Leopold_ has joined #panfrost
falk689 has quit [Remote host closed the connection]
falk689 has joined #panfrost
Leopold_ has quit [Remote host closed the connection]
Leopold_ has joined #panfrost
mensi has joined #panfrost
<alyssa> robclark: yep.
<alyssa> which means if I'm getting a 10% hit on it when optimizing my uniform upload code I didn't do so hot :-p
<robclark> I don't think preambles help webgl aquarium.. and also webgl aquarium is not a representative workload (aka benchmark) ;-)
<robclark> maybe it matters for tuning for threshold of when to bother w/ preamble
<robclark> ie. you shouldn't for aquarium
<alyssa> robclark: the actual motivation here is that Mali only lets us push a single contiguous range of GPU memory
Leopold_ has quit [Ping timeout: 480 seconds]
<alyssa> which means there has to be a gather operation to combine uniforms/driver sysvals/UBOs/whatever into a contiguous buffer
<alyssa> currently that happens on the CPU
<alyssa> which means perf nosedives when UBOs are used heavily because of all the readback from uncached (wc) UBO memory
<alyssa> and also complicates Vulkan because stuff like txs is currently handled as magic indexed sysvals
<alyssa> that's ugly in gles but workable, it will get a lot worse when descriptor sets are added in, and even worse for bindless
<alyssa> The siren song of preambles is offering a neat solution to all of this and more
<alyssa> Pushing UBOs without the WC readback pain will require a compute kernel on the GPU to do the gather operation (unless you only read a single contiguous range from a single UBO with no driver sysvals)
<alyssa> If you're going to need a compute kernel for the copy, might as well go all in and let nir_opt_preamble make the decisions of what to push (it will make better decisions than our dumb as rocks backend pass)
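The CPU-side gather being replaced can be sketched like this. It is a hedged illustration, not panfrost's actual code: Mali only pushes one contiguous range, so uniforms, driver sysvals, and UBO words must first be copied into a single staging buffer. Indexing into `sources` here stands in for the costly readback from write-combined UBO memory that moving the gather into a GPU compute kernel would avoid. The range/source representation is invented for the example.

```python
# Illustrative CPU gather: assemble one contiguous push buffer from
# scattered sources (uniforms, sysvals, UBO ranges). On real hardware the
# UBO reads here are the painful uncached (WC) readbacks.
def gather_push_constants(ranges, sources):
    """ranges: list of (source_name, offset, size); sources: name -> bytes."""
    out = bytearray()
    for name, offset, size in ranges:
        out += sources[name][offset:offset + size]
    return bytes(out)
```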
<mensi> hi! Is this conversation from a month ago: https://oftc.irclog.whitequark.org/panfrost/2022-12-20#31734646 still accurate and rusticl is not yet expected to work with panfrost? I managed to compile and run everything from git and got clinfo to use rusticl but it listed 0 devices
<alyssa> correct
<alyssa> and then we can lower txs to "load descriptor from memory and extract the bits" and still get the answer in a uniform so we get a single txs implementation that doesn't care what binding model is used and doesn't require the cpu to do stupid things
<alyssa> and as a bonus we *also* get uniform-on-uniform arithmetic optimized
Leopold_ has joined #panfrost
<alyssa> (the ISA assumes that there's no uniform-on-uniform arithmetic, requiring extra moves for that case, there's no scalar unit or anything like that)
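The txs lowering mentioned above ("load descriptor from memory and extract the bits") could look roughly like this. The bitfield layout is invented for illustration and is not Mali's actual texture descriptor format; the point is only that one extraction routine serves every binding model once the descriptor is readable from memory.

```python
# Hypothetical txs lowering: pull the texture size out of a descriptor word.
# The field layout (16-bit width-minus-one, 16-bit height-minus-one) is an
# assumption for the sketch, not the real hardware encoding.
def lower_txs(descriptor_word):
    width = (descriptor_word & 0xFFFF) + 1
    height = ((descriptor_word >> 16) & 0xFFFF) + 1
    return width, height
```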
<alyssa> Unfortunately the extra GPU draw overhead from all the compute kernels is significant :|
<alyssa> especially for gles2 apps where everything is on the CPU happy path and there's no WC readback
<alyssa> (but that means I can't wire up threaded_context, and I'm not sure I can afford not to wire up tc..)
<alyssa> my last ditch idea is to use nir_opt_preamble to make the decisions of WHAT to push, but then optimize out "store_preamble(load_uniform)" statements from the preamble shader (running them on the CPU with our traditional CPU gather code)
<alyssa> which eliminates the compute kernel except when the app actually does uniform-on-uniform arithmetic or reads from UBOs or does txs if we go that lowering route
<alyssa> which would probably avoid regressing gles2 apps
<alyssa> (like aquarium)
<alyssa> of course, that's no good if we flip on tc, because then EVERYTHING is a UBO
<alyssa> and the only time that would let us save the compute kernel is if the app doesn't use any GL uniforms or UBOs
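The "last ditch idea" above can be sketched as a filter over the preamble shader: strip out trivial `store_preamble(load_uniform)` statements and service them with the traditional CPU gather, keeping the GPU compute kernel only when nontrivial work (uniform arithmetic, UBO reads, txs) remains. The instruction representation here is invented; real NIR instructions are not tuples.

```python
# Hedged sketch: partition preamble instructions into CPU-servable copies
# and work that still needs a GPU compute kernel.
def split_preamble(instrs):
    """instrs: list of (op, operands) tuples standing in for NIR."""
    cpu, gpu = [], []
    for op, operands in instrs:
        # A plain uniform -> preamble copy can run on the CPU gather path.
        if op == "store_preamble" and operands[0] == "load_uniform":
            cpu.append((op, operands))
        else:
            gpu.append((op, operands))
    needs_compute_kernel = bool(gpu)
    return cpu, gpu, needs_compute_kernel
```

Under this split, a gles2-style app whose preamble is nothing but uniform copies would skip the compute kernel entirely, which is what would avoid regressing workloads like aquarium.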
<HdkR> Time to switch everything to SSBOs
<alyssa> blink
<alyssa> nir_opt_preamble can probably push readonly SSBOs ;)
<alyssa> I also suspect that the preamble overhead is a lot lower on v10
<alyssa> and if we only support v10 (and newer) for vulkan, then .. I guess we use preambles aggressively on vulkan and leave gl with the traditional path
<alyssa> that complicates the ABI but meh, trying to do preambles everywhere means writing greenfield Midgard compiler code so maybe not the end of the world :~P
falk689 has quit [Remote host closed the connection]
falk689 has joined #panfrost
<robclark> alyssa: aquarium is, iirc, roughly: for (i in 0 .. nfish) { upload vec4 color; glDraw(); }
<robclark> so only reason to look at it for opt is to make sure your heuristics don't pick a dumb path for a dumb case ;-)
<alyssa> fair enough :p
<alyssa> robclark: what's frustrating is that Arm could have avoided this whole mess with a small hw change
<alyssa> and yet
<robclark> repeat after me: glxgears^Daquarium is not a benchmark ;-)
<alyssa> heh
<alyssa> on AGX the driver gets to map as many ranges of GPU memory as it wishes to uniform registers
<alyssa> which means we can push whatever uniforms/UBOs/sysvals we want with no WC readback and no shenanigans
mensi has quit [Remote host closed the connection]
<alyssa> and apple *also* threw in hw support for preambles because they're nice like that
<robclark> from time to time whatever various $vendor gets in a huff about aquarium.. but it is *soo* unrepresentative of reasonable gl workloads
<robclark> it's basically an example of how not to do gl
<robclark> it's basically glxgears with fish
<robclark> (and I'm vegetarian so idc)
* robclark has some hate for dumb benchmarks ;-)
<alyssa> lololol
<alyssa> well, this mess started (in part) with Dolphin
<alyssa> which is a very different example of how not to do gl :~P
<HdkR> It's a whole collection of real games though :P
<alyssa> yes, that's the problem :~)