ChanServ changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard + Bifrost + Valhall - Logs https://oftc.irclog.whitequark.org/panfrost - I don't know anything about WSI. That's my story and I'm sticking to it.
floof58 is now known as Guest2062
floof58 has joined #panfrost
Guest2062 has quit [Ping timeout: 480 seconds]
rasterman has quit [Quit: Gettin' stinky!]
kinkinkijkin has quit [Quit: Leaving]
floof58 is now known as Guest2069
floof58 has joined #panfrost
Guest2069 has quit [Ping timeout: 480 seconds]
Leopold__ has quit [Remote host closed the connection]
Leopold has joined #panfrost
atler is now known as Guest2071
atler has joined #panfrost
Guest2071 has quit [Ping timeout: 480 seconds]
Leopold has quit [Remote host closed the connection]
Leopold has joined #panfrost
<alyssa> hacked together support for nir_opt_preamble on v9
<alyssa> aka "pilot shaders"
<alyssa> so far it's a loss on every workload I've tried :|
<HdkR> :|
<alyssa> so either I've screwed something up terribly or this isn't viable.
<alyssa> 1x1x1 compute kernels are not good for gpus.
* alyssa wonders what heuristic Arm uses
<alyssa> some workload perf halved, not even like a 2% hit or something
<alyssa> perf counters are less useful than I would have liked
<alyssa> all of this concerns me because it has serious implications for vulkan performance
<alyssa> especially when you throw stuff like update-after-bind into the equation
<alyssa> though maybe if we just make liberal use of the CSF it'd work out ...
alarumbe has quit [Read error: Connection reset by peer]
<alyssa> on the GL side this rules out being able to do threaded context efficiently, I guess :|
alarumbe has joined #panfrost
cphealy has quit [Quit: Leaving]
<alyssa> looking at something like Dolphin (last weekend's test) ... yeah, end up a bit slower than the GPU copy shader
<alyssa> and even slower than just disabling push of UBOs
<alyssa> but much faster than trying to push UBOs on the CPU
<alyssa> wait but this isn't a release build uh
* alyssa doesn't expect substantially different results
<alyssa> hum
davidlt has joined #panfrost
Leopold has quit [Ping timeout: 480 seconds]
Leopold_ has joined #panfrost
rasterman has joined #panfrost
<rah> alyssa: OK thanks
Dr_Who has quit [Read error: Connection reset by peer]
rasterman has quit [Quit: Gettin' stinky!]
Leopold_ has quit [Ping timeout: 480 seconds]
Leopold_ has joined #panfrost
davidlt has quit [Ping timeout: 480 seconds]
kinkinkijkin has joined #panfrost
davidlt has joined #panfrost
kinkinkijkin has quit [Read error: Connection reset by peer]
pendingchaos has quit [Ping timeout: 480 seconds]
Leopold___ has joined #panfrost
Leopold_ has quit [Ping timeout: 480 seconds]
hanetzer has joined #panfrost
Leopold___ has quit [Ping timeout: 480 seconds]
Leopold_ has joined #panfrost
<alyssa> so, as I was going to bed last night I had a nagging suspicion that the GPU is doing some sort of funny compute vs tiler barriers
<alyssa> so I wrote a stupid-simple patch just now to reorder the job chain I'm emitting for preambles, promoting all the compute kernels to the front of the chain
<alyssa> so now for 3 draws that each have a preamble for their VS, it looks like
<alyssa> this more than doubles perf on draw call heavy tests
<alyssa> still a loss compared to no preambles on e.g. webgl aquarium at 5000 fish
<alyssa> but definitely getting there
<alyssa> oh but we're falling down the too many draws path of doom
<alyssa> ok, yeah, bumping the limit avoids an extra flush in there
<alyssa> so now aquarium at 5000 fish with preambles is up to ~37fps and without preambles was at ~41fps
<alyssa> (this is all at 1080p on an MT8192 chromebook in Chromium with the GLES renderer, on top of panfrost/next which has some optimizations I added the past 2 weeks that haven't made their way upstream yet)
<alyssa> (aquarium will be slower on your panfrost)
<alyssa> reusing resource tables between preamble + main shader gets me an extra fps back
<alyssa> (the JS1_WAIT_READ counter lit up -- descriptor reads were a hot spot)
<alyssa> this is despite -bideas having some nontrivial preambles (the sort you might hope would benefit)
<alyssa> hey, still progress.
<alyssa> will revisit this some other time
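The reordering described above — promoting the preamble compute jobs ahead of all tiler jobs so the GPU isn't serialized on a compute/tiler barrier per draw — amounts to a stable partition of the job chain. A minimal sketch, with hypothetical job types (the real Panfrost job-chain emission is much more involved):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical job kinds -- real Mali job descriptors are more complex. */
enum job_kind { JOB_COMPUTE, JOB_TILER };

struct job {
    enum job_kind kind;
    int id;
};

/* Stable-partition the chain: every compute (preamble) job runs before any
 * tiler job, preserving relative order within each class so each draw's
 * preamble still precedes that draw's tiler work. */
static void promote_compute_jobs(struct job *jobs, size_t n)
{
    struct job tmp[64]; /* sketch only: assume short chains */
    size_t out = 0;

    for (size_t i = 0; i < n; ++i)
        if (jobs[i].kind == JOB_COMPUTE)
            tmp[out++] = jobs[i];
    for (size_t i = 0; i < n; ++i)
        if (jobs[i].kind == JOB_TILER)
            tmp[out++] = jobs[i];
    for (size_t i = 0; i < n; ++i)
        jobs[i] = tmp[i];
}
```

With this, a chain like compute/tiler/compute/tiler collapses into one compute-to-tiler transition instead of one barrier per draw.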
pendingchaos has joined #panfrost
nlhowell has joined #panfrost
nlhowell has quit [Ping timeout: 480 seconds]
<hays> cool
davidlt has quit [Ping timeout: 480 seconds]
falk689_ has quit [Remote host closed the connection]
falk689 has joined #panfrost
falk689 has quit [Remote host closed the connection]
falk689 has joined #panfrost
<robclark> webgl aquarium is really just a uniform upload benchmark
Leopold_ has quit [Ping timeout: 480 seconds]
Leopold_ has joined #panfrost
falk689 has quit [Remote host closed the connection]
falk689 has joined #panfrost
Leopold_ has quit [Remote host closed the connection]
Leopold_ has joined #panfrost
mensi has joined #panfrost
<alyssa> robclark: yep.
<alyssa> which means if I'm getting a 10% hit on it when optimizing my uniform upload code, I didn't do so hot :-p
<robclark> I don't think preambles help webgl aquarium.. and also webgl aquarium is not a representative workload (aka benchmark) ;-)
<robclark> maybe it matters for tuning the threshold of when to bother w/ a preamble
<robclark> ie. you shouldn't for aquarium
<alyssa> robclark: the actual motivation here is that Mali only lets us push a single contiguous range of GPU memory
Leopold_ has quit [Ping timeout: 480 seconds]
<alyssa> which means there has to be a gather operation to combine uniforms/driver sysvals/UBOs/whatever into a contiguous buffer
<alyssa> currently that happens on the CPU
<alyssa> which means perf nosedives when UBOs are used heavily because of all the readback from uncached (wc) UBO memory
<alyssa> and also complicates Vulkan because stuff like txs is currently handled as magic indexed sysvals
<alyssa> that's ugly in gles but workable, it will get a lot worse when descriptor sets are added in, and even worse for bindless
<alyssa> The siren song of preambles is offering a neat solution to all of this and more
<alyssa> Pushing UBOs without the WC readback pain will require a compute kernel on the GPU to do the gather operation (unless you only read a single contiguous range from a single UBO with no driver sysvals)
<alyssa> If you're going to need a compute kernel for the copy, might as well go all in and let nir_opt_preamble make the decisions of what to push (it will make better decisions than our dumb as rocks backend pass)
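The gather operation being discussed — because Mali pushes only one contiguous range, scattered uniforms/sysvals/UBO ranges must be copied into a single buffer first — can be sketched roughly as below. Names and layout are hypothetical, not the actual Panfrost code; the point is that each `memcpy` source may be write-combined UBO memory, which is where the CPU-side readback cost comes from:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One source range to push: `size` bytes at `src + offset`.
 * Hypothetical struct, for illustration only. */
struct push_range {
    const uint8_t *src;   /* GL uniforms, a UBO, or driver sysvals */
    uint32_t offset;
    uint32_t size;
};

/* Gather the scattered constants into the single contiguous buffer that the
 * hardware can push. Done on the CPU, every read from a UBO-backed range is
 * a readback from uncached (WC) memory -- the slow path in question. */
static uint32_t gather_push_constants(uint8_t *dst,
                                      const struct push_range *ranges,
                                      unsigned count)
{
    uint32_t cursor = 0;
    for (unsigned i = 0; i < count; ++i) {
        memcpy(dst + cursor, ranges[i].src + ranges[i].offset, ranges[i].size);
        cursor += ranges[i].size;
    }
    return cursor; /* total bytes in the pushed range */
}
```

Moving this same loop into a GPU compute kernel is exactly the trade-off under discussion: it avoids the WC readback but adds per-draw compute-job overhead.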
<mensi> hi! Is this conversation from a month ago: https://oftc.irclog.whitequark.org/panfrost/2022-12-20#31734646 still accurate and rusticl is not yet expected to work with panfrost? I managed to compile and run everything from git and got clinfo to use rusticl but it listed 0 devices
<alyssa> correct
<alyssa> and then we can lower txs to "load descriptor from memory and extract the bits" and still get the answer in a uniform, so we get a single txs implementation that doesn't care what binding model is used and doesn't require the cpu to do stupid things
<alyssa> and as a bonus we *also* get uniform-on-uniform arithmetic optimized
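The txs lowering described — "load descriptor from memory and extract the bits" — boils down to bitfield extraction from the texture descriptor once the preamble has loaded it. A toy sketch with an invented field layout (real Mali descriptors pack these fields differently; this only illustrates the shape of the lowering):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical descriptor word: width-1 in bits [15:0], height-1 in bits
 * [31:16]. NOT the real Mali layout -- illustration only. */
static inline uint32_t tex_width(uint64_t desc)
{
    return (uint32_t)(desc & 0xffff) + 1;
}

static inline uint32_t tex_height(uint64_t desc)
{
    return (uint32_t)((desc >> 16) & 0xffff) + 1;
}
```

Because the extraction runs in the preamble, the result lands in a uniform regardless of binding model, which is the "single txs implementation" being aimed for.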
Leopold_ has joined #panfrost
<alyssa> (the ISA assumes that there's no uniform-on-uniform arithmetic, requiring extra moves for that case; there's no scalar unit or anything like that)
<alyssa> Unfortunately the extra GPU draw overhead from all the compute kernels is significant :|
<alyssa> especially for gles2 apps where everything is on the CPU happy path and there's no WC readback
<alyssa> (but that means I can't wire up threaded_context, and I'm not sure I can afford not to wire up tc..)
<alyssa> my last-ditch idea is to use nir_opt_preamble to make the decisions of WHAT to push, but then optimize out "store_preamble(load_uniform)" statements from the preamble shader (running them on the CPU with our traditional CPU gather code)
<alyssa> which eliminates the compute kernel except when the app actually does uniform-on-uniform arithmetic or reads from UBOs or does txs, if we go that lowering route
<alyssa> which would probably avoid regressing gles2 apps
<alyssa> (like aquarium)
<alyssa> of course, that's no good if we flip on tc, because then EVERYTHING is a UBO
<alyssa> and the only time that would let us save the compute kernel is if the app doesn't use any GL uniforms or UBOs
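The last-ditch idea above is a predicate over the preamble: if every `store_preamble` merely forwards a plain `load_uniform`, the whole preamble can be folded into the existing CPU gather path and the GPU compute kernel skipped. A toy sketch over an invented two-opcode IR (the real pass would walk NIR instructions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy IR for the sketch: each preamble store either forwards a plain
 * uniform load, or depends on real ALU work / UBO reads / txs. */
enum src_kind { SRC_LOAD_UNIFORM, SRC_OTHER };

struct preamble_store {
    enum src_kind src;
};

/* True when the preamble does nothing a CPU-side gather can't do, so the
 * compute kernel can be optimized out entirely for this shader. */
static bool preamble_runs_on_cpu(const struct preamble_store *stores, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        if (stores[i].src != SRC_LOAD_UNIFORM)
            return false;
    return true;
}
```

As the log notes, once threaded_context turns everything into UBO reads, this predicate almost never holds, which is the catch.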
<HdkR> Time to switch everything to SSBOs
<alyssa> blink
<alyssa> nir_opt_preamble can probably push readonly SSBOs ;)
<alyssa> I also suspect that the preamble overhead is a lot lower on v10
<alyssa> and if we only support v10 (and newer) for vulkan, then .. I guess we use preambles aggressively on vulkan and leave gl with the traditional path
<alyssa> that complicates the ABI but meh, trying to do preambles everywhere means writing greenfield Midgard compiler code so maybe not the end of the world :~P
falk689 has quit [Remote host closed the connection]
falk689 has joined #panfrost
<robclark> alyssa: aquarium is, iirc, roughly: for (i in 0 .. nfish) { upload vec4 color; glDraw(); }
<robclark> so the only reason to look at it for opt is to make sure your heuristics don't pick a dumb path for a dumb case ;-)
<alyssa> fair enough :p
<alyssa> robclark: what's frustrating is that Arm could have avoided this whole mess with a small hw change
<alyssa> and yet
<robclark> repeat after me: glxgears^Daquarium is not a benchmark ;-)
<alyssa> heh
<alyssa> on AGX the driver gets to map as many ranges of GPU memory as it wishes to uniform registers
<alyssa> which means we can push whatever uniforms/UBOs/sysvals we want with no WC readback and no shenanigans
mensi has quit [Remote host closed the connection]
<alyssa> and apple *also* threw in hw support for preambles because they're nice like that
<robclark> from time to time whatever various $vendor gets in a huff about aquarium.. but it is *soo* unrepresentative of reasonable gl workloads
<robclark> it's basically an example of how not to do gl
<robclark> it's basically glxgears with fish
<robclark> (and I'm vegetarian so idc)
* robclark has some hate for dumb benchmarks ;-)
<alyssa> lololol
<alyssa> well, this mess started (in part) with Dolphin
<alyssa> which is a very different example of how not to do gl :~P
<HdkR> It's a whole collection of real games though :P