ChanServ changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://oftc.irclog.whitequark.org/panfrost - <macc24> i have been here before it was popular
<icecream95>
There are a couple of application bugs "fixed" by disabling support for multiple batches, though, so it is used sometimes
<anarsoul>
icecream95: well, we've got nice perf boost in most games in lima once support for multiple jobs was added
<anarsoul>
flush/reload is very expensive on utgard. Maybe not so expensive on newer Malis
<alyssa>
So I've spent too many cycles trying to figure out how to do dependency tracking for Bifrost/Valhall
<alyssa>
Varying loads are an especially weird case due to one weird little edge case
<alyssa>
A varying load with .store on Bifrost or Valhall has to wait for all other varying loads *in the quad* to complete
<alyssa>
Not in the thread, like you'd get from a simple logical dataflow analysis pass that would track the latest varying loads
<alyssa>
but in the quad
<alyssa>
meaning we have to wait for instructions that don't logically execute, if they.. could execute
<alyssa>
The bad case looks something like:
<alyssa>
if (dynamic) {
<alyssa>
x = ld_var()
<alyssa>
} else {
<alyssa>
x = ld_var()
<alyssa>
}
<alyssa>
Logically, a given thread executes only a single ld_var instruction
<alyssa>
Only one side of the if is taken for a given thread
<alyssa>
and if all threads in the quad branch the same way (there is no divergence in the quad), that's true
<alyssa>
But if there is divergence within the quad -- neighbouring pixels take different sides of the if --
<alyssa>
the second ld_var has to wait for the results of the first ld_var
<alyssa>
Other compilers like ACO handle this by maintaining *two* control flow graphs
<alyssa>
the "logical" control flow graph, corresponding to the control flow from a single thread (what you'd get on a CPU),
<alyssa>
and the "physical" control flow graph, corresponding to what the entire subgroup ("warp" or "wave") executes
<alyssa>
so the physical control flow graph would have an edge connecting the two ld_vars, since they are physically connected
<alyssa>
and then you'd do the dataflow analysis on that physical CFG instead
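[Editorial aside: a minimal, hypothetical sketch of the two-CFG idea described above. All block names, edge lists, and instruction names are invented for illustration; this is not ACO or Panfrost code. A forward may-analysis over the logical CFG never sees the first ld_var from the other side of the if, while the physical CFG's extra edge between the two sides exposes the dependency.]

```python
# Invented toy dataflow: which varying loads may still be outstanding
# at each basic block's exit, under a given CFG.

def block_out_sets(blocks, edges, loads):
    """Forward may-analysis to a fixpoint. blocks: list of names;
    edges: list of (src, dst); loads: block -> set of loads issued."""
    out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            preds = [s for (s, d) in edges if d == b]
            inset = set().union(*(out[p] for p in preds))
            new = inset | loads[b]
            if new != out[b]:
                out[b], changed = new, True
    return out

def outstanding_at_entry(out, edges, block):
    preds = [s for (s, d) in edges if d == block]
    return set().union(*(out[p] for p in preds))

blocks = ["entry", "then", "else", "merge"]
loads = {"entry": set(), "then": {"ld_var_A"}, "else": {"ld_var_B"},
         "merge": set()}

# Logical CFG: each thread takes exactly one side of the if.
logical = [("entry", "then"), ("entry", "else"),
           ("then", "merge"), ("else", "merge")]

# Physical CFG: under quad divergence the hardware runs "then" and then
# "else", so an extra edge physically connects the two ld_vars.
physical = logical + [("then", "else")]

log_out = block_out_sets(blocks, logical, loads)
phys_out = block_out_sets(blocks, physical, loads)

print(outstanding_at_entry(log_out, logical, "else"))    # set(): nothing to wait on
print(outstanding_at_entry(phys_out, physical, "else"))  # {'ld_var_A'}: must wait
```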
<alyssa>
Of course that's complicated and almost nowhere else in the compiler is a physical CFG called for and I'd rather not deal with that, thanks :-p
<alyssa>
The band-aid fix is to wait for outstanding ld_vars at the end of each block
<alyssa>
That would only hurt perf in two cases:
<alyssa>
1. ld_var is used as a phi, like in the bad case, and the block is short enough (but the use in the successor far enough away) that this introduces a stall
<alyssa>
2. ld_var executes in a block that dominates its first use (i.e. ld_var before an if/else, used after the if/else).
<alyssa>
Both cases are possible, of course
<alyssa>
But also both are rare due to various aggressive optimizations we do (peephole select, in the future code motion)
<alyssa>
and varying loads are probably not super expensive? so shrug?
<alyssa>
(in terms of added stalls)
<alyssa>
For a first pass approach to sidestep all the tricky data flow questions, this seems like a decent compromise.
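[Editorial aside: the band-aid can be sketched in a few lines. Everything here is invented for illustration (toy mnemonics, list-of-strings IR); the real pass works on the backend IR's wait/slot mechanism.]

```python
# Hypothetical sketch of the band-aid: rather than tracking dependencies
# across blocks, append a wait for outstanding varying loads at the end
# of every block that issued one.

def insert_block_end_waits(blocks):
    """blocks: list of basic blocks, each a list of mnemonic strings."""
    for instrs in blocks:
        has_ld_var = any(i.startswith("ld_var") for i in instrs)
        if has_ld_var and instrs[-1] != "wait_varying":
            instrs.append("wait_varying")
    return blocks

prog = [
    ["ld_var x", "fadd y, x, c"],  # issued a varying load: wait appended
    ["fmul z, y, y"],              # no varying load: left untouched
]
for block in insert_block_end_waits(prog):
    print(block)
```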
<alyssa>
can deal with the extra nonsense if we actually encounter a workload that needs it etc
<alyssa>
and then can move onto more interesting things... like making the code gen for ld_vary not suck so much :-p
<alyssa>
I also suspect most shaders that are hurt by this need some other more powerful optimization
<alyssa>
(Namely code motion)
<alyssa>
and that's probably more interesting ...
<robclark>
alyssa: the cheezy answer is hoist the loads into the first block.. but having to track phys successors isn't that bad, ir3 does it too
<alyssa>
robclark: Yeah, it's not a huge deal either way.. For now will insert waits at the ends of the blocks with a comment to revisit if that's a problem in practice
<alyssa>
Bigger fish to fry right now
<alyssa>
Grumble
<alyssa>
Mali has a fused varying + texture instruction ("VAR_TEX"), but the hardware only supports 32-bit texture coordinates
<alyssa>
So we can fuse LD_VAR.f32 + TEX
<alyssa>
but if there are mediump texture coordinates, and we allow 16-bit varyings (which we really want to for perf), we end up with
<alyssa>
LD_VAR.f16 + F16_TO_F32 + TEX
<alyssa>
It's not so hard to chew through that. But it's wrong!
<alyssa>
"LD_VAR.f16 + F16_TO_F32 + TEX" is very much different than "LD_VAR.f32 + TEX"
<alyssa>
If it were an f2fmp in the middle, it'd be fine. And originally it was! But by the time we get to the backend, the f2fmp is lowered to f2f16.
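[Editorial aside: the precision difference is easy to demonstrate. Rounding a 32-bit value through f16 and back generally changes it, so a TEX consuming the re-expanded coordinate samples a slightly different location than the f32 original. A quick standalone check (not driver code):]

```python
import struct

# Why "LD_VAR.f16 + F16_TO_F32 + TEX" is not equivalent to
# "LD_VAR.f32 + TEX": the f16 round-trip loses precision.
x = 0.1234567                              # an f32-ish coordinate
half = struct.pack("e", x)                 # round to IEEE 754 binary16
through_f16 = struct.unpack("e", half)[0]  # expand back to a wider float
print(x == through_f16)  # False: the coordinate changed under f16 rounding
```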
<alyssa>
So our options are:
<alyssa>
1. Don't lower f2fmp in NIR, propagate the mediump through the backend IR
<alyssa>
2. Teach lower_mediump_io to avoid inserting f2fmp we won't be able to chew through
<alyssa>
3. Add a NIR pass to undo the lower_mediump_io where it's harmful for forming VAR_TEX
<alyssa>
4. Add NIR intrinsics for VAR_TEX and move the entire optimization to a NIR pass.
<alyssa>
#3 is attractive as it avoids any common NIR changes
<alyssa>
#1 sounds awful because even without the f2fmp/f2f16 distinction, chewing through complex expressions is easier in NIR than the backend
<alyssa>
(The opt pass that fuses VAR_TEX was originally built for propagating fsat...)
<alyssa>
#2 and #3 sound like they would need some delicate heuristics that might get out of sync..
<alyssa>
though maybe not..?
<alyssa>
very much machine-specific (VAR_TEX in Bifrost is different than VAR_TEX in Valhall) but well-defined rules, no heuristic needed
<alyssa>
so #3 avoids changing either NIR *or* the backend, which is nice..
<alyssa>
I think that'd be my preference then
<alyssa>
Ended up going with #2 since it's easy to approximate with the existing lower_mediump_io interface
<alyssa>
and in practice does well
<alyssa>
with very little code
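[Editorial aside: the idea behind option #2 reduces to a filter over varyings. The data model below is invented for illustration; the real change hooks into NIR's lower_mediump_io path and inspects actual uses.]

```python
# Hypothetical sketch of option #2's filter: keep a varying at f32 when
# it feeds texture coordinates, so the LD_VAR.f32 + TEX fusion (VAR_TEX)
# remains possible; everything else may be demoted to f16 for perf.

def can_demote_to_f16(varying, uses):
    """uses: dict mapping varying name -> list of consumer kinds."""
    return all(kind != "tex_coord" for kind in uses.get(varying, []))

uses = {
    "v_color":    ["frag_output"],   # safe: LD_VAR.f16 is fine
    "v_texcoord": ["tex_coord"],     # must stay f32 for VAR_TEX
}
print(can_demote_to_f16("v_color", uses))     # True
print(can_demote_to_f16("v_texcoord", uses))  # False
```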
<alyssa>
the good news is now we get optimal code gen on Bifrost for `gl_FragColor = v_color`
<alyssa>
Valhall needs a bit more work to get there
<alyssa>
but this was step #1 for valhall
<CounterPillow>
Oh, flatpak pulled in an old mesa. Let's try again natively with 22.1.0
<CounterPillow>
welp, 22.1.0 fares even worse, just a black screen (resolved by LIBGL_ALWAYS_SOFTWARE=1 so presumably a driver bug). Will write up all the reproduction steps and submit a bug report I guess
<technopoirot>
is there such a thing as a Mali-G31 MP1? or is G31 exclusively MP2?
<xdarklight>
technopoirot: according to the "L2 Cache" section from https://developer.arm.com/Processors/Mali-G31 there is an MP1 (that doesn't mean that anybody ever implemented this though)