ChanServ changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://oftc.irclog.whitequark.org/panfrost - <macc24> i have been here before it was popular
<icecream95> There are a couple of application bugs "fixed" by disabling support for multiple batches, though, so it is used sometimes
<anarsoul> icecream95: well, we got a nice perf boost in most games in lima once support for multiple jobs was added
<anarsoul> flush/reload is very expensive on utgard. Maybe not so expensive on newer Malis
<alyssa> So I've spent too many cycles trying to figure out how to do dependency tracking for Bifrost/Valhall
<alyssa> Varying loads are an especially weird case due to one weird little edge case
<alyssa> A varying load with .store on Bifrost or Valhall has to wait for all other varying loads *in the quad* to complete
<alyssa> Not in the thread, like you'd get from a simple logical dataflow analysis pass that would track the latest varying loads
<alyssa> but in the quad
<alyssa> meaning we have to wait for instructions that don't logically execute, if they.. could execute
<alyssa> The bad case looks something like:
<alyssa> if (dynamic) {
<alyssa> x = ld_var()
<alyssa> } else {
<alyssa> x = ld_var()
<alyssa> }
<alyssa> Logically, a given thread executes only a single ld_var instruction
<alyssa> Only one side of the if is taken for a given thread
<alyssa> and if all threads in the quad branch the same way (there is no divergence in the quad), that's true
<alyssa> But if there is divergence within the quad -- neighbouring pixels take different sides of the if --
<alyssa> the second ld_var has to wait for the results of the first ld_var
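Concretely, with comments (same pseudocode as the snippet above; ld_var() is a stand-in for the hardware varying load, not real GLSL):

    /* Threads A and B sit in the same 2x2 quad and `dynamic` is
     * non-uniform (e.g. derived from gl_FragCoord), so A takes the
     * then-side and B takes the else-side: */
    if (dynamic) {
        x = ld_var();   /* issued by thread A */
    } else {
        x = ld_var();   /* issued by thread B -- must still wait on
                         * thread A's ld_var above, because the wait
                         * semantics are per-quad, not per-thread */
    }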
<alyssa> Other compilers like ACO handle this by maintaining *two* control flow graphs
<alyssa> the "logical" control flow graph, corresponding to the control flow from a single thread (what you'd get on a CPU),
<alyssa> and the "physical" control flow graph, corresponding to what the entire subgroup ("warp" or "wave") executes
<alyssa> so the physical control flow graph would have an edge connecting the two ld_vars, since they are physically connected
<alyssa> and then you'd do the dataflow analysis on that physical CFG instead
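As a sketch, that two-graph bookkeeping might look like the following (hypothetical C, not ACO's or anyone's actual data structures):

    #include <stdint.h>

    #define MAX_SUCCS 4

    /* Hypothetical IR block. Logical successors model single-thread
     * control flow; physical successors model what the whole quad or
     * subgroup can execute, e.g. the then-block of a divergent if
     * gets a physical edge into the else-block -- the edge that
     * connects the two ld_vars in the bad case above. */
    struct block {
       struct block *logical_succs[MAX_SUCCS];
       struct block *physical_succs[MAX_SUCCS];
       unsigned num_logical_succs;
       unsigned num_physical_succs;

       /* Dataflow state: bitmask of scoreboard slots with an ld_var
        * still outstanding at the end of this block, propagated to a
        * fixpoint along physical_succs rather than logical_succs. */
       uint32_t pending_ld_vars;
    };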
<alyssa> Of course that's complicated and almost nowhere else in the compiler is a physical CFG called for and I'd rather not deal with that, thanks :-p
<alyssa> The band-aid fix is to wait for outstanding ld_vars at the end of each block
<alyssa> That would only hurt perf in two cases:
<alyssa> 1. ld_var is used as a phi, like in the bad case, and the block is short enough (but the use in the successor far enough away) that this introduces a stall
<alyssa> 2. ld_var executes in a block that dominates its first use (i.e. ld_var before an if/else, used after the if/else).
<alyssa> Both cases are possible, of course
<alyssa> But also both are rare due to various aggressive optimizations we do (peephole select, in the future code motion)
<alyssa> and varying loads are probably not super expensive? so shrug?
<alyssa> (in terms of added stalls)
<alyssa> For a first pass approach to sidestep all the tricky data flow questions, this seems like a decent compromise.
<alyssa> can deal with the extra nonsense if we actually encounter a workload that needs it etc
<alyssa> and then can move onto more interesting things... like making the code gen for ld_vary not suck so much :-p
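As a sketch, the band-aid could look something like this (hypothetical C with made-up IR names, not the actual panfrost pass):

    #include <stdint.h>

    /* Minimal, hypothetical backend IR -- just enough to show the
     * band-aid; names don't match the real Bifrost compiler. */
    enum ir_op { OP_LD_VAR, OP_OTHER };

    struct ir_instr {
       enum ir_op op;
       unsigned slot;        /* scoreboard slot written by OP_LD_VAR */
       uint32_t wait_mask;   /* scoreboard slots this instruction waits on */
       struct ir_instr *next;
    };

    struct ir_block {
       struct ir_instr *first;
    };

    /* Assumed helper: append a wait on `mask` before the block's
     * terminator (implementation elided). */
    void ir_emit_wait_at_end(struct ir_block *block, uint32_t mask);

    /* The band-aid: at the end of every block, wait on any ld_var
     * that hasn't been waited on yet, so nothing is outstanding
     * across a (possibly divergent) control flow edge. */
    static void
    wait_ld_vars_at_block_end(struct ir_block *block)
    {
       uint32_t pending = 0;

       for (struct ir_instr *I = block->first; I; I = I->next) {
          pending &= ~I->wait_mask;      /* these slots were waited on */
          if (I->op == OP_LD_VAR)
             pending |= 1u << I->slot;   /* new outstanding load */
       }

       if (pending)
          ir_emit_wait_at_end(block, pending);
    }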
<alyssa> I also suspect most shaders that are hurt by this need some other more powerful optimization
<alyssa> (Namely code motion)
<alyssa> and that's probably more interesting ...
<robclark> alyssa: the cheezy answer is hoist the loads into the first block.. but having to track phys successors isn't that bad, ir3 does it too
<alyssa> robclark: Yeah, it's not a huge deal either way.. For now I'll insert waits at the ends of the blocks with a comment to revisit if that's a problem in practice
<alyssa> Bigger fish to fry right now
<alyssa> Grumble
<alyssa> Mali has a fused varying + texture instruction ("VAR_TEX"), but the hardware only supports 32-bit texture coordinates
<alyssa> So we can fuse LD_VAR.f32 + TEX
<alyssa> but if there are mediump texture coordinates, and we allow 16-bit varyings (which we really want to for perf), we end up with
<alyssa> LD_VAR.f16 + F16_TO_F32 + TEX
<alyssa> It's not so hard to chew through that. But it's wrong!
<alyssa> "LD_VAR.f16 + F16_TO_F32 + TEX" is very much different than "LD_VAR.f32 + TEX"
<alyssa> If it were an f2fmp in the middle, it'd be fine. And originally it was! But by the time we get to the backend, the f2fmp is lowered to f2f16.
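Spelled out in pseudo-IR (hypothetical notation), the hint-vs-conversion difference:

    /* Fusable: 32-bit coordinates feed the texture op directly. */
    c = LD_VAR.f32 coord      /* folds into a single VAR_TEX */
    t = TEX c

    /* Not fusable: once f2fmp ("this *may* be mediump", droppable)
     * is lowered to f2f16, the conversion has real semantics -- the
     * coordinate genuinely passed through 16 bits -- so rewriting
     * it back to LD_VAR.f32 + TEX would change results. */
    ch = LD_VAR.f16 coord
    c  = F16_TO_F32 ch
    t  = TEX c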
<alyssa> So our options are:
<alyssa> 1. Don't lower f2fmp in NIR, propagate the mediump through the backend IR
<alyssa> 2. Teach lower_mediump_io to avoid inserting f2fmp we won't be able to chew through
<alyssa> 3. Add a NIR pass to undo the lower_mediump_io where it's harmful for forming VAR_TEX
<alyssa> 4. Add NIR intrinsics for VAR_TEX and move the entire optimization to a NIR pass.
<alyssa> #3 is attractive as it avoids any common NIR changes
<alyssa> #1 sounds awful because even without the f2fmp/f2f16 distinction, chewing through complex expressions is easier in NIR than the backend
<alyssa> (The opt pass that fuses VAR_TEX was originally built for propagating fsat...)
<alyssa> #2 and #3 sound like they would need some delicate heuristics that might get out of sync..
<alyssa> though maybe not..?
<alyssa> the rules are very much machine-specific (VAR_TEX on Bifrost is different from VAR_TEX on Valhall) but well-defined, so no heuristics needed
<alyssa> so #3 avoids changing either NIR *or* the backend, which is nice..
<alyssa> I think that'd be my preference then
<alyssa> Ended up going with #2 since it's easy to approximate with the existing lower_mediump_io interface
<alyssa> and in practice does well
<alyssa> with very little code
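A sketch of what that approximation might look like (hypothetical; collect_texcoord_slots() is an assumed helper, while nir_lower_mediump_io() and its slot mask are the existing interface referred to above):

    #include "compiler/nir/nir.h"

    /* Assumed helper (elided): bitmask of input slots whose loads
     * feed texture coordinates. */
    uint64_t collect_texcoord_slots(nir_shader *nir);

    static void
    lower_mediump_varyings(nir_shader *nir)
    {
       uint64_t mask = nir->info.inputs_read;

       /* Keep texture coordinates at full precision so LD_VAR + TEX
        * can still fuse into VAR_TEX in the backend. */
       mask &= ~collect_texcoord_slots(nir);

       nir_lower_mediump_io(nir, nir_var_shader_in, mask, true);
    }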
<alyssa> the good news is now we get optimal code gen on Bifrost for `gl_FragColor = v_color`
<alyssa> Valhall needs a bit more work to get there
<alyssa> but this was step #1 for valhall
<CounterPillow> I think bifrost might have some issues with mupen https://0x0.st/oBKd.png
<HdkR> I see nothing wrong here
<CounterPillow> Oh, flatpak pulled in an old mesa. Let's try again natively with 22.1.0
<CounterPillow> welp, 22.1.0 fares even worse, just a black screen (resolved by LIBGL_ALWAYS_SOFTWARE=1 so presumably a driver bug). Will write up all the reproduction steps and submit a bug report I guess
<technopoirot> is there such a thing as a Mali-G31 MP1? or is G31 exclusively MP2?
<xdarklight> technopoirot: according to the "L2 Cache" section from https://developer.arm.com/Processors/Mali-G31 there is an MP1 (that doesn't mean that anybody ever implemented this though)