ChanServ changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://oftc.irclog.whitequark.org/panfrost - <macc24> i have been here before it was popular
<icecream95>
There are a couple of application bugs "fixed" by disabling support for multiple batches, though, so it is used sometimes
<anarsoul>
icecream95: well, we've got nice perf boost in most games in lima once support for multiple jobs was added
<anarsoul>
flush/reload is very expensive on utgard. Maybe not so expensive on newer Malis
<alyssa>
So I've spent too many cycles trying to figure out how to do dependency tracking for Bifrost/Valhall
<alyssa>
Varying loads are an especially weird case due to one weird little edge case
<alyssa>
A varying load with .store on Bifrost or Valhall has to wait for all other varying loads *in the quad* to complete
<alyssa>
Not in the thread, like you'd get from a simple logical dataflow analysis pass that would track the latest varying loads
<alyssa>
but in the quad
<alyssa>
meaning we have to wait for instructions that don't logically execute, if they.. could execute
<alyssa>
The bad case looks something like:
<alyssa>
if (dynamic) {
<alyssa>
x = ld_var()
<alyssa>
} else {
<alyssa>
x = ld_var()
<alyssa>
}
<alyssa>
Logically, a given thread executes only a single ld_var instruction
<alyssa>
Only one side of the if is taken for a given thread
<alyssa>
and if all threads in the quad branch the same way (there is no divergence in the quad), that's true
<alyssa>
But if there is divergence within the quad -- neighbouring pixels take different sides of the if --
<alyssa>
the second ld_var has to wait for the results of the first ld_var
<alyssa>
Other compilers like ACO handle this by maintaining *two* control flow graphs
<alyssa>
the "logical" control flow graph, corresponding to the control flow from a single thread (what you'd get on a CPU),
<alyssa>
and the "physical" control flow graph, corresponding to what the entire subgroup ("warp" or "wave") executes
<alyssa>
so the physical control flow graph would have an edge connecting the two ld_vars, since they are physically connected
<alyssa>
and then you'd do the dataflow analysis on that physical CFG instead
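[Editorial aside: a minimal, hypothetical sketch of the two-CFG idea described above. All block names, edge lists, and instruction names are invented for illustration; this is not ACO or Panfrost code. A forward may-analysis over the logical CFG never sees the first ld_var from the other side of the if, while the physical CFG's extra edge between the two sides exposes the dependency.]

```python
# Invented toy dataflow: which varying loads may still be outstanding
# at each basic block's exit, under a given CFG.

def block_out_sets(blocks, edges, loads):
    """Forward may-analysis to a fixpoint. blocks: list of names;
    edges: list of (src, dst); loads: block -> set of loads issued."""
    out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            preds = [s for (s, d) in edges if d == b]
            inset = set().union(*(out[p] for p in preds))
            new = inset | loads[b]
            if new != out[b]:
                out[b], changed = new, True
    return out

def outstanding_at_entry(out, edges, block):
    preds = [s for (s, d) in edges if d == block]
    return set().union(*(out[p] for p in preds))

blocks = ["entry", "then", "else", "merge"]
loads = {"entry": set(), "then": {"ld_var_A"}, "else": {"ld_var_B"},
         "merge": set()}

# Logical CFG: each thread takes exactly one side of the if.
logical = [("entry", "then"), ("entry", "else"),
           ("then", "merge"), ("else", "merge")]

# Physical CFG: under quad divergence the hardware runs "then" and then
# "else", so an extra edge physically connects the two ld_vars.
physical = logical + [("then", "else")]

log_out = block_out_sets(blocks, logical, loads)
phys_out = block_out_sets(blocks, physical, loads)

print(outstanding_at_entry(log_out, logical, "else"))    # set(): nothing to wait on
print(outstanding_at_entry(phys_out, physical, "else"))  # {'ld_var_A'}: must wait
```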
<alyssa>
Of course that's complicated and almost nowhere else in the compiler is a physical CFG called for and I'd rather not deal with that, thanks :-p
<alyssa>
The band-aid fix is to wait for outstanding ld_vars at the end of each block
<alyssa>
That would only hurt perf in two cases:
<alyssa>
1. ld_var is used as a phi, like in the bad case, and the block is short enough (but the use in the successor far enough away) that this introduces a stall
<alyssa>
2. ld_var executes in a block that dominates its first use (i.e. ld_var before an if/else, used after the if/else).
<alyssa>
Both cases are possible, of course
<alyssa>
But also both are rare due to various aggressive optimizations we do (peephole select, in the future code motion)
<alyssa>
and varying loads are probably not super expensive? so shrug?
<alyssa>
(in terms of added stalls)
<alyssa>
For a first pass approach to sidestep all the tricky data flow questions, this seems like a decent compromise.
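[Editorial aside: the band-aid can be sketched in a few lines. Everything here is invented for illustration (toy mnemonics, list-of-strings IR); the real pass works on the backend IR's wait/slot mechanism.]

```python
# Hypothetical sketch of the band-aid: rather than tracking dependencies
# across blocks, append a wait for outstanding varying loads at the end
# of every block that issued one.

def insert_block_end_waits(blocks):
    """blocks: list of basic blocks, each a list of mnemonic strings."""
    for instrs in blocks:
        has_ld_var = any(i.startswith("ld_var") for i in instrs)
        if has_ld_var and instrs[-1] != "wait_varying":
            instrs.append("wait_varying")
    return blocks

prog = [
    ["ld_var x", "fadd y, x, c"],  # issued a varying load: wait appended
    ["fmul z, y, y"],              # no varying load: left untouched
]
for block in insert_block_end_waits(prog):
    print(block)
```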
<alyssa>
can deal with the extra nonsense if we actually encounter a workload that needs it etc
<alyssa>
and then can move onto more interesting things... like making the code gen for ld_vary not suck so much :-p
<alyssa>
I also suspect most shaders that are hurt by this need some other more powerful optimization
<alyssa>
(Namely code motion)
<alyssa>
and that's probably more interesting ...
<robclark>
alyssa: the cheezy answer is hoist the loads into the first block.. but having to track phys successors isn't that bad, ir3 does it too
<alyssa>
robclark: Yeah, it's not a huge deal either way.. For now will insert waits at the ends of the blocks with a comment to revisit if that's a problem in practice
<alyssa>
Bigger fish to fry right now
<alyssa>
Grumble
<alyssa>
Mali has a fused varying + texture instruction ("VAR_TEX"), but the hardware only supports 32-bit texture coordinates
<alyssa>
So we can fuse LD_VAR.f32 + TEX
<alyssa>
but if there are mediump texture coordinates, and we allow 16-bit varyings (which we really want to for perf), we end up with
<alyssa>
LD_VAR.f16 + F16_TO_F32 + TEX
<alyssa>
It's not so hard to chew through that. But it's wrong!
<alyssa>
"LD_VAR.f16 + F16_TO_F32 + TEX" is very much different than "LD_VAR.f32 + TEX"
<alyssa>
If it were an f2fmp in the middle, it'd be fine. And originally it was! But by the time we get to the backend, the f2fmp is lowered to f2f16.
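[Editorial aside: the precision difference is easy to demonstrate. Rounding a 32-bit value through f16 and back generally changes it, so a TEX consuming the re-expanded coordinate samples a slightly different location than the f32 original. A quick standalone check (not driver code):]

```python
import struct

# Why "LD_VAR.f16 + F16_TO_F32 + TEX" is not equivalent to
# "LD_VAR.f32 + TEX": the f16 round-trip loses precision.
x = 0.1234567                              # an f32-ish coordinate
half = struct.pack("e", x)                 # round to IEEE 754 binary16
through_f16 = struct.unpack("e", half)[0]  # expand back to a wider float
print(x == through_f16)  # False: the coordinate changed under f16 rounding
```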
<alyssa>
So our options are:
<alyssa>
1. Don't lower f2fmp in NIR, propagate the mediump through the backend IR
<alyssa>
2. Teach lower_mediump_io to avoid inserting f2fmp we won't be able to chew through
<alyssa>
3. Add a NIR pass to undo the lower_mediump_io where it's harmful for forming VAR_TEX
<alyssa>
4. Add NIR intrinsics for VAR_TEX and move the entire optimization to a NIR pass.
<alyssa>
#3 is attractive as it avoids any common NIR changes
<alyssa>
#1 sounds awful because even without the f2fmp/f2f16 distinction, chewing through complex expressions is easier in NIR than the backend
<alyssa>
(The opt pass that fuses VAR_TEX was originally built for propagating fsat...)
<alyssa>
#2 and #3 sound like they would need some delicate heuristics that might get out of sync..
<alyssa>
though maybe not..?
<alyssa>
very much machine-specific (VAR_TEX in Bifrost is different than VAR_TEX in Valhall) but well-defined rules, no heuristic needed
<alyssa>
so #3 avoids changing either NIR *or* the backend, which is nice..
<alyssa>
I think that'd be my preference then
<alyssa>
Ended up going with #2 since it's easy to approximate with the existing lower_mediump_io interface
<alyssa>
and in practice does well
<alyssa>
with very little code
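[Editorial aside: the idea behind option #2 reduces to a filter over varyings. The data model below is invented for illustration; the real change hooks into NIR's lower_mediump_io path and inspects actual uses.]

```python
# Hypothetical sketch of option #2's filter: keep a varying at f32 when
# it feeds texture coordinates, so the LD_VAR.f32 + TEX fusion (VAR_TEX)
# remains possible; everything else may be demoted to f16 for perf.

def can_demote_to_f16(varying, uses):
    """uses: dict mapping varying name -> list of consumer kinds."""
    return all(kind != "tex_coord" for kind in uses.get(varying, []))

uses = {
    "v_color":    ["frag_output"],   # safe: LD_VAR.f16 is fine
    "v_texcoord": ["tex_coord"],     # must stay f32 for VAR_TEX
}
print(can_demote_to_f16("v_color", uses))     # True
print(can_demote_to_f16("v_texcoord", uses))  # False
```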
<alyssa>
the good news is now we get optimal code gen on Bifrost for `gl_FragColor = v_color`
<alyssa>
Valhall needs a bit more work to get there
<alyssa>
but this was step #1 for valhall
<CounterPillow>
Oh, flatpak pulled in an old mesa. Let's try again natively with 22.1.0
<CounterPillow>
welp, 22.1.0 fares even worse, just a black screen (resolved by LIBGL_ALWAYS_SOFTWARE=1 so presumably a driver bug). Will write up all the reproduction steps and submit a bug report I guess
<technopoirot>
is there such a thing as a Mali-G31 MP1? or is G31 exclusively MP2?
<xdarklight>
technopoirot: according to the "L2 Cache" section from https://developer.arm.com/Processors/Mali-G31 there is an MP1 (that doesn't mean that anybody ever implemented this though)