ChanServ changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://oftc.irclog.whitequark.org/panfrost - <macc24> i have been here before it was popular
digetx has quit [Remote host closed the connection]
camus has quit [Remote host closed the connection]
camus has joined #panfrost
digetx has joined #panfrost
camus has quit [Remote host closed the connection]
camus has joined #panfrost
camus has quit [Remote host closed the connection]
camus has joined #panfrost
camus has quit [Remote host closed the connection]
camus has joined #panfrost
camus1 has joined #panfrost
camus has quit [Ping timeout: 480 seconds]
Daanct12 has joined #panfrost
camus1 has quit [Remote host closed the connection]
camus has joined #panfrost
camus has quit [Remote host closed the connection]
camus has joined #panfrost
Daanct12 has quit [Quit: Leaving]
Daanct12 has joined #panfrost
icecream95 has joined #panfrost
<alyssa>
ndufresne: I do believe the partial render mechanism is firmware (or even kernel) triggered, not hardware triggered
<alyssa>
so no fault needed
<alyssa>
I assume Apple has the relevant stuff in their stack
<alyssa>
but the partial render infrastructure is but a small piece of that puzzle
hexdump0815 has joined #panfrost
hexdump01 has quit [Ping timeout: 480 seconds]
Daaanct12 has joined #panfrost
Daanct12 has quit [Ping timeout: 480 seconds]
guillaume_g has joined #panfrost
<macc24>
ndufresne: i think there's a way
nlhowell has joined #panfrost
Daaanct12 is now known as Daanct12
Pu244 has quit [Ping timeout: 480 seconds]
rasterman has joined #panfrost
<icecream95>
HdkR: So now I have two different GCC versions installed to make compiling FEX with Clang work?
<HdkR>
icecream95: Hm?
<icecream95>
HdkR: Debian bookworm seems to be in the middle of moving from GCC 11 to 12, and so Clang needs C++ headers from GCC 12 but /usr/bin/gcc is still GCC 11
<HdkR>
Broken clang environment?
<icecream95>
And now it failed to compile: FEXLogServer/Main.cpp:125:24: error: no matching member function for call to 'erase'
<HdkR>
That's pretty bad if it can't even find the erase function from std::unordered_map
<icecream95>
Seems that it thinks the FDs set is const
<HdkR>
That's weirdly broken
pi_ has joined #panfrost
anarsoul|2 has joined #panfrost
anarsoul has quit [Read error: Connection reset by peer]
<icecream95>
The header seems to be broken, __ucont shouldn't be const
* icecream95
changes the header to always use __cont rather than __ucont
<icecream95>
Hmm.. the bug appears to still be in GCC upstream?
<icecream95>
And remove_if is also broken, but in a different way?
MajorBiscuit has joined #panfrost
MajorBiscuit has quit []
MajorBiscuit has joined #panfrost
MajorBiscuit has quit []
* icecream95
instead changes the code to explicitly iterate over PIDs with a for loop
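[editor's note] A minimal sketch of the workaround icecream95 describes: instead of relying on the library's erase helpers, iterate the container explicitly and use the iterator returned by erase(). The container shape (file descriptors keyed by PID) and the is_dead() predicate are illustrative assumptions, not FEX's actual code.

    #include <unordered_map>

    // Erase entries while iterating, without std::erase_if / remove_if.
    void prune_dead_pids(std::unordered_map<int, int>& fds_by_pid,
                         bool (*is_dead)(int pid)) {
        for (auto it = fds_by_pid.begin(); it != fds_by_pid.end(); ) {
            if (is_dead(it->first))
                it = fds_by_pid.erase(it); // erase() returns the next valid iterator
            else
                ++it;
        }
    }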
icecream95 has quit [Ping timeout: 480 seconds]
MajorBiscuit has joined #panfrost
rkanwal has joined #panfrost
nlhowell has quit [Ping timeout: 480 seconds]
icecream95 has joined #panfrost
<icecream95>
HdkR: openssl sha256 is 20x slower than native, FEX is so terrible no one should use it! /s
<icecream95>
(Next release: algebraic optimisations replace code implementing SHA with the corresponding AArch64 instruction)
<HdkR>
icecream95: Did you use latest main?
<HdkR>
icecream95: Also is your CPU new enough to support RCPC? Also did you make sure to compile Release + LTO?
nlhowell has joined #panfrost
<HdkR>
Also if your native CPU supports the SHA256 extensions it is a bit unfair since FEX doesn't support the SHA256 x86 instructions yet :P
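[editor's note] A minimal sketch of the capability check HdkR's point rests on, assuming Linux on AArch64: the kernel advertises the SHA2 crypto extension through the hwcap auxiliary vector.

    #include <sys/auxv.h>
    #include <asm/hwcap.h>
    #include <cstdio>

    int main() {
        // HWCAP_SHA2 is set by the kernel when the CPU implements the
        // AArch64 SHA-256 instructions (SHA256H and friends).
        unsigned long hwcap = getauxval(AT_HWCAP);
        std::printf("SHA2 instructions: %s\n",
                    (hwcap & HWCAP_SHA2) ? "available" : "not available");
        return 0;
    }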
Daanct12 has quit [Ping timeout: 480 seconds]
icecream95 has quit [Ping timeout: 480 seconds]
icecream95 has joined #panfrost
<icecream95>
HdkR: It's not so bad for other programs, I deliberately chose an unfair test
<HdkR>
hah
<HdkR>
In main we just recently fixed RCPC which can make things slightly faster, but it also makes the non-TSO path more stable. Can give like a free 3x perf boost if the program is sane.
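[editor's note] To illustrate why RCPC matters here: x86 loads carry acquire-like ordering under TSO, so an emulator must give emulated loads at least acquire semantics, and the RCPC extension (LDAPR) makes acquire loads much cheaper than a full barrier. A hedged sketch of the principle, not FEX's actual codegen:

    #include <atomic>

    // Emulating an x86 load under TSO needs acquire ordering; with RCPC
    // available, compilers can lower this to a single LDAPR instruction
    // instead of LDR plus a barrier.
    int load_like_x86(const std::atomic<int>& v) {
        return v.load(std::memory_order_acquire);
    }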
<icecream95>
Cool, so 1.5x native performance!
<HdkR>
You say that, but we've actually had some instances where we were running emulated code faster than native.
nlhowell is now known as Guest298
nlhowell has joined #panfrost
Guest298 has quit [Ping timeout: 480 seconds]
nlhowell is now known as Guest300
nlhowell has joined #panfrost
Guest300 has quit [Ping timeout: 480 seconds]
nlhowell has quit [Ping timeout: 480 seconds]
pjakobsson has quit [Remote host closed the connection]
<alyssa>
HdkR: :D
<alyssa>
I can see that if the native code was built with -O0 or something :p
<robmur01>
nah, it's like a return to the mid-'90s when people were buying DEC Alphas to run their 16-bit x86 code under Windows faster than Pentiums could :D
<macc24>
o_o
<icecream95>
Looking through an old chroot from 2017... Anyone here remember the chromebook-setup.sh script for installing a blobby debian on the old Samsung chromebooks and C201?
<icecream95>
I installed Debian Jessie for that because I thought that 4.9 (which the script would download if cross-compiling) was the only GCC version which could compile the kernel
<icecream95>
macc24: Nope, that's diff... wait a second that looks very similar, it must be a newer version of the script
<icecream95>
But it's $THE_CURRENT_YEAR, why is the script still being updated?
<macc24>
icecream95: because people maintain their stuff
<macc24>
*hides her stuff*
<macc24>
usually
<alyssa>
icecream95: by the way, I spent most of last week thinking through what we want long-term out of bifrost/valhall RA
<alyssa>
I'm still on the fence
<icecream95>
macc24: Wait that script doesn't use the blob, so I guess it's fine
<macc24>
icecream95: what blob?
* icecream95
pushes alyssa off the fence
<alyssa>
In the short term, I think I convinced myself I want the nodearray patches, at least in the short/medium term
<icecream95>
macc24: mali
* macc24
shivers
<alyssa>
(At least for the interference matrix. I haven't thought through the liveness analysis side of it.)
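[editor's note] For context, a dense interference matrix of the kind the nodearray patches target can be as simple as one bit per unordered pair of RA nodes. A toy sketch with illustrative names, not Mesa's actual API:

    #include <vector>
    #include <cstdint>

    struct InterferenceMatrix {
        unsigned n;                 // number of RA nodes
        std::vector<uint8_t> bits;  // one bit per unordered node pair

        explicit InterferenceMatrix(unsigned nodes)
            : n(nodes), bits(((size_t)nodes * (nodes - 1) / 2 + 7) / 8, 0) {}

        // Lower-triangle index; requires i > j.
        static size_t index(unsigned i, unsigned j) {
            return (size_t)i * (i - 1) / 2 + j;
        }

        void add(unsigned a, unsigned b) {
            if (a == b) return;
            size_t k = index(a > b ? a : b, a > b ? b : a);
            bits[k / 8] |= 1u << (k % 8);
        }

        bool interferes(unsigned a, unsigned b) const {
            if (a == b) return false;
            size_t k = index(a > b ? a : b, a > b ? b : a);
            return bits[k / 8] & (1u << (k % 8));
        }
    };

The flat bit array keeps the whole triangle in one allocation, which is why this layout tends to beat pointer-chasing structures for interference queries.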
<alyssa>
graph colouring sucks, linear scan sucks, lcra sucks, ssa-based ra sucks, it all sucks
<alyssa>
Pick a poison..
<icecream95>
I think there were fairly equal performance gains from both interference and liveness
<alyssa>
I'd believe it, just haven't thought through it yet
<icecream95>
(Which means that both together provide a far larger relative speedup than either alone)
<alyssa>
right
<alyssa>
these are both cases where tree scan is an unequivocally better algorithm
<HdkR>
alyssa: It's mostly because ARM-compiled code is significantly less aggressive about loop unrolling, as far as I can tell.
<macc24>
HdkR: because arm chips tend to have lower cache sizes, right?
<alyssa>
(there's no explicit interference graph, and in theory there's no explicit liveness tracking like we have to do with graph colouring / LCRA)
<alyssa>
Of course, tree scan brings a laundry list of its own problems.
<alyssa>
(All of which have solutions, but complex ones.)
<icecream95>
What sort of problems are there?
<alyssa>
1. Live range splitting is essentially non-optional. It's not 'hard' but it's complicated.
<alyssa>
2. Better spilling is also essentially non-optional. We'd want this anyway, but it seems complicated (at least the ir3 impl does ...)
<alyssa>
3. The IR needs to be scalarized (no more `bi_word`, instead SPLIT and COLLECT pseudoinstructions). This wouldn't be so bad, except...
<alyssa>
4. Splits need to be coalesced. On paper they can /always/ be coalesced. But the obvious design of a tree scan allocator won't do so. (ir3 has a very complicated solution for this. I have a prototype bifrost RA with a very simple one, but it's unclear whether mine could support live range splitting so that might not be useful.)
<alyssa>
5. Collects need to be coalesced. This isn't always possible (it should be clear why), so you need some decent heuristic. Similar issues as with the splits.
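[editor's note] To make point 3 concrete: in a scalarised IR, a vector ALU op becomes scalar ops bracketed by SPLIT and COLLECT pseudoinstructions, so RA only ever sees scalar nodes (and points 4 and 5 are about making those pseudoinstructions free again). A toy encoding with made-up value numbers, not bifrost's actual IR:

    #include <vector>
    #include <cstdio>

    enum class Op { SPLIT, COLLECT, FADD };

    struct Instr {
        Op op;
        std::vector<int> defs;  // SSA values written
        std::vector<int> srcs;  // SSA values read
    };

    // Before: v2 = fadd v0.xy, v1.xy   (one vec2 node for RA)
    // After scalarisation (v0 = {s10,s11}, v1 = {s12,s13}, v2 = {s14,s15}):
    std::vector<Instr> scalarised = {
        {Op::SPLIT,   {10, 11}, {0}},       // split v0 into components
        {Op::SPLIT,   {12, 13}, {1}},       // split v1 into components
        {Op::FADD,    {14},     {10, 12}},  // x component
        {Op::FADD,    {15},     {11, 13}},  // y component
        {Op::COLLECT, {2},      {14, 15}},  // regroup the result as v2
    };

    int main() {
        std::printf("%zu instructions after scalarisation\n", scalarised.size());
    }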
<alyssa>
ir3 is a state-of-the-art RA ... but its RA only is like 5kloc
<alyssa>
s/only/alone/
<alyssa>
ACO fails to coalesce some things even in easy cases and its RA isn't much simpler.
<alyssa>
it's slightly unfair because we should include nir_convert_from_ssa in our accounting too
<alyssa>
but even with that, our RA is, what, 1500 lines? and works reasonably well?
<icecream95>
So... after trying to implement it yourself, you came to the expected conclusion that SSA RA takes a lot of code? :P
<alyssa>
2000 with nodearray stuff?
<alyssa>
icecream you know how stubborn I am ;p
<icecream95>
You think you're stubborn? :P
<alyssa>
TBH the more bothersome part is how little of this code can be shared between backends
<alyssa>
take parallel copy lowering
<alyssa>
(500 lines or so in ir3)
<alyssa>
The core algorithm is the same for everyone and can be shared, I tried this
<alyssa>
but the details are machine-specific (where does bifrost's fp16vec2 fit in? or ir3 sct/gat instructions? or aco's baffling subword shuffle hacks?) and not obvious how to generalize
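[editor's note] The shareable core alyssa refers to is small: sequentialise a parallel copy with plain moves, breaking cycles through one scratch register. A machine-independent sketch under those assumptions; the machine-specific cases listed above (fp16vec2, sct/gat, subword shuffles) are exactly what it leaves out:

    #include <vector>
    #include <map>
    #include <functional>
    #include <utility>

    // copies: {dst, src} pairs with distinct destinations.
    // scratch: a register guaranteed not to appear in `copies`.
    void lower_parallel_copy(std::vector<std::pair<int, int>> copies,
                             int scratch,
                             const std::function<void(int, int)>& emit_move) {
        std::map<int, int> pending;  // dst -> src
        std::map<int, int> readers;  // how many pending copies read each reg
        for (auto [dst, src] : copies) {
            if (dst == src) continue;          // self-copies are no-ops
            pending[dst] = src;
            readers[src]++;
        }

        while (!pending.empty()) {
            bool progress = false;
            for (auto it = pending.begin(); it != pending.end(); ) {
                auto [dst, src] = *it;
                if (readers[dst] == 0) {       // nothing still reads dst: safe
                    emit_move(dst, src);
                    readers[src]--;
                    it = pending.erase(it);
                    progress = true;
                } else {
                    ++it;
                }
            }
            if (!progress) {
                // Only cycles remain: park one value in the scratch register
                // and redirect its readers there, which breaks the cycle.
                int dst = pending.begin()->first;
                emit_move(scratch, dst);
                for (auto& entry : pending)
                    if (entry.second == dst) entry.second = scratch;
                readers[scratch] = readers[dst];
                readers[dst] = 0;
            }
        }
    }

For a swap {a:=b, b:=a} this emits scratch:=a; a:=b; b:=scratch, and for acyclic chains it emits plain moves in dependency order.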
<alyssa>
and there's no good way to share a backend IR across backends, given how "weird" GPUs are
<alyssa>
(both the biggest advantage LLVM has over NIR, and the biggest problem with LLVM for GPUs)
<icecream95>
i know what if we made GPU vendors standardise on a common ISA?
<alyssa>
oo i know we should make all GPUs have SPIR-V as their common ISA
<alyssa>
and then we don't need NIR
<icecream95>
("Then they wouldn't be able to vendor-lockin their tools")
<alyssa>
because vulkan gives you spir-v so the hardware should take spir-v, and everything works so good
<alyssa>
so yeah. spir-v hardware for everyone!
<icecream95>
Uh... err... like spir-v is great... but are you going to make the GPU parse struct definitions and so forth?
<alyssa>
yes obviously
<alyssa>
i mean, errrr, what are structs?
<icecream95>
OpTypeStruct: Declare a new structure type.
<robmur01>
pff, if it was good enough for iAPX 432, it's surely good enough for GPUs :P
<icecream95>
Hmm.. did iAPX 432 use capabilities? It only took 40 years for that to make a comeback
<icecream95>
IBM removed capability support because "they could find no way to revoke capabilities". ISTR that CHERI has quite a complex scheme for managing that
<alyssa>
eh, the new isa seems great
<alyssa>
capabilities are just the cheri on top
<icecream95>
What does this remind me of? "The instruction set also used bit-aligned variable-length instructions"
<alyssa>
Oh no
<icecream95>
But this is far worse; instructions range from 6 to 321 bits
<alyssa>
O_O
<icecream95>
Fun... I'm running the samples from the old Mali GLES SDK for I think the first time ever, and of course I've hit a Panfrost bug already
<alyssa>
You're good at hitting Panfrost bugs :-P
<icecream95>
And now there is a bug where an SDK function is clashing with one from C++17
<icecream95>
Ooh fun, another Panfrost bug
* icecream95
thinks that it is maybe time to sleep
* robmur01
keeps foolishly trying glmark as a sanity-check workload for horrible DMA stuff on a 5.10 kernel... such bug, so crash, etc.
<icecream95>
g'night
icecream95 has left #panfrost [rcirc on GNU Emacs 28.1]
<q4a>
Hi. I'm testing panfrost from Mesa 22. In glxinfo -B I can see: OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.10, but OpenGL core profile shading language version string: 1.40
<q4a>
1.40 is a bit lower than I thought. Is this my mistake, or does panfrost not support GLSL higher than 1.40?
<alyssa>
alarumbe: a cycle isn't necessarily wrong
<alyssa>
bbrezillon: had a cool use for one, implementing indirect multidraw
<alyssa>
(The code was scary, but worked AFAIK)
<alyssa>
alarumbe: Largely I'm fine with the code there, but the kernel side needs to be merged first
<alarumbe>
the problem is pandecode will go into an infinite loop and the crashdump analyser will never finish
<alyssa>
Yes, I understand the problem
<alyssa>
I think I'm ok with the solution, since no well behaved workload will hit that
<alarumbe>
I think Boris told me a few months back I should get the mesa changes accepted first and then send the kernel patch
<alyssa>
Ideally review of both happens together
<alarumbe>
no worries then, I'll prepare the kernel patch and submit it asap
<alarumbe>
I would've picked a less 'bulky' solution for storing the already-visited BO job numbers, but since the closest thing Mesa seems to have to a binary search tree is the rb-tree API, I went with that instead
<alyssa>
meh, it's not a hot path
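[editor's note] A minimal sketch of the visited-set guard alarumbe describes, with std::set standing in for Mesa's rb-tree API: record each job index as it is decoded and bail out the first time one repeats, so a cyclic job chain can't loop the decoder forever.

    #include <set>
    #include <cstdio>

    struct Job { unsigned index; const Job* next; };

    void decode_job_chain(const Job* job) {
        std::set<unsigned> visited;  // indices of jobs already decoded
        while (job) {
            if (!visited.insert(job->index).second) {
                std::printf("cycle at job %u, stopping decode\n", job->index);
                return;
            }
            // ... decode this job's descriptors here ...
            job = job->next;
        }
    }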
<icecream95>
alarumbe: It might be useful to have a way to replay the job on the GPU