ChanServ changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://oftc.irclog.whitequark.org/panfrost - <macc24> i have been here before it was popular
digetx has quit [Remote host closed the connection]
camus has quit [Remote host closed the connection]
camus has joined #panfrost
digetx has joined #panfrost
camus has quit [Remote host closed the connection]
camus has joined #panfrost
camus has quit [Remote host closed the connection]
camus has joined #panfrost
camus has quit [Remote host closed the connection]
camus has joined #panfrost
camus1 has joined #panfrost
camus has quit [Ping timeout: 480 seconds]
Daanct12 has joined #panfrost
camus1 has quit [Remote host closed the connection]
camus has joined #panfrost
camus has quit [Remote host closed the connection]
camus has joined #panfrost
Daanct12 has quit [Quit: Leaving]
Daanct12 has joined #panfrost
icecream95 has joined #panfrost
<alyssa> ndufresne: I do believe the partial render mechanism is firmware (or even kernel) triggered, not hardware triggered
<alyssa> so no fault needed
<alyssa> I assume Apple has the relevant stuff in their stack
<alyssa> but the partial render infrastructure is but a small piece of that puzzle
hexdump0815 has joined #panfrost
hexdump01 has quit [Ping timeout: 480 seconds]
Daaanct12 has joined #panfrost
Daanct12 has quit [Ping timeout: 480 seconds]
guillaume_g has joined #panfrost
<macc24> ndufresne: i think there's a way
nlhowell has joined #panfrost
Daaanct12 is now known as Daanct12
Pu244 has quit [Ping timeout: 480 seconds]
rasterman has joined #panfrost
<icecream95> HdkR: So now I have two different GCC versions installed to make compiling FEX with Clang work?
<HdkR> icecream95: Hm?
<icecream95> HdkR: Debian bookworm seems to be in the middle of moving from GCC 11 to 12, and so Clang needs C++ headers from GCC 12 but /usr/bin/gcc is still GCC 11
<HdkR> Broken clang environment?
<icecream95> And now it failed to compile: FEXLogServer/Main.cpp:125:24: error: no matching member function for call to 'erase'
<HdkR> That's pretty bad if it can't even find the erase function from std::unordered_map
<icecream95> Seems that it thinks the FDs set is const
<HdkR> That's weirdly broken
pi_ has joined #panfrost
anarsoul|2 has joined #panfrost
anarsoul has quit [Read error: Connection reset by peer]
<icecream95> __erase_nodes_if(_Container& __cont, const _UnsafeContainer& __ucont, _Predicate __pred)
<icecream95> The header seems to be broken, __ucont shouldn't be const
* icecream95 changes the header to always use __cont rather than __ucont
<icecream95> Hmm.. the bug appears to still be in GCC upstream?
<icecream95> And remove_if is also broken, but in a different way?
MajorBiscuit has joined #panfrost
MajorBiscuit has quit []
MajorBiscuit has joined #panfrost
MajorBiscuit has quit []
* icecream95 instead changes the code to explicitly iterate over PIDs with a for loop
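A minimal sketch of the kind of workaround described above, with a hypothetical fd-to-pid map and liveness predicate standing in for the real FEXLogServer data (this is not actual FEX code): rather than std::erase_if, whose libstdc++ helper was rejecting the container as const in this environment, iterate explicitly and erase through the iterator that erase() returns.

    #include <unordered_map>

    // Hypothetical fd -> pid tracking map, standing in for the FDs set
    // discussed above; not the real FEX data structure.
    static std::unordered_map<int, int> FDToPID;

    // Hypothetical placeholder predicate for "this client went away".
    static bool IsDead(int pid) { return pid < 0; }

    static void ReapDeadClients() {
        for (auto it = FDToPID.begin(); it != FDToPID.end(); ) {
            if (IsDead(it->second))
                it = FDToPID.erase(it); // erase() returns the next valid iterator
            else
                ++it;
        }
    }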
icecream95 has quit [Ping timeout: 480 seconds]
MajorBiscuit has joined #panfrost
rkanwal has joined #panfrost
nlhowell has quit [Ping timeout: 480 seconds]
icecream95 has joined #panfrost
<icecream95> HdkR: openssl sha256 is 20x slower than native, FEX is so terrible no one should use it! /s
<icecream95> (Next release: algebraic optimisations replace code implementing SHA with the corresponding AArch64 instruction)
<HdkR> icecream95: Did you use latest main?
<HdkR> icecream95: Also is your CPU new enough to support RCPC? Also did you make sure to compile Release + LTO?
nlhowell has joined #panfrost
<HdkR> Also if your native CPU supports the SHA256 extensions it is a bit unfair since FEX doesn't support the SHA256 x86 instructions yet :P
Daanct12 has quit [Ping timeout: 480 seconds]
icecream95 has quit [Ping timeout: 480 seconds]
icecream95 has joined #panfrost
<icecream95> HdkR: It's not so bad for other programs, I deliberately chose an unfair test
<HdkR> hah
<HdkR> In main we just recently fixed RCPC which can make things slightly faster, but it also makes the non-TSO path more stable. Can give like a free 3x perf boost if the program is sane.
<icecream95> Cool, so 1.5x native performance!
<HdkR> You say that, but we've actually had some instances where we were running emulated code faster than native.
nlhowell is now known as Guest298
nlhowell has joined #panfrost
Guest298 has quit [Ping timeout: 480 seconds]
nlhowell is now known as Guest300
nlhowell has joined #panfrost
Guest300 has quit [Ping timeout: 480 seconds]
nlhowell has quit [Ping timeout: 480 seconds]
pjakobsson has quit [Remote host closed the connection]
<alyssa> HdkR: :D
<alyssa> I can see that if the native code was built with -O0 or something :p
<robmur01> nah, it's like a return to the mid-'90s when people were buying DEC Alphas to run their 16-bit x86 code under Windows faster than Pentiums could :D
<macc24> o_o
<icecream95> Looking through an old chroot from 2017... Anyone here remember the chromebook-setup.sh script for installing a blobby debian on the old Samsung chromebooks and C201?
<icecream95> I installed Debian Jessie for that because I thought that 4.9 (which the script would download if cross-compiling) was the only GCC version which could compile the kernel
<macc24> icecream95: blobby debian?
<macc24> 'old samsung chromebooks'... exynos machines?
<alyssa> icecream95: ....snow?
<alyssa> I have heard horrors about that machine
<icecream95> alyssa: Exynos 5250, 5420, 5422, which includes snow
<alyssa> Scary
<icecream95> macc24: Nope, that's diff... wait a second that looks very similar, it must be a newer version of the script
<icecream95> But it's $THE_CURRENT_YEAR, why is the script still being updated?
<macc24> icecream95: because people maintain their stuff
<macc24> *hides her stuff*
<macc24> usually
<alyssa> icecream95: by the way, I spent most of last week thinking through what we want long-term out of bifrost/valhall RA
<alyssa> I'm still on the fence
<icecream95> macc24: Wait that script doesn't use the blob, so I guess it's fine
<macc24> icecream95: what blob?
* icecream95 pushes alyssa off the fence
<alyssa> In the short term, I think I convinced myself I want the nodearray patches, at least in the short/medium term
<icecream95> macc24: mali
* macc24 shivers
<alyssa> (At least for the interference matrix. I haven't thought through the liveness analysis side of it.)
<alyssa> graph colouring sucks, linear scan sucks, lcra sucks, ssa-based ra sucks, it all sucks
<alyssa> Pick a poison..
<icecream95> I think there were fairly equal performance gains from both interference and liveness
<alyssa> I'd believe it, just haven't thought through it yet
<icecream95> (Which means that both together provide a far larger relative speedup than only one)
<alyssa> right
<alyssa> these are both cases where tree scan is an unequivocally better algorithm
<HdkR> alyssa: It's mostly because ARM compiled code is significantly less aggressive at loop unrolling as far as I can tell.
<macc24> HdkR: because arm chips tend to have lower cache sizes, right?
<alyssa> (there's no explicit interference graph, and in theory there's no explicit liveness tracking like we have to do with graph colouring / LCRA)
<alyssa> Of course, tree scan brings a laundry list of its own problems.
<alyssa> (All of which have solutions, but complex ones.)
<icecream95> What sort of problems are there?
<alyssa> 1. Live range splitting is essentially non-optional. It's not 'hard' but it's complicated.
<alyssa> 2. Better spilling is also essentially non-optional. We'd want this anyway, but it seems complicated (at least the ir3 impl does ...)
<alyssa> 3. The IR needs to be scalarized (no more `bi_word`, instead SPLIT and COLLECT pseudoinstructions). This wouldn't be so bad, except...
<alyssa> 4. Splits need to be coalesced. On paper they can /always/ be coalesced. But the obvious design of a tree scan allocator won't do so. (ir3 has a very complicated solution for this. I have a prototype bifrost RA with a very simple one, but it's unclear whether mine could support live range splitting so that might not be useful.)
<alyssa> 5. Collects need to be coalesced. This isn't always possible, it should be clear why. So you need some decent heuristic. Similar issues as the splits.
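For context, a minimal sketch of the "scan" half of an SSA-based (tree-scan style) allocator, restricted to one straight-line block of scalar values: registers are released at each value's last use and handed to the next definition, so no interference matrix or separate liveness data structure is needed. It ignores control flow, spilling, collects/vectors, and register classes, and every name in it is hypothetical rather than Mesa/ir3 code.

    #include <cassert>
    #include <map>
    #include <optional>
    #include <vector>

    struct Instr {
        int def = -1;          // SSA value defined by this instruction, -1 if none
        std::vector<int> uses; // SSA values read (assumed defined earlier in the block)
    };

    // Assign a register to every SSA definition in a straight-line block.
    std::map<int, unsigned> scan_block(const std::vector<Instr> &block,
                                       unsigned num_regs)
    {
        // Index of the last instruction reading each value (its death point).
        std::map<int, size_t> last_use;
        for (size_t i = 0; i < block.size(); ++i)
            for (int v : block[i].uses)
                last_use[v] = i;

        std::vector<bool> in_use(num_regs, false);
        std::map<int, unsigned> assignment;

        for (size_t i = 0; i < block.size(); ++i) {
            // Sources dying here release their registers first, so the
            // definition below may reuse one of them (cheap coalescing).
            for (int v : block[i].uses)
                if (last_use[v] == i)
                    in_use[assignment.at(v)] = false;

            if (block[i].def >= 0) {
                std::optional<unsigned> reg;
                for (unsigned r = 0; r < num_regs; ++r)
                    if (!in_use[r]) { reg = r; break; }
                assert(reg && "a real allocator would spill here");
                in_use[*reg] = true;
                assignment[block[i].def] = *reg;
            }
        }
        return assignment;
    }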
<alyssa> ir3 is a state-of-the-art RA ... but its RA only is like 5kloc
<alyssa> s/only/alone/
<alyssa> ACO fails to coalesce some things even in easy cases and its RA isn't much simpler.
<alyssa> it's slightly unfair because we should include nir_convert_from_ssa in our accounting too
<alyssa> but even with that, our RA is, what, 1500 lines? and works reasonably well?
<icecream95> So... after trying to implement it yourself, you came to the expected conclusion that SSA RA takes a lot of code? :P
<alyssa> 2000 with nodearray stuff?
<alyssa> icecream you know how stubborn I am ;p
<icecream95> You think you're stubborn? :P
<alyssa> TBH the more bothersome part is how little of this code can be shared between backends
<alyssa> take parallel copy lowering
<alyssa> (500 lines or so in ir3)
<alyssa> The core algorithm is the same for everyone and can be shared, I tried this
<alyssa> but the details are machine-specific (where does bifrost's fp16vec2 fit in? or ir3 sct/gat instructions? or aco's baffling subword shuffle hacks?) and not obvious how to generalize
<alyssa> and there's no good way to share a backend IR across backends, given how "weird" GPUs are
<alyssa> (both the biggest advantage LLVM has over NIR, and the biggest problem with LLVM for GPUs)
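As a rough illustration of the machine-independent core being referred to (not the ir3 or Bifrost implementation; names are hypothetical and printf stands in for emitting real moves): sequentializing a parallel copy means ordering the moves so that no still-needed source is clobbered, and breaking whatever cycles remain through a scratch register.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Copy { unsigned dst, src; }; // dst := src, all dsts distinct

    // Emit a sequence of moves equivalent to performing all copies at once.
    // Assumes `scratch` is a register not mentioned in the copy list.
    void lower_parallel_copy(std::vector<Copy> copies, unsigned scratch)
    {
        // Self-copies are no-ops.
        copies.erase(std::remove_if(copies.begin(), copies.end(),
                                    [](const Copy &c) { return c.dst == c.src; }),
                     copies.end());

        while (!copies.empty()) {
            bool progress = false;

            // A copy whose destination is not still needed as a source is safe.
            for (size_t i = 0; i < copies.size(); ++i) {
                bool dst_still_read = false;
                for (const Copy &c : copies)
                    if (c.src == copies[i].dst)
                        dst_still_read = true;

                if (!dst_still_read) {
                    printf("mov r%u, r%u\n", copies[i].dst, copies[i].src);
                    copies.erase(copies.begin() + i);
                    progress = true;
                    break;
                }
            }

            // Only cycles remain: save one destination to the scratch register,
            // perform that copy, and retarget whoever still reads the old value.
            if (!progress) {
                Copy c = copies.back();
                copies.pop_back();
                printf("mov r%u, r%u\n", scratch, c.dst);
                printf("mov r%u, r%u\n", c.dst, c.src);
                for (Copy &other : copies)
                    if (other.src == c.dst)
                        other.src = scratch;
            }
        }
    }

The machine-specific details mentioned above are exactly what "mov" and "scratch" hide here: bifrost's fp16vec2 halves, ir3's sct/gat instructions, ACO's subword shuffles, and so on.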
<icecream95> i know, what if we made GPU vendors standardise on a common ISA?
<alyssa> oo i know we should make all GPUs have SPIR-V as their common ISA
<alyssa> and then we don't need NIR
<icecream95> ("Then they wouldn't be able to vendor-lockin their tools")
<alyssa> because vulkan gives you spir-v so the hardware should take spir-v, and everything works so good
<alyssa> so yeah. spir-v hardware for everyone!
<icecream95> Uh... err... like spir-v is great... but are you going to make the GPU parse struct definitions and so forth?
<alyssa> yes obviously
<alyssa> i mean, errrr, what are structs?
<icecream95> OpTypeStruct: Declare a new structure type.
<robmur01> pff, if it was good enough for iAPX 432, it's surely good enough for GPUs :P
<icecream95> Hmm.. did iAPX 432 use capabilities? It only took 40 years for that to make a comeback
<icecream95> IBM removed capability support because "they could find no way to revoke capabilities". ISTR that CHERI has quite a complex scheme for managing that
<alyssa> eh, the new isa seems great
<alyssa> capabilities are just the cheri on top
<icecream95> What does this remind me of? "The instruction set also used bit-aligned variable-length instructions"
<alyssa> Oh no
<icecream95> But this is far worse; instructions range from 6 to 321 bits
<alyssa> O_O
<icecream95> Fun... I'm running the samples from the old Mali GLES SDK for I think the first time ever, and of course I've hit a Panfrost bug already
<alyssa> You're good at hitting Panfrost bugs :-P
<icecream95> And now there is a bug where an SDK function is clashing with one from C++17
<icecream95> Ooh fun, another Panfrost bug
* icecream95 thinks that it is maybe time to sleep
<alyssa> night :)
<alyssa> is this true?
* robmur01 keeps foolishly trying glmark as a sanity-check workload for horrible DMA stuff on a 5.10 kernel... such bug, so crash, etc.
<icecream95> g'night
icecream95 has left #panfrost [rcirc on GNU Emacs 28.1]
<q4a> Hi. I'm testing panfrost from Mesa 22. In glxinfo -B I can see: OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.10 , but OpenGL core profile shading language version string: 1.40
<q4a> 1.40 is a bit lower than I thought. Is this my mistake, or does panfrost not support GLSL higher than 1.40?
alyssa has quit [Quit: leaving]
<HdkR> q4a: Watch out for ESSL versus GLSL
<HdkR> ESSL will end up being 3.10
<HdkR> While ESSL 3.10 matches GLES 3.1, GLSL 1.40 matches GL 3.0
<q4a> yes, sorry, my mistake
<q4a> Thank you!
<HdkR> oop, GLSL 1.4 is GL 3.1, misremembered a version :)
greenjustin has joined #panfrost
<jekstrand> alarumbe: Generally, yes. I've not looked into the specific problem you're hitting, though.
alyssa has joined #panfrost
<alyssa> ...and it turns out my "cute" workaround for a valhall thing doesn't work because of spilling
<alyssa> lovely
guillaume_g has quit []
* alyssa respins XFB series
<alyssa> Guess I'm rewriting lower_fragcolor then
MajorBiscuit has quit [Ping timeout: 480 seconds]
nlhowell has joined #panfrost
nlhowell has quit [Ping timeout: 480 seconds]
<alyssa> deqp what do you want from me
<alyssa> why does deqp-runner think every test is failing
<alyssa> but only if nir_remove_dead_variables is called early?!
<alyssa> traces are the same
<alyssa> the heck?
<alyssa> everything is passed when run in series
* alyssa tries single threading deqp-runner
<alyssa> no, still dies when single threaded
<robclark> something somehow test order dependent?
<alyssa> robclark: dumber... I think I was forgetting to override libgl_drivers_path and getting (conformant) system mesa instead ;-p
<robclark> ahh :-P
anholt has joined #panfrost
Daanct12 has joined #panfrost
<alyssa> Pass: 17968, Fail: 28, Crash: 1, Skip: 19794, Duration: 10:58, Remaining: 0
<alyssa> uh, close
icecream95 has joined #panfrost
alpernebbi has quit [Ping timeout: 480 seconds]
alpernebbi has joined #panfrost
<icecream95> TIL that Intel made SBCs: "iSBC 432/100 single-board computer"
<macc24> icecream95: remember intel edison?
rkanwal has quit [Quit: rkanwal]
<alarumbe> hi alyssa
<alarumbe> would you mind having a quick glance at https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14034
<alyssa> alarumbe: a cycle isn't necessarily wrong
<alyssa> bbrezillon: had a cool use for one, implementing indirect multidraw
<alyssa> (The code was scary, but worked AFAIK)
<alyssa> alarumbe: Largely I'm fine with the code there, but the kernel side needs to be merged first
<alarumbe> the problem is pandecode will go into an infinite loop and the crashdump analyser will never finish
<alyssa> Yes, I understand the problem
<alyssa> I think I'm ok with the solution, since no well behaved workload will hit that
<alarumbe> I think Boris told me a few months back I should get the mesa changes accepted first and then send the kernel patch
<alyssa> Ideally review of both happens together
<alarumbe> no worries then, I'll prepare the kernel patch and submit it asap
<alarumbe> I would've picked a less 'bulky' solution for storing the already visited BO job numbers but since the closest thing Mesa seems to have to a binary search tree is the rb tree API, I went for that instead
<alyssa> meh, it's not a hot path
<icecream95> alarumbe: It might be useful to have a way to replay the job on the GPU
<icecream95> I implemented that for my own dump format here: https://gitlab.com/panfork/mesa/-/blob/main/src/panfrost/lib/pan_replay.c
* alyssa wonders if this is an LCRA bug
<icecream95> alyssa: What's the LCRA bug?
<alyssa> icecream95: https://rosenzweig.io/hmm.txt
<alyssa> before/after shaders
<icecream95> hmm
<alyssa> missing interference on r0
<alyssa> (none of your patches are in this branch, it's my fault whatever it is ;) )
<alyssa> I'm more curious how we never hit this before
<icecream95> The bug only happens with message preloading, right? That's v7, so I couldn't have hit it myself until recently
<alyssa> It's surfacing because of a patch I added to make LCRA coalesce register moves
<alyssa> but it seems to be a very core defect
<alyssa> it doesn't matter if I'm yeeting LCRA anyway, I can drop that patch, but still.
* alyssa may or may not be yeeting LCRA. tbd.
<alyssa> this branch is still LCRA, but with a purer SSA
<icecream95> I think the fix would look similar to... oh wait there is no way to remove the move without reordering instructions
<icecream95> (I was going to suggest that it was similar to the case of function parameters)
<alyssa> Eeee
<icecream95> But with my calling convention I could just change the parameter registers to fix that
<alyssa> Splits/collect instead of offset and tightened rules for preloading. Notably no phi nodes here.
<alyssa> that bug notwithstanding, shader-db looks ok
<icecream95> ok means better than before the patch?
<alyssa> (The reduction in spills is from adding a workaround from my scheduling branch because LCRA is stupid.)
<alyssa> tree scan can chew thru the moves and was actually beating LCRA on instruction count
<alyssa> (In general, SSA RA needs to do a lot more coalescing than traditional RA. But it's also a lot better at coalescing than traditional RA.)
* icecream95 goes back to trying to solve accelerometer calibration equations
rasterman has quit [Quit: Gettin' stinky!]
* alyssa is super behind on email