ChanServ changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://oftc.irclog.whitequark.org/panfrost - <macc24> i have been here before it was popular
digetx has quit [Remote host closed the connection]
camus has quit [Remote host closed the connection]
camus has joined #panfrost
digetx has joined #panfrost
camus has quit [Remote host closed the connection]
camus has joined #panfrost
camus has quit [Remote host closed the connection]
camus has joined #panfrost
camus has quit [Remote host closed the connection]
camus has joined #panfrost
camus1 has joined #panfrost
camus has quit [Ping timeout: 480 seconds]
Daanct12 has joined #panfrost
camus1 has quit [Remote host closed the connection]
camus has joined #panfrost
camus has quit [Remote host closed the connection]
camus has joined #panfrost
Daanct12 has quit [Quit: Leaving]
Daanct12 has joined #panfrost
icecream95 has joined #panfrost
<alyssa>
ndufresne: I do believe the partial render mechanism is firmware (or even kernel) triggered, not hardware triggered
<alyssa>
so no fault needed
<alyssa>
I assume Apple has the relevant stuff in their stack
<alyssa>
but the partial render infrastructure is but a small piece of that puzzle
hexdump0815 has joined #panfrost
hexdump01 has quit [Ping timeout: 480 seconds]
Daaanct12 has joined #panfrost
Daanct12 has quit [Ping timeout: 480 seconds]
guillaume_g has joined #panfrost
<macc24>
ndufresne: i think there's a way
nlhowell has joined #panfrost
Daaanct12 is now known as Daanct12
Pu244 has quit [Ping timeout: 480 seconds]
rasterman has joined #panfrost
<icecream95>
HdkR: So now I have two different GCC versions installed to make compiling FEX with Clang work?
<HdkR>
icecream95: Hm?
<icecream95>
HdkR: Debian bookworm seems to be in the middle of moving from GCC 11 to 12, and so Clang needs C++ headers from GCC 12 but /usr/bin/gcc is still GCC 11
<HdkR>
Broken clang environment?
<icecream95>
And now it failed to compile: FEXLogServer/Main.cpp:125:24: error: no matching member function for call to 'erase'
<HdkR>
That's pretty bad if it can't even find the erase function from std::unordered_map
<icecream95>
Seems that it thinks the FDs set is const
<HdkR>
That's weirdly broken
pi_ has joined #panfrost
anarsoul|2 has joined #panfrost
anarsoul has quit [Read error: Connection reset by peer]
<icecream95>
The header seems to be broken, __ucont shouldn't be const
* icecream95
changes the header to always use __cont rather than __ucont
<icecream95>
Hmm.. the bug appears to still be in GCC upstream?
<icecream95>
And remove_if is also broken, but in a different way?
MajorBiscuit has joined #panfrost
MajorBiscuit has quit []
MajorBiscuit has joined #panfrost
MajorBiscuit has quit []
* icecream95
instead changes the code to explicitly iterate over PIDs with a for loop
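[editor's note] A minimal sketch of the workaround icecream95 describes: instead of relying on the library's erase helpers, iterate the container explicitly and use the iterator returned by erase(). The container shape (file descriptors keyed by PID) and the is_dead() predicate are illustrative assumptions, not FEX's actual code.

    #include <unordered_map>

    // Erase entries while iterating, without std::erase_if / remove_if.
    void prune_dead_pids(std::unordered_map<int, int>& fds_by_pid,
                         bool (*is_dead)(int pid)) {
        for (auto it = fds_by_pid.begin(); it != fds_by_pid.end(); ) {
            if (is_dead(it->first))
                it = fds_by_pid.erase(it); // erase() returns the next valid iterator
            else
                ++it;
        }
    }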
icecream95 has quit [Ping timeout: 480 seconds]
MajorBiscuit has joined #panfrost
rkanwal has joined #panfrost
nlhowell has quit [Ping timeout: 480 seconds]
icecream95 has joined #panfrost
<icecream95>
HdkR: openssl sha256 is 20x slower than native, FEX is so terrible no one should use it! /s
<icecream95>
(Next release: algebraic optimisations replace code implementing SHA with the corresponding AArch64 instruction)
<HdkR>
icecream95: Did you use latest main?
<HdkR>
icecream95: Also is your CPU new enough to support RCPC? Also did you make sure to compile Release + LTO?
nlhowell has joined #panfrost
<HdkR>
Also if your native CPU supports the SHA256 extensions it is a bit unfair since FEX doesn't support the SHA256 x86 instructions yet :P
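[editor's note] A minimal sketch of the capability check HdkR's point rests on, assuming Linux on AArch64: the kernel advertises the SHA2 crypto extension through the hwcap auxiliary vector.

    #include <sys/auxv.h>
    #include <asm/hwcap.h>
    #include <cstdio>

    int main() {
        // HWCAP_SHA2 is set by the kernel when the CPU implements the
        // AArch64 SHA-256 instructions (SHA256H and friends).
        unsigned long hwcap = getauxval(AT_HWCAP);
        std::printf("SHA2 instructions: %s\n",
                    (hwcap & HWCAP_SHA2) ? "available" : "not available");
        return 0;
    }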
Daanct12 has quit [Ping timeout: 480 seconds]
icecream95 has quit [Ping timeout: 480 seconds]
icecream95 has joined #panfrost
<icecream95>
HdkR: It's not so bad for other programs, I deliberately chose an unfair test
<HdkR>
hah
<HdkR>
In main we just recently fixed RCPC which can make things slightly faster, but it also makes the non-TSO path more stable. Can give like a free 3x perf boost if the program is sane.
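[editor's note] To illustrate why RCPC matters here: x86 loads carry acquire-like ordering under TSO, so an emulator must give emulated loads at least acquire semantics, and the RCPC extension (LDAPR) makes acquire loads much cheaper than a full barrier. A hedged sketch of the principle, not FEX's actual codegen:

    #include <atomic>

    // Emulating an x86 load under TSO needs acquire ordering; with RCPC
    // available, compilers can lower this to a single LDAPR instruction
    // instead of LDR plus a barrier.
    int load_like_x86(const std::atomic<int>& v) {
        return v.load(std::memory_order_acquire);
    }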
<icecream95>
Cool, so 1.5x native performance!
<HdkR>
You say that, but we've actually had some instances where we were running emulated code faster than native.
nlhowell is now known as Guest298
nlhowell has joined #panfrost
Guest298 has quit [Ping timeout: 480 seconds]
nlhowell is now known as Guest300
nlhowell has joined #panfrost
Guest300 has quit [Ping timeout: 480 seconds]
nlhowell has quit [Ping timeout: 480 seconds]
pjakobsson has quit [Remote host closed the connection]
<alyssa>
HdkR: :D
<alyssa>
I can see that if the native code was built with -O0 or something :p
<robmur01>
nah, it's like a return to the mid-'90s when people were buying DEC Alphas to run their 16-bit x86 code under Windows faster than Pentiums could :D
<macc24>
o_o
<icecream95>
Looking through an old chroot from 2017... Anyone here remember the chromebook-setup.sh script for installing a blobby debian on the old Samsung chromebooks and C201?
<icecream95>
I installed Debian Jessie for that because I thought that 4.9 (which the script would download if cross-compiling) was the only GCC version which could compile the kernel
<icecream95>
macc24: Nope, that's diff... wait a second that looks very similar, it must be a newer version of the script
<icecream95>
But it's $THE_CURRENT_YEAR, why is the script still being updated?
<macc24>
icecream95: because people maintain their stuff
<macc24>
*hides her stuff*
<macc24>
usually
<alyssa>
icecream95: by the way, I spent most of last week thinking through what we want long-term out of bifrost/valhall RA
<alyssa>
I'm still on the fence
<icecream95>
macc24: Wait that script doesn't use the blob, so I guess it's fine
<macc24>
icecream95: what blob?
* icecream95
pushes alyssa off the fence
<alyssa>
In the short term, I think I convinced myself I want the nodearray patches, at least in the short/medium term
<icecream95>
macc24: mali
* macc24
shivers
<alyssa>
(At least for the interference matrix. I haven't thought through the liveness analysis side of it.)
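[editor's note] For context, a dense interference matrix of the kind the nodearray patches target can be as simple as one bit per unordered pair of RA nodes. A toy sketch with illustrative names, not Mesa's actual API:

    #include <vector>
    #include <cstdint>

    struct InterferenceMatrix {
        unsigned n;                 // number of RA nodes
        std::vector<uint8_t> bits;  // one bit per unordered node pair

        explicit InterferenceMatrix(unsigned nodes)
            : n(nodes), bits(((size_t)nodes * (nodes - 1) / 2 + 7) / 8, 0) {}

        // Lower-triangle index; requires i > j.
        static size_t index(unsigned i, unsigned j) {
            return (size_t)i * (i - 1) / 2 + j;
        }

        void add(unsigned a, unsigned b) {
            if (a == b) return;
            size_t k = index(a > b ? a : b, a > b ? b : a);
            bits[k / 8] |= 1u << (k % 8);
        }

        bool interferes(unsigned a, unsigned b) const {
            if (a == b) return false;
            size_t k = index(a > b ? a : b, a > b ? b : a);
            return bits[k / 8] & (1u << (k % 8));
        }
    };

The flat bit array keeps the whole triangle in one allocation, which is why this layout tends to beat pointer-chasing structures for interference queries.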
<alyssa>
graph colouring sucks, linear scan sucks, lcra sucks, ssa-based ra sucks, it all sucks
<alyssa>
Pick a poison..
<icecream95>
I think there were fairly equal performance gains from both interference and liveness
<alyssa>
I'd believe it, just haven't thought through it yet
<icecream95>
(Which means that both together provide a far larger relative speedup than either alone)
<alyssa>
right
<alyssa>
these are both cases where tree scan is an unequivocally better algorithm
<HdkR>
alyssa: It's mostly because ARM-compiled code is significantly less aggressive about loop unrolling, as far as I can tell.
<macc24>
HdkR: because arm chips tend to have lower cache sizes, right?
<alyssa>
(there's no explicit interference graph, and in theory there's no explicit liveness tracking like we have to do with graph colouring / LCRA)
<alyssa>
Of course, tree scan brings a laundry list of its own problems.
<alyssa>
(All of which have solutions, but complex ones.)
<icecream95>
What sort of problems are there?
<alyssa>
1. Live range splitting is essentially non-optional. It's not 'hard' but it's complicated.
<alyssa>
2. Better spilling is also essentially non-optional. We'd want this anyway, but it seems complicated (at least the ir3 impl does ...)
<alyssa>
3. The IR needs to be scalarized (no more `bi_word`, instead SPLIT and COLLECT pseudoinstructions). This wouldn't be so bad, except...
<alyssa>
4. Splits need to be coalesced. On paper they can /always/ be coalesced. But the obvious design of a tree scan allocator won't do so. (ir3 has a very complicated solution for this. I have a prototype bifrost RA with a very simple one, but it's unclear whether mine could support live range splitting so that might not be useful.)
<alyssa>
5. Collects need to be coalesced. This isn't always possible (it should be clear why), so you need some decent heuristic. Similar issues as with the splits.
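[editor's note] To make point 3 concrete: in a scalarised IR, a vector ALU op becomes scalar ops bracketed by SPLIT and COLLECT pseudoinstructions, so RA only ever sees scalar nodes (and points 4 and 5 are about making those pseudoinstructions free again). A toy encoding with made-up value numbers, not bifrost's actual IR:

    #include <vector>
    #include <cstdio>

    enum class Op { SPLIT, COLLECT, FADD };

    struct Instr {
        Op op;
        std::vector<int> defs;  // SSA values written
        std::vector<int> srcs;  // SSA values read
    };

    // Before: v2 = fadd v0.xy, v1.xy   (one vec2 node for RA)
    // After scalarisation (v0 = {s10,s11}, v1 = {s12,s13}, v2 = {s14,s15}):
    std::vector<Instr> scalarised = {
        {Op::SPLIT,   {10, 11}, {0}},       // split v0 into components
        {Op::SPLIT,   {12, 13}, {1}},       // split v1 into components
        {Op::FADD,    {14},     {10, 12}},  // x component
        {Op::FADD,    {15},     {11, 13}},  // y component
        {Op::COLLECT, {2},      {14, 15}},  // regroup the result as v2
    };

    int main() {
        std::printf("%zu instructions after scalarisation\n", scalarised.size());
    }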
<alyssa>
ir3 is a state-of-the-art RA ... but its RA only is like 5kloc
<alyssa>
s/only/alone/
<alyssa>
ACO fails to coalesce some things even in easy cases and its RA isn't much simpler.
<alyssa>
it's slightly unfair because we should include nir_convert_from_ssa in our accounting too
<alyssa>
but even with that, our RA is, what, 1500 lines? and works reasonably well?
<icecream95>
So... after trying to implement it yourself, you came to the expected conclusion that SSA RA takes a lot of code? :P
<alyssa>
2000 with nodearray stuff?
<alyssa>
icecream you know how stubborn I am ;p
<icecream95>
You think you're stubborn? :P
<alyssa>
TBH the more bothersome part is how little of this code can be shared between backends
<alyssa>
take parallel copy lowering
<alyssa>
(500 lines or so in ir3)
<alyssa>
The core algorithm is the same for everyone and can be shared, I tried this
<alyssa>
but the details are machine-specific (where does bifrost's fp16vec2 fit in? or ir3 sct/gat instructions? or aco's baffling subword shuffle hacks?) and not obvious how to generalize
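[editor's note] The shareable core alyssa refers to is small: sequentialise a parallel copy with plain moves, breaking cycles through one scratch register. A machine-independent sketch under those assumptions; the machine-specific cases listed above (fp16vec2, sct/gat, subword shuffles) are exactly what it leaves out:

    #include <vector>
    #include <map>
    #include <functional>
    #include <utility>

    // copies: {dst, src} pairs with distinct destinations.
    // scratch: a register guaranteed not to appear in `copies`.
    void lower_parallel_copy(std::vector<std::pair<int, int>> copies,
                             int scratch,
                             const std::function<void(int, int)>& emit_move) {
        std::map<int, int> pending;  // dst -> src
        std::map<int, int> readers;  // how many pending copies read each reg
        for (auto [dst, src] : copies) {
            if (dst == src) continue;          // self-copies are no-ops
            pending[dst] = src;
            readers[src]++;
        }

        while (!pending.empty()) {
            bool progress = false;
            for (auto it = pending.begin(); it != pending.end(); ) {
                auto [dst, src] = *it;
                if (readers[dst] == 0) {       // nothing still reads dst: safe
                    emit_move(dst, src);
                    readers[src]--;
                    it = pending.erase(it);
                    progress = true;
                } else {
                    ++it;
                }
            }
            if (!progress) {
                // Only cycles remain: park one value in the scratch register
                // and redirect its readers there, which breaks the cycle.
                int dst = pending.begin()->first;
                emit_move(scratch, dst);
                for (auto& entry : pending)
                    if (entry.second == dst) entry.second = scratch;
                readers[scratch] = readers[dst];
                readers[dst] = 0;
            }
        }
    }

For a swap {a:=b, b:=a} this emits scratch:=a; a:=b; b:=scratch, and for acyclic chains it emits plain moves in dependency order.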
<alyssa>
and there's no good way to share a backend IR across backends, given how "weird" GPUs are
<alyssa>
(both the biggest advantage LLVM has over NIR, and the biggest problem with LLVM for GPUs)
<icecream95>
i know what if we made GPU vendors standardise on a common ISA?
<alyssa>
oo i know we should make all GPUs have SPIR-V as their common ISA
<alyssa>
and then we don't need NIR
<icecream95>
("Then they wouldn't be able to vendor-lockin their tools")
<alyssa>
because vulkan gives you spir-v so the hardware should take spir-v, and everything works so good
<alyssa>
so yeah. spir-v hardware for everyone!
<icecream95>
Uh... err... like spir-v is great... but are you going to make the GPU parse struct definitions and so forth?
<alyssa>
yes obviously
<alyssa>
i mean, errrr, what are structs?
<icecream95>
OpTypeStruct: Declare a new structure type.
<robmur01>
pff, if it was good enough for iAPX 432, it's surely good enough for GPUs :P
<icecream95>
Hmm.. did iAPX 432 use capabilities? It only took 40 years for that to make a comeback
<icecream95>
IBM removed capability support because "they could find no way to revoke capabilities". ISTR that CHERI has quite a complex scheme for managing that
<alyssa>
eh, the new isa seems great
<alyssa>
capabilities are just the cheri on top
<icecream95>
What does this remind me of? "The instruction set also used bit-aligned variable-length instructions"
<alyssa>
Oh no
<icecream95>
But this is far worse; instructions range from 6 to 321 bits
<alyssa>
O_O
<icecream95>
Fun... I'm running the samples from the old Mali GLES SDK for I think the first time ever, and of course I've hit a Panfrost bug already
<alyssa>
You're good at hitting Panfrost bugs :-P
<icecream95>
And now there is a bug where an SDK function is clashing with one from C++17
<icecream95>
Ooh fun, another Panfrost bug
* icecream95
thinks that it is maybe time to sleep
* robmur01
keeps foolishly trying glmark as a sanity-check workload for horrible DMA stuff on a 5.10 kernel... such bug, so crash, etc.
<icecream95>
g'night
icecream95 has left #panfrost [rcirc on GNU Emacs 28.1]
<q4a>
Hi. I'm testing panfrost from Mesa 22. In glxinfo -B I can see: OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.10, but OpenGL core profile shading language version string: 1.40
<q4a>
1.40 is a bit lower than I thought. Is this my mistake, or does panfrost not support GLSL higher than 1.40?
<alyssa>
alarumbe: a cycle isn't necessarily wrong
<alyssa>
bbrezillon: had a cool use for one, implementing indirect multidraw
<alyssa>
(The code was scary, but worked AFAIK)
<alyssa>
alarumbe: Largely I'm fine with the code there, but the kernel side needs to be merged first
<alarumbe>
the problem is pandecode will go into an infinite loop and the crashdump analyser will never finish
<alyssa>
Yes, I understand the problem
<alyssa>
I think I'm ok with the solution, since no well behaved workload will hit that
<alarumbe>
I think Boris told me a few months back I should get the mesa changes accepted first and then send the kernel patch
<alyssa>
Ideally review of both happens together
<alarumbe>
no worries then, I'll prepare the kernel patch and submit it asap
<alarumbe>
I would've picked a less 'bulky' solution for storing the already-visited BO job numbers, but since the closest thing Mesa seems to have to a binary search tree is the rb-tree API, I went with that instead
<alyssa>
meh, it's not a hot path
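[editor's note] A minimal sketch of the visited-set guard alarumbe describes, with std::set standing in for Mesa's rb-tree API: record each job index as it is decoded and bail out the first time one repeats, so a cyclic job chain can't loop the decoder forever.

    #include <set>
    #include <cstdio>

    struct Job { unsigned index; const Job* next; };

    void decode_job_chain(const Job* job) {
        std::set<unsigned> visited;  // indices of jobs already decoded
        while (job) {
            if (!visited.insert(job->index).second) {
                std::printf("cycle at job %u, stopping decode\n", job->index);
                return;
            }
            // ... decode this job's descriptors here ...
            job = job->next;
        }
    }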
<icecream95>
alarumbe: It might be useful to have a way to replay the job on the GPU