ChanServ changed the topic of #asahi-gpu to: Asahi Linux: porting Linux to Apple Silicon macs | GPU / 3D graphics stack black-box RE and development (NO binary reversing) | Keep things on topic | GitHub: https://alx.sh/g | Wiki: https://alx.sh/w | Logs: https://alx.sh/l/asahi-gpu
<jannau>
randomly while experimenting with yuv overlays in dcp.py
Dcow has quit [Remote host closed the connection]
Dcow has joined #asahi-gpu
<alyssa>
Ah lol nice
Etrien has quit [Ping timeout: 480 seconds]
<alyssa>
Is there any Linux userspace that can use YUV overlays these days? :(
<jannau>
same flag needs to be set for yuv overlays
<jannau>
wayland compositors are supposed to use them for video playback via the Linux DMA-Buf protocol
<daniels>
alyssa: yeah, weston
amarioguy has joined #asahi-gpu
Dcow has quit [Remote host closed the connection]
Dcow has joined #asahi-gpu
Dcow has quit [Ping timeout: 480 seconds]
Dcow has joined #asahi-gpu
chadmed has joined #asahi-gpu
Dcow has quit [Ping timeout: 480 seconds]
<phire>
lina: so I got mesa running under osx on my M1 Max. Seems to work just fine without your patch that hacks in the acceleration buffers
<phire>
so that might be an Ultra "two dies" only change
<phire>
I should get this running under hypervisor. I suspect those other buffers will also be absent
<phire>
Theory: however Apple is sharing 3D workloads across the two dies, it might only work for compressed framebuffers?
Etrien has joined #asahi-gpu
<alyssa>
s/framebuffer/depth stencil/, lina's patch doesn't touch the colour attachments
<alyssa>
and that's entirely possible, there are lots of funny limitations here
<phire>
yeah, I found that weird. It was only touching depth+stencil
<alyssa>
Note that there is dedicated depth/stencil hardware (the ZLS)
<alyssa>
there is not dedicated colour buffer hardware
Dcow has joined #asahi-gpu
<alyssa>
the closest equivalent is the PBE which handles image writes in general
<alyssa>
but the ZLS knows what depth buffers and stencil buffers are,
<alyssa>
the PBE just writes images
Etrien__ has joined #asahi-gpu
Etrien has quit [Ping timeout: 480 seconds]
Dcow has quit [Ping timeout: 480 seconds]
<phire>
I guess all the depth stuff is already fixed-function anyway, due to the deferred shading nature
<phire>
ZLS = Z load store?
<alyssa>
Presumably
chadmed has quit [Quit: Konversation terminated!]
<lina>
phire: It "works" but it also faults sometimes. I did get frames out of it without the patch, but the hypervisor showed GPU faults.
<phire>
oh, point. I should chuck this into the hypervisor later
<lina>
Keep in mind that the depth buffer, unless the app really needs it, is only written to on TVB overflows
<lina>
And those go away after a few frames
<alyssa>
02:10 < lina> Keep in mind that the depth buffer, unless the app really needs it, is only written to on TVB overflows
<alyssa>
lina: sorry what?
<lina>
If the app isn't actually reading out the depth buffer there is no need to flush it out to RAM
<lina>
With my bunny renders, the depth buffer is unwritten if I make the TVB large enough that it can render in a single pass; I only get data there if it ends up splitting due to a small TVB
<alyssa>
Hm.
<alyssa>
02:11 < lina> If the app isn't actually reading out the depth buffer there is no need to flush it out to RAM
<alyssa>
That's an optimization that gallium/asahi isn't doing yet, though.
<alyssa>
With current Mesa, the intention is to always flush to RAM
<lina>
It sure is with my bunny framedumps...
<alyssa>
(yes I know that's wasteful)
<alyssa>
So I see two possibilities
<alyssa>
1. I screwed up Mesa and this isn't working on macOS, but in that case I'd expect to see more CTS fails
<alyssa>
2. Mesa is working ok, but your driver/Python isn't setting the right control register bits to make this work
<lina>
Mind you, the mesa I'm using for python testing is very outdated right now, so if you fixed this recently, that might explain it. Or I got the depth flags wrong.
<lina>
Or... hold on.
<phire>
or there are two depth buffers?
<lina>
Ohhh....
<phire>
final output, and in-progress?
<lina>
alyssa: The depth buffer *is* written unconditionally. However, it's written uncompressed!
<lina>
So only the intermediate flushes use the acceleration buffer!
<lina>
So you probably found the flag to make Z uncompressed on *final store*, and I am setting that right
<lina>
But you're still faulting on intermediate flushes, before the TVB grows in Mesa, without the metadata buffer
<lina>
And that's why it looks like it works
<lina>
You just fault for a few frames iff the geometry is complex enough
<alyssa>
Oh, geez
LunaFoxgirlVT has joined #asahi-gpu
<alyssa>
That seems plausible
<lina>
But I still haven't seen this on M1, only Ultra
<alyssa>
right
<lina>
So maybe intermediate Z flushes on Ultra using compression is a new G13X thing
<alyssa>
I could believe that
<alyssa>
Although it does seem odd to change any of this *within* a major arch,
<lina>
It could be a config change
<alyssa>
though maybe it's specced out for G13 but it's a model-specific feature bit
<lina>
Either global or per batch
<lina>
Let me look at what changed...
<phire>
there were a few unks that flipped from 0 to 1
<lina>
And at least a couple general bitmask like things with different values
<alyssa>
There's precedent for G13P missing capabilities that G13G has (I think)
<LunaFoxgirlVT>
^•w•^
<alyssa>
LunaFoxgirlVT: Nya!~
<LunaFoxgirlVT>
heyo
<phire>
hey LunaFoxgirlVT
<phire>
though I understand apple has a habit of just not enabling features that weren't ready in time, and never enabling them
<alyssa>
LunaFoxgirlVT: we haven't talked in a while but i wanted to let you know that my sister really likes inochi2d o:)
<LunaFoxgirlVT>
alyssa: こんこん
<phire>
but given the Pro, Max and Ultra all have the exact same gpu name (G13X), I suspect they share features
<LunaFoxgirlVT>
alyssa: I'm in call with her right now lol
<alyssa>
LunaFoxgirlVT: ^_^
<alyssa>
phire: G13X isn't the name
<alyssa>
the X is a wildcard
<alyssa>
G13D, G13S, and i forget the other one
<phire>
really?
<alyssa>
yes
<LunaFoxgirlVT>
Excited for the display port stuff to get merged some day so that I can see stuff on my mac mini conveniently lol
<LunaFoxgirlVT>
At some point I gotta start building and releasing arm linux releases of Inochi2D =w=
<LunaFoxgirlVT>
but it already runs on a raspberry pi 4 so it's a good start
<lina>
Also I just realized my bunny is very, very broken with TVB overflows...
<phire>
lina: did you push your changes to m1n1 somewhere?
<lina>
Not yet
<lina>
Need to clean it up
<phire>
yeah, figured
<phire>
I can probably check from macos if my m1 max is allocating those acceleration structures
<lina>
Yeah, they show up in the command buffers
<lina>
I also need to check again on the M1 Mini, since I didn't actually look there
<lina>
I just don't see them in any HV logs I have lying around
<phire>
guess that's a task for tomorrow's stream
<alyssa>
I am curious, is the value of zls_control different on g13x from g13g?
<lina>
I just got there w
<lina>
Setting zls_control differently fixes the bunny on TVB overflows
<lina>
Looks like it needs 0x44 ORed in
dviola has quit [Ping timeout: 480 seconds]
<lina>
(Still storing to the accel buffer either way though)
<alyssa>
Ok, I definitely don't see those bits on G13G
<alyssa>
0x44 in the RGX zls_control is "zloaden | forcezstore"
<lina>
Oh, I have logs with those flags in the past... but I don't recall ever taking logs on an M1 Ultra before?
<alyssa>
which is possible but maybe unlikely
<lina>
Interesting...
<alyssa>
I think 0x44 is just the general Z compression enable
<alyssa>
02:32 < phire> I can probably check from macos if my m1 max is allocating those acceleration structures
<lina>
And yet I still get it using the compression buffer without that
<alyssa>
phire: very curious about the answer here if you can manage to get a linear depth buffer
<lina>
It just breaks
<alyssa>
lina: \shrug/
<lina>
Also I found logs with 0x80044 that were definitely taken on an M1...
<alyssa>
the real solution here is to r/e compression on g13g and wire that up on g13g
<phire>
lina: is that cute flag demo somewhere?
<lina>
Don't think so w
<lina>
*And* those logs don't have the compression meta buffer on G13G
<lina>
No wait they do
<phire>
um... wat. is this just something that was always broken?
<lina>
OK, so this is related to compression... but then why is G13D enabling compression even without them, partially, then breaking?
<lina>
I see a 1:1 correlation between 0x44 and the accel/meta buffer existing
<phire>
also, who is forcing compression on? the kernel driver, or the agx firmware?
<lina>
In old G13G traces (that I now found - there are few of them)
<lina>
phire: Partially, even!
chadmed has joined #asahi-gpu
<alyssa>
02:44 < lina> I see a 1:1 correlation between 0x44 and the accel/meta buffer existing
<phire>
separate control over the TVB overflow and the final depth buffer is a useful feature
<alyssa>
Ok, that jibes with what i've seen on g
<lina>
Yeah, but not having 0x44 seems to break the overflow, like it stores compressed but loads uncompressed.
chadmed has quit []
<lina>
Actually this fits the image I'm seeing... I get lots of mini-tiles with stuff in the top left.
<lina>
That's... exactly what you would expect of compressed data loaded as uncompressed
<alyssa>
eyes emoji
<lina>
Maybe this is an outright bug? G13X always does partial depth stores compressed?
<LunaFoxgirlVT>
data has been squimshed (´・ω・)
<phire>
agx firmware bug? I'd be surprised if it was a hardware bug
<lina>
Firmware bug would be kind of weird... especially since it's a shared codebase
<phire>
but a hardware bug would be weirder. how would the PBE know the difference between a final store and a TVB flush store?
<phire>
unless... what if there is a hardware bug loading non-compressed depth/stencil buffers, and this is a workaround?
<lina>
But then why is *store* being forced on, but not *load*?
<phire>
hmmm.... weird
<phire>
also, this is a waste of memory, having separate buffers for the TVB overflows and the final depth
<lina>
They aren't separate, the extra buffer is the *metadata* buffer
<lina>
Which is needed for compression
<phire>
so that means the buffer has the same layout when compressed?
<lina>
The main buffer is the same, it just has compressed tiles and the metadata buffer tells it how they were compressed.
<lina>
The metadata buffer has one byte per 8x4 pixel block in the main buffer
<lina>
This is pretty typical framebuffer compression stuff
<phire>
ah, makes sense. so you can fall back to uncompressed if compression fails to produce a smaller result
<lina>
Yes, exactly
<lina>
You always need more space any time you use compression, by definition
<alyssa>
Now I want to know what macOS on m1 ultra does with the "do not compress depth" magic with metal
<lina>
FB compression isn't about size, it's about bandwidth, so the buffers are actually slightly larger (extra meta)
<alyssa>
I honestly can't remember how I got it do that
<phire>
and framebuffer compression *only* cares about reducing memory bandwidth, not buffer size
<alyssa>
something about rendering to a texture with usage Unknown and allowGPU = false
<lina>
What's a good macOS app to exercise the depth buffer? I just did boot->safari->youtube and there isn't a single render with depth...
<lina>
Some random webGL demo?
* lina
opens up WebGL Aquarium
<lina>
depth_flags = 0x5000000000c4154
<lina>
What the heck. Well, that's a new one.
<lina>
I see the stencil aux buffers too, so this was good to name those fields...
<lina>
Huh.
<lina>
alyssa: Remember how there is more than one depth buffer?
<lina>
depth_buffer_ptr1 = 0x1501c68000
<lina>
depth_buffer_ptr2 = 0x1501750000
<lina>
depth_buffer_ptr3 = 0x1501c68000
<lina>
Now I wonder if this is for separate buffers for load/partial/store
<lina>
And then similarly the meta buffers are separate
<lina>
Since some could be compressed and not others
<phire>
that's what I was already wondering. though why is it so keen to avoid writing to the store buffer until the end?
<phire>
oh... multisample resolve?
<lina>
It makes sense to compress the intermediate buffers but not the final ones if the app needs it for some reason
<lina>
And I guess under some conditions maybe you can't guarantee that the final buffer is "safe"? Like, the GPU can't trust it to read back during an internal reload?
<lina>
So you'd want to use a separate buffer from the user-provided buffer in that case
<phire>
yes, but if you are doing that, might as well use the same buffer for both partial and store. It's only if the final store buffer has a different layout for some reason that you need separate buffers
<phire>
safe? that's very paranoid. but I guess
<lina>
Yeah, so maybe if the app provides a custom layout Z buffer it can't use compression, and then you need a separate buffer for internal
<phire>
wait, these are all supplied from userspace, so app has control over the partial anyway
<lina>
Metal does
<lina>
Apps, not so much
<phire>
if apple think Metal is a safety boundary....
chadmed has joined #asahi-gpu
<lina>
I mean, none of this matters for safety
<lina>
If it's wrong the GPU will just fault
<lina>
(Fault handling being terrible is a separate issue)
<phire>
I just really doubt paranoia is the justification. different buffer layouts (store also resolving MSAA being the best example) are a much better reason
<alyssa>
03:02 < lina> Now I wonder if this for separate buffers for load/partial/store
<alyssa>
lina: Yes, that's it
<alyssa>
I thought I noted that down somewhere
<lina>
Come to think of it... I think aquarium is doing AA?
<alyssa>
yes
<lina>
Maybe not? Hard to tell with my cursed capture card
<phire>
could the ultra have forced MSAA on for some reason?
<lina>
Wait, does MSAA also need multiple samples in the depth buffer, or only color?
<phire>
yes, multiple depth samples
<alyssa>
yes
<lina>
OK, then it makes sense that it would have separate buffers
<alyssa>
i should sleep
<lina>
And then those extra mystery zls_control flags are probably related...
<phire>
depth-test and triangle edges are per sample. color is duplicated over all samples (masked by depth/stencil/rasterization)
<alyssa>
that means multiple samples in the tilebuffer but only one in memory
<alyssa>
which means 4x larger temp buffers for tvb overflow
<alyssa>
I'm curious if Apple does page fault tricks to get away without wasting that memory in the happy path
<lina>
No chance of that
<lina>
They don't have any kind of reasonable fault handling
<alyssa>
mali++
<lina>
There is an IRQ line that goes high on faults, but the firmware has a 250us timeout or something like that, so there's no way that would be reliable
<lina>
I guess tomorrow I should poke around those flags a bit and see what I find...
<phire>
even when not using the EXT_ms_rtt extension, that's still something a smart opengl implementation can infer from the commands and optimise for.
<phire>
I have no idea how smart apple's opengl implementation is, and I assume it's all very explicit in metal
<alyssa>
phire: no it's not?
<alyssa>
but yes metal builds in something like ms_rtt so safari should be using it fine
<phire>
"no it's not?" actually, you might be right. opengl makes it near impossible to tell the unresolved MSAA buffer won't be used again later
chadmed has quit [Quit: Konversation terminated!]
ella-0_ has joined #asahi-gpu
ella-0 has quit [Read error: Connection reset by peer]
Etrien__ has quit [Read error: Connection reset by peer]
Etrien has joined #asahi-gpu
Etrien has quit [Ping timeout: 480 seconds]
ckileeeeeeeeeeeeeeeeeeeeeeetb^ has joined #asahi-gpu
SSJ_GZ has joined #asahi-gpu
chadmed has joined #asahi-gpu
Etrien has joined #asahi-gpu
chadmed has quit [Quit: Konversation terminated!]
chadmed has joined #asahi-gpu
<phire>
m1 max seems to just set all those acceleration pointers as soon as the depth buffer is enabled
Etrien has quit [Read error: No route to host]
Etrien has joined #asahi-gpu
chadmed has quit [Quit: Konversation terminated!]
LunaFoxgirlVT has quit [Read error: Connection reset by peer]
Etrien has quit [Ping timeout: 480 seconds]
Etrien has joined #asahi-gpu
Etrien has quit [Read error: Connection reset by peer]
Etrien has joined #asahi-gpu
will[m] has quit [Write error: connection closed]
ckileeeeeeeeeeeeeeeeeeeeeeetb^ has quit [Remote host closed the connection]
Etrien__ has joined #asahi-gpu
Etrien has quit [Ping timeout: 480 seconds]
Etrien has joined #asahi-gpu
Etrien__ has quit [Ping timeout: 480 seconds]
le0n_ has joined #asahi-gpu
cr1901 has quit [Read error: Connection reset by peer]
Etrien__ has quit [Read error: Connection reset by peer]
Etrien has joined #asahi-gpu
DarkShadow44 has quit [Ping timeout: 480 seconds]
uur has joined #asahi-gpu
<phire>
I said this on Twitter but I'll repeat it here:
<phire>
Surely there is a way to (safely) prevent reordering: a compiler instruction reordering barrier?
<phire>
asm volatile("" ::: "memory") // for C
<phire>
std::sync::atomic::compiler_fence(Ordering::SeqCst) // for Rust
<phire>
Though, I do really like the "let's just do soft float" solution. People are too scared of soft float
uur has quit [Remote host closed the connection]
Etrien has quit [Ping timeout: 480 seconds]
<phire>
The performance isn't even that bad (when you inline it and avoid unpacking/packing into the proper ieee754 format)
SSJ_GZ has quit [Ping timeout: 480 seconds]
<phire>
Lina: consider using a 64-bit soft float working format: sign in bit 63, 8/16 bits of exponent starting at bit 32, and a 32-bit significand in bits 0-31
<phire>
Would allow you to pass it around (and even compare it) as if it were a 64-bit int, but the compiler can quickly unpack it into its components just by treating it as two 32-bit ints