ChanServ changed the topic of #asahi-gpu to: Asahi Linux: porting Linux to Apple Silicon macs | GPU / 3D graphics stack black-box RE and development (NO binary reversing) | Keep things on topic | GitHub: https://alx.sh/g | Wiki: https://alx.sh/w | Logs: https://alx.sh/l/asahi-gpu
<jannau> randomly while experimenting with yuv overlays in dcp.py
Dcow has quit [Remote host closed the connection]
Dcow has joined #asahi-gpu
<alyssa> Ah lol nice
Etrien has quit [Ping timeout: 480 seconds]
<alyssa> Is there any Linux userspace that can use YUV overlays these days? :(
<jannau> same flag needs to be set for yuv overlays
<jannau> Wayland compositors are supposed to use them for video playback via the Linux DMA-BUF protocol
<daniels> alyssa: yeah, weston
amarioguy has joined #asahi-gpu
Dcow has quit [Remote host closed the connection]
Dcow has joined #asahi-gpu
Dcow has quit [Ping timeout: 480 seconds]
Dcow has joined #asahi-gpu
chadmed has joined #asahi-gpu
Dcow has quit [Ping timeout: 480 seconds]
<phire> lina: so I got mesa running under osx on my M1 Max. Seems to work just fine without your patch that hacks in the acceleration buffers
<phire> so that might be an Ultra "two dies" only change
<phire> I should get this running under hypervisor. I suspect those other buffers will also be absent
<phire> Theory: However apple are sharing 3d workloads across the two dies, it might only work for compressed framebuffers?
Etrien has joined #asahi-gpu
<alyssa> s/framebuffer/depth stencil/, lina's patch doesn't touch the colour attachments
<alyssa> and that's entirely possible, there are lots of funny limitations here
<phire> yeah, I found that weird. It was only touching depth+stencil
<alyssa> Note that there is dedicated depth/stencil hardware (the ZLS)
<alyssa> there is not dedicated colour buffer hardware
Dcow has joined #asahi-gpu
<alyssa> the closest equivalent is the PBE which handles image writes in general
<alyssa> but the ZLS knows what depth buffers and stencil buffers are,
<alyssa> the PBE just writes images
Etrien__ has joined #asahi-gpu
Etrien has quit [Ping timeout: 480 seconds]
Dcow has quit [Ping timeout: 480 seconds]
<phire> I guess all the depth stuff is already fixed-function anyway, due to the deferred shading nature
<phire> ZLS = Z load store?
<alyssa> Presumably
chadmed has quit [Quit: Konversation terminated!]
<lina> phire: It "works" but it also faults sometimes. I did get frames out of it without the patch, but the hypervisor showed GPU faults.
<phire> oh, point. I should chuck this into the hypervisor later
<lina> Keep in mind that the depth buffer, unless the app really needs it, is only written to on TVB overflows
<lina> And those go away after a few frames
<alyssa> 02:10 < lina> Keep in mind that the depth buffer, unless the app really needs it, is only written to on TVB overflows
<alyssa> lina: sorry what?
<lina> If the app isn't actually reading out the depth buffer there is no need to flush it out to RAM
<lina> With my bunny renders, the depth buffer is unwritten if I make the TVB large enough that it can render in a single pass, I only get data there if it ends up splitting due to a small TVB
<alyssa> Hm.
<alyssa> 02:11 < lina> If the app isn't actually reading out the depth buffer there is no need to flush it out to RAM
<alyssa> That's an optimization that gallium/asahi isn't doing yet, though.
<alyssa> With current Mesa, the intention is to always flush to RAM
<lina> It sure is with my bunny framedumps...
<alyssa> (yes I know that's wasteful)
<alyssa> So I see two possibilities
<alyssa> 1. I screwed up Mesa and this isn't working on macOS, but in that case I'd expect to see more CTS fails
<alyssa> 2. Mesa is working ok, but your driver/Python isn't setting the right control register bits to make this work
<lina> Mind you, the mesa I'm using for python testing is very outdated right now, so if you fixed this recently, that might explain it. Or I got the depth flags wrong.
<lina> Or... hold on.
<phire> or there are two depth buffers?
<lina> Ohhh....
<phire> final output, and in-progress?
<lina> alyssa: The depth buffer *is* written unconditionally. However, it's written uncompressed!
<lina> So only the intermediate flushes use the acceleration buffer!
<lina> So you probably found the flag to make Z uncompressed on *final store*, and I am setting that right
<lina> But you're still faulting on intermediate flushes, before the TVB grows in Mesa, without the metadata buffer
<lina> And that's why it looks like it works
<lina> You just fault for a few frames iff the geometry is complex enough
<alyssa> Oh, geez
LunaFoxgirlVT has joined #asahi-gpu
<alyssa> That seems plausible
<lina> But I still haven't seen this on M1, only Ultra
<alyssa> right
<lina> So maybe intermediate Z flushes on Ultra using compression is a new G13X thing
<alyssa> I could believe that
<alyssa> Although it does seem odd to change any of this *within* a major arch,
<lina> It could be a config change
<alyssa> though maybe it's specced out for G13 but it's a model-specific feature bit
<lina> Either global or per batch
<lina> Let me look at what changed...
<phire> there were a few unks that flipped from 0 to 1
<lina> And at least a couple general bitmask like things with different values
<alyssa> (There's precedent for G13P missing capabilities that G13G has, I think)
<LunaFoxgirlVT> ^•w•^
<alyssa> LunaFoxgirlVT: Nya!~
<LunaFoxgirlVT> heyo
<phire> hey LunaFoxgirlVT
<phire> though I understand apple has a habit of just not enabling features that weren't ready in time, and never enabling them later
<alyssa> LunaFoxgirlVT: we haven't talked in a while but i wanted to let you know that my sister really likes inochi2d o:)
<LunaFoxgirlVT> alyssa: こんこん
<phire> but given the Pro, Max and Ultra all have the exact same gpu name (G13X), I suspect they share features
<LunaFoxgirlVT> alyssa: I'm in call with her right now lol
<alyssa> LunaFoxgirlVT: ^_^
<alyssa> phire: G13X isn't the name
<alyssa> the X is a wildcard
<alyssa> G13D, G13S, and i forget the other one
<phire> really?
<alyssa> yes
<LunaFoxgirlVT> Excited for the display port stuff to get merged some day so that I can see stuff on my mac mini conveniently lol
<LunaFoxgirlVT> At some point I gotta start building and releasing arm linux releases of Inochi2D =w=
<LunaFoxgirlVT> but it already runs on a raspberry pi 4 so it's a good start
<lina> Also I just realized my bunny is very, very broken with TVB overflows...
<phire> lina: did you push your changes to m1n1 somewhere?
<lina> Not yet
<lina> Need to clean it up
<phire> yeah, figured
<phire> I can probably check from macos if my m1 max is allocating those acceleration structures
<lina> Yeah, they show up in the command buffers
<lina> I also need to check again on the M1 Mini, since I didn't actually look there
<lina> I just don't see them in any HV logs I have lying around
<phire> guess that's a task for tomorrow's stream
<alyssa> I am curious, is the value of zls_control different on g13x from g13g?
<lina> I just got there w
<lina> Setting zls_control differently fixes the bunny on TVB overflows
<lina> Looks like it needs 0x44 ORed in
dviola has quit [Ping timeout: 480 seconds]
<lina> (Still storing to the accel buffer either way though)
<alyssa> Ok, I definitely don't see those bits on G13G
<alyssa> 0x44 in the RGX zls_control is "zloaden | forcezstore"
<lina> Oh, I have logs with those flags in the past... but I don't recall ever taking logs on an M1 Ultra before?
<alyssa> which is possible but maybe unlikely
<lina> Interesting...
<alyssa> I think 0x44 is just the general Z compression enable
<alyssa> 02:32 < phire> I can probably check from macos if my m1 max is allocating those acceleration structures
<lina> And yet I still get it using the compression buffer without that
<alyssa> phire: very curious about the answer here if you can manage to get a linear depth buffer
<lina> It just breaks
<alyssa> lina: \shrug/
<lina> Also I found logs with 0x80044 that were definitely taken on an M1...
<alyssa> the real solution here is to r/e compression on g13g and wire that up on g13g
<phire> lina: is that cute flag demo somewhere?
<lina> Don't think so w
<lina> *And* those logs don't have the compression meta buffer on G13G
<lina> No wait they do
<phire> um... wat. is this just something that was always broken?
<lina> OK, so this is related to compression... but then why is G13D enabling compression even without them, partially, then breaking?
<lina> I see a 1:1 correlation between 0x44 and the accel/meta buffer existing
<phire> also, who is forcing compression on? the kernel driver, or the agx firmware?
<lina> In old G13G traces (that I now found - there are few of them)
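A minimal sketch of the check lina describes: test whether the 0x44 bits are set in zls_control, and whether that matches the presence of the accel/meta buffer. The constant name, the helper names and the record layout are assumptions made for illustration. Only the values 0x44 and 0x80044 come from the log, and the example records are invented, not real trace data.

```python
# Bits lina reports ORing into zls_control. The name is hypothetical;
# the log only gives the value and alyssa's guess at its meaning.
ZLS_COMPRESSION_BITS = 0x44


def has_compression_bits(zls_control):
    """True if both bits of 0x44 are set in zls_control."""
    return (zls_control & ZLS_COMPRESSION_BITS) == ZLS_COMPRESSION_BITS


def correlation_holds(records):
    """records: iterable of (zls_control, has_meta_buffer) pairs.

    Returns True if the 0x44 bits are set exactly when a meta buffer exists.
    """
    return all(has_compression_bits(z) == has_meta for z, has_meta in records)


# Invented records for illustration only. 0x80044 is the value from the log.
example = [(0x80044, True), (0x80000, False)]
print(correlation_holds(example))
```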
<lina> phire: Partially, even!
chadmed has joined #asahi-gpu
<alyssa> 02:44 < lina> I see a 1:1 correlation between 0x44 and the accel/meta buffer existing
<phire> separate control over TVB overflow and final depthbuffer is a useful feature
<alyssa> Ok, that jives with what i've seen on g
<lina> Yeah, but not having 0x44 seems to break the overflow, like it stores compressed but loads uncompressed.
chadmed has quit []
<lina> Actually this fits the image I'm seeing... I get lots of mini-tiles with stuff in the top left.
<lina> That's... exactly what you would expect of compressed data loaded as uncompressed
<alyssa> eyes emoji
<lina> Maybe this is an outright bug? G13X always does partial depth stores compressed?
<LunaFoxgirlVT> data has been squimshed (´・ω・)
<phire> agx firmware bug? I'd be surprised if it was a hardware bug
<lina> Firmware bug would be kind of weird... especially since it's a shared codebase
<phire> but hardware bug would be weirder. how would the PBE know the difference between a final store and a TVB flush store?
<phire> unless... what if there is a hardware bug loading non-compressed depth/stencil buffers, and this is a workaround?
<lina> But then why is *store* being forced on, but not *load*?
<phire> hmmm.... weird
<phire> also, this is a waste of memory, having separate buffers for the TVB overflows and the final depth
<lina> They aren't separate, the extra buffer is the *metadata* buffer
<lina> Which is needed for compression
<phire> so that means the buffer has the same layout when compressed?
<lina> The main buffer is the same, it just has compressed tiles and the metadata buffer tells it how they were compressed.
<lina> The metadata buffer has one byte per 8x4 pixel block in the main buffer
<lina> This is pretty typical framebuffer compression stuff
<phire> ah, makes sense. so you can fallback to uncompressed if compression fails to produce a smaller result
<lina> Yes, exactly
<lina> You always need more space any time you use compression, by definition
<alyssa> Now I want to know what macOS on m1 ultra does with the "do not compress depth" magic with metal
<lina> FB compression isn't about size, it's about bandwidth, so the buffers are actually slightly larger (extra meta)
<alyssa> I honestly can't remember how I got it to do that
<phire> and framebuffer compression *only* cares about reducing memory bandwidth, not buffer size
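The sizing lina describes (one metadata byte per 8x4 pixel block, main buffer unchanged) can be sketched as below. The function names, the 4 bytes per pixel default and the round-up of partial blocks are assumptions. Any alignment or tiling padding the hardware needs is unknown and left out.

```python
def meta_buffer_size(width, height, block_w=8, block_h=4):
    """Bytes of metadata: one byte per 8x4 pixel block (per the log).

    Partial blocks at the right and bottom edges are rounded up, which is
    an assumption; real hardware may also pad to some alignment.
    """
    blocks_x = -(-width // block_w)   # ceiling division
    blocks_y = -(-height // block_h)
    return blocks_x * blocks_y


def total_depth_footprint(width, height, bytes_per_pixel=4):
    """Main buffer plus metadata. bytes_per_pixel=4 is an assumed format."""
    main = width * height * bytes_per_pixel
    return main + meta_buffer_size(width, height)


print(meta_buffer_size(1920, 1080))       # 240 * 270 blocks
print(total_depth_footprint(1920, 1080))  # slightly larger than uncompressed
```

Under these assumptions the compressed layout costs a little more memory than the uncompressed one, which matches the point that the goal is bandwidth, not size.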
<alyssa> something about rendering to a texture with usage Unknown and allowGPU = false
<lina> What's a good macOS app to exercise the depth buffer? I just did boot->safari->youtube and there isn't a single render with depth...
<lina> Some random webGL demo?
* lina opens up WebGL Aquarium
<lina> depth_flags = 0x5000000000c4154
<lina> What the heck. Well, that's a new one.
<lina> I see the stencil aux buffers too, so this was good to name those fields...
<lina> Huh.
<lina> alyssa: Remember how there is more than one depth buffer?
<lina> depth_buffer_ptr1 = 0x1501c68000
<lina> depth_buffer_ptr2 = 0x1501750000
<lina> depth_buffer_ptr3 = 0x1501c68000
<lina> Now I wonder if this is for separate buffers for load/partial/store
<lina> And then similarly the meta buffers are separate
<lina> Since some could be compressed and not others
<phire> that's what I was already wondering. though why is it so keen to avoid writing to the store buffer until the end?
<phire> oh... multisample resolve?
<lina> It makes sense to compress the intermediate buffers but not the final ones if the app needs it for some reason
<lina> And I guess under some conditions maybe you can't guarantee that the final buffer is "safe"? Like, the GPU can't trust it to read back during an internal reload?
<lina> So you'd want to use a separate buffer from the user-provided buffer in that case
<phire> yes, but if you are doing that, might as well use the same buffer for both partial and store. It's only if the final store buffer has a different layout for some reason that you need separate buffers
<phire> safe? that's very paranoid. but I guess
<lina> Yeah, so maybe if the app provides a custom layout Z buffer it can't use compression, and then you need a separate buffer for internal
<phire> wait, these are all supplied from userspace, so app has control over the partial anyway
<lina> Metal does
<lina> Apps, not so much
<phire> if apple think Metal is a safety boundary....
chadmed has joined #asahi-gpu
<lina> I mean, none of this matters for safety
<lina> If it's wrong the GPU will just fault
<lina> (Fault handling being terrible is a separate issue)
<phire> I just really doubt paranoia is the justification. different buffer layouts (store also resolving MSAA being the best example) is a much better reason
<alyssa> 03:02 < lina> Now I wonder if this is for separate buffers for load/partial/store
<alyssa> lina: Yes, that's it
<alyssa> I thought I noted that down somewhere
<lina> Come to think of it... I think aquarium is doing AA?
<alyssa> yes
<lina> Maybe not? Hard to tell with my cursed capture card
<phire> could the ultra have forced MSAA on for some reason?
<lina> Wait, does MSAA also need multiple samples in the depth buffer, or only color?
<phire> yes, multiple depth samples
<alyssa> yes
<lina> OK, then it makes sense that it would have separate buffers
<alyssa> i should sleep
<lina> And then those extra mystery zls_control flags are probably related...
<phire> depth-test and triangle edges are per sample. color is duplicated over all samples (masked by depth/stencil/rasterization)
<lina> alyssa: You should...
<phire> cya alyssa
<alyssa> lina: ohhh that's an excellent point
<alyssa> EXT_ms_rtt
<alyssa> that means multiple samples in the tilebuffer but only one in memory
<alyssa> which means 4x larger temp buffers for tvb overflow
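A rough sketch of the arithmetic behind the "4x larger" remark: with multisampled render-to-texture, the partial (TVB overflow) buffer has to hold every sample, while the resolved buffer in memory holds one. The function name and the 4 bytes per sample default are assumptions, and padding is ignored.

```python
def depth_buffer_bytes(width, height, samples=1, bytes_per_sample=4):
    """Unpadded size of a depth buffer holding `samples` samples per pixel.

    bytes_per_sample=4 is an assumed depth format, not taken from the log.
    """
    return width * height * samples * bytes_per_sample


resolved = depth_buffer_bytes(1920, 1080, samples=1)
partial = depth_buffer_bytes(1920, 1080, samples=4)
print(partial // resolved)
```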
<alyssa> I'm curious if Apple does page fault tricks to get away without wasting that memory in the happy path
<lina> No chance of that
<lina> They don't have any kind of reasonable fault handling
<alyssa> mali++
<lina> There is an IRQ line that goes high on faults, but the firmware has a 250us timeout or something like that, so there's no way that would be reliable
<lina> I guess tomorrow I should poke around those flags a bit and see what I find...
<phire> even when not using the EXT_ms_rtt extension, that's still something a smart opengl implementation can infer from the commands and optimise in.
<phire> I have no idea how smart apple's opengl implementation is, and I assume it's all very explicit in metal
<alyssa> phire: no it's not?
<alyssa> but yes metal builds in something like ms_rtt so safari should be using it fine
<phire> "no it's not?" actually, you might be right. opengl makes it near impossible to tell the unresolved MSAA buffer won't be used again later
chadmed has quit [Quit: Konversation terminated!]
ella-0_ has joined #asahi-gpu
ella-0 has quit [Read error: Connection reset by peer]
Etrien__ has quit [Read error: Connection reset by peer]
Etrien has joined #asahi-gpu
Etrien has quit [Ping timeout: 480 seconds]
ckileeeeeeeeeeeeeeeeeeeeeeetb^ has joined #asahi-gpu
SSJ_GZ has joined #asahi-gpu
chadmed has joined #asahi-gpu
Etrien has joined #asahi-gpu
chadmed has quit [Quit: Konversation terminated!]
chadmed has joined #asahi-gpu
<phire> m1 max seems to just set all those acceleration pointers as soon as the depth buffer is enabled
Etrien has quit [Read error: No route to host]
Etrien has joined #asahi-gpu
chadmed has quit [Quit: Konversation terminated!]
LunaFoxgirlVT has quit [Read error: Connection reset by peer]
Etrien has quit [Ping timeout: 480 seconds]
Etrien has joined #asahi-gpu
Etrien has quit [Read error: Connection reset by peer]
Etrien has joined #asahi-gpu
will[m] has quit [Write error: connection closed]
ckileeeeeeeeeeeeeeeeeeeeeeetb^ has quit [Remote host closed the connection]
Etrien__ has joined #asahi-gpu
Etrien has quit [Ping timeout: 480 seconds]
Etrien has joined #asahi-gpu
Etrien__ has quit [Ping timeout: 480 seconds]
le0n_ has joined #asahi-gpu
cr1901 has quit [Read error: Connection reset by peer]
le0n has quit [Ping timeout: 480 seconds]
chadmed has joined #asahi-gpu
chadmed has quit [Quit: Konversation terminated!]
r0ni has joined #asahi-gpu
Dcow has joined #asahi-gpu
Dcow_ has joined #asahi-gpu
Dcow has quit [Ping timeout: 480 seconds]
r0ni has quit [Quit: Textual IRC Client: www.textualapp.com]
sneak has joined #asahi-gpu
cr1901 has joined #asahi-gpu
Dcow has joined #asahi-gpu
Dcow_ has quit [Ping timeout: 480 seconds]
yamii has quit [Remote host closed the connection]
yamii has joined #asahi-gpu
<alyssa> curious
sneak has quit [Quit: ZNC 1.7.5 - https://znc.in]
<alyssa> lina: "maybe figure out how to do FP math in the kernel"
<alyssa> ooh boy
sneak has joined #asahi-gpu
<alyssa> sounds fun ;-)
* alyssa trusts your Rust code to get it right more than any of the kernel C code doing FP math
mrkajetanp has joined #asahi-gpu
mrkajetanp has quit [Remote host closed the connection]
Gaspare has joined #asahi-gpu
mkurz has joined #asahi-gpu
mkurz has quit [Quit: Leaving]
Gaspare has quit [Quit: Gaspare]
Dcow_ has joined #asahi-gpu
Dcow has quit [Ping timeout: 480 seconds]
alyssa has quit [Quit: leaving]
Gaspare has joined #asahi-gpu
balrog_ has quit []
balrog has joined #asahi-gpu
cy8aer has quit [Remote host closed the connection]
cy8aer has joined #asahi-gpu
r0ni has joined #asahi-gpu
Dcow has joined #asahi-gpu
Dcow_ has quit [Ping timeout: 480 seconds]
uur has joined #asahi-gpu
uur has quit []
Etrien__ has joined #asahi-gpu
ella-0_ is now known as ella-0
Etrien has quit [Ping timeout: 480 seconds]
DarkShadow44 has quit [Quit: ZNC - https://znc.in]
DarkShadow44 has joined #asahi-gpu
DarkShadow4444 has joined #asahi-gpu
DarkShadow4444 has quit []
DarkShadow4444 has joined #asahi-gpu
Etrien__ has quit [Read error: Connection reset by peer]
Etrien has joined #asahi-gpu
DarkShadow44 has quit [Ping timeout: 480 seconds]
uur has joined #asahi-gpu
<phire> I said this on Twitter but I'll repeat it here:
<phire> Surely there is a way to (safely) prevent reordering; compiler Instruction reordering barrier?
<phire> asm volatile("" ::: "memory") // for c
<phire> std::sync::atomic::compiler_fence(Ordering::SeqCst) // for rust
<phire> Though, I do really like the "let's just do soft float" solution. People are too scared of soft float
uur has quit [Remote host closed the connection]
Etrien has quit [Ping timeout: 480 seconds]
<phire> The performance isn't even that bad (when you inline it and avoid unpacking/packing into the proper ieee754 format)
SSJ_GZ has quit [Ping timeout: 480 seconds]
<phire> Lina: consider using a soft float working format of 64bit, sign in bit63, 8/16 bits of exponent at bit 32+, and 32bits of significand in bits 0-31
<phire> Would allow you pass it around (and even compare) as if it was a 64bit int, but then the compiler can quickly unpack it into its components just by treating it as two 32bit ints
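A minimal sketch of the working format phire suggests, in Python for illustration only (a kernel version would be integer-only Rust or C). Assumptions not in the log: a 16-bit biased exponent in bits 32-47, bias 0x7FFF, a significand with an explicit leading one in bit 31, truncation instead of rounding, and no handling of infinities or NaNs. All names are hypothetical.

```python
import math

EXP_BITS = 16        # assumed; the log says "8/16 bits"
EXP_BIAS = 0x7FFF    # assumed bias so stored exponents are non-negative
SIG_BITS = 32


def pack(sign, exp, sig):
    """Sign in bit 63, exponent from bit 32 up, significand in bits 0-31."""
    assert sign in (0, 1)
    assert 0 <= exp < (1 << EXP_BITS) and 0 <= sig < (1 << SIG_BITS)
    return (sign << 63) | (exp << 32) | sig


def unpack(word):
    """Split the word back into (sign, exp, sig)."""
    return word >> 63, (word >> 32) & ((1 << EXP_BITS) - 1), word & 0xFFFFFFFF


def from_float(x):
    """Convert a finite Python float, truncating extra significand bits."""
    sign = 1 if math.copysign(1.0, x) < 0 else 0
    if x == 0:
        return pack(sign, 0, 0)
    m, e = math.frexp(abs(x))            # abs(x) == m * 2**e, 0.5 <= m < 1
    return pack(sign, e + EXP_BIAS, int(m * (1 << SIG_BITS)))


def to_float(word):
    """Convert back to a Python float."""
    sign, exp, sig = unpack(word)
    value = math.ldexp(sig, exp - EXP_BIAS - SIG_BITS)
    return -value if sign else value


a, b = from_float(1.5), from_float(2.0)
print(hex(a), hex(b), a < b)
```

With a normalized significand, two non-negative values compare correctly as plain integers, since the exponent sits above the significand. Negative values would need the sign handled separately.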
Gaspare has quit [Ping timeout: 480 seconds]
Etrien has joined #asahi-gpu
Etrien__ has joined #asahi-gpu
Etrien has quit [Ping timeout: 480 seconds]