<alyssa>
Oof, deqp-gles31 is hitting the OOM killer now. That's not great.
<alyssa>
Yeah. deqp-gles31 is going down in under a minute (8GB mini)
<lina>
Oh yeah, deqp-EGL hits the OOM killer on 8GB with >1 thread (or maybe 2 fit, barely), I don't know why...
<lina>
It's not a hard leak but I don't know how much memory it's *supposed* to use...
<alyssa>
now, as in, this is a regression
<alyssa>
unsure if it's related to sync. it's kinda hard to bisect over hard uapi changes.
<alyssa>
but this was working before at any rate
<alyssa>
and we can't release if the driver is like this. what we're shipping now was rock solid. if we can't even run basic test suites anymore without crashing, it's almost guaranteed that something will regress for a user if we push it out like this.
<lina>
So I'm going to move the SHAREABLE fix thing to before the batch tracking because that was actually regressed (well, it was always technically wrong, it just started mattering) by the UAPI change, not the actual explicit sync
<lina>
the flush_resource thing
<lina>
And I'll do something to make batches not roll around all the time. I think I can just do a poll on all submitted syncobjs twice after each submission (you can do several in one syscall, doing it twice means we potentially clean up 2 batches for each new batch, which should keep memory usage under control)
<lina>
I don't think we want to do anything fancy with IRQs delivered to userspace... the equivalent of that with the existing UAPI is to just have a thread poll on syncobjs and do the cleanup there, but that sounds like a "might be a good idea some day, probably not now" thing...
<alyssa>
:+1:
<alyssa>
and o look it's your stream o'clock which means it's my sleep o'clock
<lina>
^^
aomizu has quit [Quit: My MacBook Air has gone to sleep. ZZZzzz…]
systwi has quit [Ping timeout: 480 seconds]
<lina>
alyssa: So I think a lot of the regressions are actually from the git work/rebase for upstreaming the device stuff ^^;; (fd leaks...)
<lina>
dEQP-EGL is all kinds of broken now...
<lina>
And at least some are due to the flush_resource stuff
systwi has joined #asahi-gpu
aomizu has joined #asahi-gpu
Emantor_ is now known as Emantor
<lina>
Okay, the flush_resource thing was completely broken... Maybe that's what was breaking Sway... ^^;;
Ziemas has quit [Ping timeout: 480 seconds]
MajorBiscuit has joined #asahi-gpu
i509vcb has quit [Quit: Connection closed for inactivity]
aomizu has quit [Quit: My MacBook Air has gone to sleep. ZZZzzz…]
aomizu has quit [Quit: My MacBook Air has gone to sleep. ZZZzzz…]

<lina>
alyssa: I have a flake regression in dEQP-GLES2 and I don't know where it comes from... it's easy to reproduce running dEQP-GLES2.functional.fragment_ops.interaction.basic_shader.77 in a loop. The images have random bad pixels... any ideas? It still happens in sync mode, and also with the flush_resource stuff disabled...
<lina>
I don't recall ever seeing this back when I tested dEQP-GLES2 with the explicit sync stuff... could it be an upstream mesa regression recently? I tried just UAPI on top of mesa/mesa:main and it still happens...
Ziemas has joined #asahi-gpu
m42uko has quit []
m42uko has joined #asahi-gpu
nyilas has joined #asahi-gpu
<lina>
alyssa: I can't repro the pink square thing with stk but I can with nautilus. Fixed ^^
<lina>
It turns out you can render to a BO and *then* export it and so we need to post-facto insert a DMA-BUF fence into it at export time...
c10l has quit [Ping timeout: 480 seconds]
aomizu has joined #asahi-gpu
aomizu has quit []
nyilas has quit [Remote host closed the connection]
aomizu has joined #asahi-gpu
<lina>
alyssa: Ummm so I'm running into ASAN issues with constant buffers and blitting and I can't figure out why this isn't broken for everyone...
<lina>
OK so I fixed that one with a state tracker patch... but I still don't know why this isn't breaking other drivers...
<lina>
And now I'm running into others... and I'm getting the feeling none of this has anything to do with explicit sync and it's just all very broken and I don't get it...
aomizu has quit [Max SendQ exceeded]
aomizu has joined #asahi-gpu
<lina>
Sooo --deqp-surface-type=fbo is very, very broken... ;;
aomizu has quit [Quit: My MacBook Air has gone to sleep. ZZZzzz…]
kesslerd has joined #asahi-gpu
tertu has quit [Ping timeout: 480 seconds]
aomizu has joined #asahi-gpu
possiblemeatball has joined #asahi-gpu
hightower3 has quit [Ping timeout: 480 seconds]
hightower2 has joined #asahi-gpu
c10l has joined #asahi-gpu
aomizu has quit [Quit: My MacBook Air has gone to sleep. ZZZzzz…]
LinuxM1 has joined #asahi-gpu
<alyssa>
lina: dEQP-GLES2.functional.fragment_ops.interaction.basic_shader.77 could easily be an upstream regression, yeah
<alyssa>
maybe the colour masking opts are still not quite right
<alyssa>
I use --deqp-surface-type=pbuffer, same as CI, fwiw
LinuxM1 has quit []
<lina>
alyssa: I'm stuck on max draw buffers right now, TIB offsets >= 64 do not work. I'm going to macOS to see if I'm missing something here, even kernel side...
kesslerd has quit [Remote host closed the connection]
<alyssa>
lina: That's an existing fail, not a regression
<alyssa>
i mean you're welcome to debug it but it's not a blocker
<lina>
It wasn't on your existing GLES3 fail list... but I see it is in the GLES31 one...
<alyssa>
oh. yeah. lmao
<alyssa>
Those tests used to be in dEQP-GLES31 but then Khronos realized "Wait there's literally nothing in here that depends on GLES31" so they moved it to GLES3
<alyssa>
and I haven't uprevved deqp in a long time
<lina>
Oh...........
possiblemeatball has quit [Quit: Quit]
<alyssa>
still grateful for help debugging them because I got as far as "yeah tib offsets >= 64 broken ???? why ????"
<lina>
Well either way at least for GLES3 I'm pretty sure I don't have any sync-related regressions... it's that, some depth/stencil stuff, and some pack/unpack stuff. All of which sound completely unrelated to sync.
<lina>
I think I spent the past 2-3 hours getting there...
<alyssa>
Awesome :)
possiblemeatball has joined #asahi-gpu
<lina>
I do feel like sometimes I spend hours figuring out stuff you've already figured out ;;
<alyssa>
if I had figured it out, I would've fixed it ;)
<lina>
I mean I spent that long getting to where you were ;;
<alyssa>
oh, um
<alyssa>
i'm sorry to hear that
<lina>
I'm not sure how long it's been but it's been a while of digging through render target code to get to offsets > 64 (which I got to just now) ^^;;
<alyssa>
:\
<alyssa>
the way I got there was by looking at the results.xml for the failing test
<alyssa>
and seeing that all render targets were correct except the very last
<alyssa>
and then looking at the store for the very last and seeing it was offset 64 and thinking "yeah that's probably it then"
<alyssa>
and that's when I shrugged it off and moved onto other things
<lina>
Yeah... I saw only the last one was wrong, but it took me a while to even notice the 64 offset...
<alyssa>
i'm not sure what you want me to say/do
<lina>
Maybe we should have a notes doc on what's broken/why to avoid duplicate work ^^;;
<alyssa>
sounds like it would get out of sync very quickly
<lina>
alyssa: Pushed to my agx/next, I think that should be clean. Needs a kernel uprev due to a minor UAPI change.
<alyssa>
:+1:
<lina>
Top commit on that is some guesswork for the things we were still not setting properly in the cmdbuf, no idea how it interacts with multisampling.
<lina>
(Or whether it does anything)
<alyssa>
Sure
<alyssa>
Looks like I can squash it in with the "add UAPI"? :P
<alyssa>
er
<alyssa>
s/:P/:)/
<lina>
If you think it's correct, sure ^^
<lina>
Also I just force pushed
<alyssa>
I don't think it's incorrect! :p
<lina>
alyssa: Also I had to add an argument to 1228f92097d due to a mesa:main change, but it's just copypasta, please check that it's correct ^^;;
<alyssa>
looks good
<alyssa>
and yeah I held back on submitting the MR for that because I knew that the func sig was going to have to change
<lina>
Didn't get around to reviewing MRs, sorry... (nor to addressing all the feedback on the uapi-prep one, though at least I pushed updates to it with the bugfixes)
<alyssa>
sometimes that's how it goes :)
<alyssa>
lina: Latest agx/next fixes the magenta rectangles on STK, thanks :)
<lina>
Didn't repro on stk at all for me but it did on nautilus ^^
<alyssa>
:+1:
<alyssa>
I hope you agree in retrospect this was an important bug to fix that didn't really have anything to do with stk or nautilus :)
<lina>
Yes, also I hate implicit sync ^^;;;
<alyssa>
interesting
<alyssa>
I would have thought after all that you would hate explicit sync instead :p
<lina>
Explicit sync was easy...
<lina>
Making it work with the implicit sync world on the other side on the other hand...
<alyssa>
ah, well, yes
jhan has joined #asahi-gpu
<alyssa>
deqp results look good again
<alyssa>
^31
<alyssa>
some regressions for me to fix but back to a world of sanity at least
<lina>
Also I think iris has the magenta rectangles bug, I don't see anything in that MR to address the case that fixed it ^^
<alyssa>
Uh oh!
<alyssa>
lina: but isn't that our thing?
<alyssa>
Doing things better than the industry? :~)
<lina>
^^
<alyssa>
lina: btw on the subject of known fails
<alyssa>
I *think* I figured out what the deqp-gles3 cube maps fails are about, *maybe*
<alyssa>
seamless cube maps are supposed to ignore the wrap mode in GL and VK
<alyssa>
but I think for some reason AGX (and Metal?) honour the repeat wrap mode even for seamless cube maps
<alyssa>
which is all kinds of wrong
<alyssa>
if this is indeed the issue -- I haven't confirmed it, it's basically just from staring at the fails/pass list -- then for GL it shouldn't be too bad to fix
<alyssa>
add a hack in mesa/st to fix up the wrap mode and it
<alyssa>
'll be ok
<alyssa>
for VK it'll be a bit more annoying because we have no way of knowing whether a given sampler is for a cube map or not
<alyssa>
the only solution I can come up with for Vulkan is duplicating samplers to have a regular version and a cube map version, and toggle between the two in the shader depending on whether loading from a cube map or not
<alyssa>
which is very much not great but not catastrophic
<alyssa>
However.. I don't see any references to this issue in Metal docs nor MoltenVK nor ANGLE
<alyssa>
and I don't see the Apple Metal blob fixing this up
<alyssa>
so it's hard to say what's actually happening here
<alyssa>
it's entirely possible that there's, like, a bit in some control register to toggle this behaviour
<alyssa>
IDK
jhan has quit [Ping timeout: 480 seconds]
<alyssa>
RGX has a TPU control register, maybe AGX does too, idk
kesslerd has joined #asahi-gpu
<alyssa>
really hoping AGX has a TPU control register that can toggle this behaviour tbh
<alyssa>
I can implement the workarounds if needed but I'd much rather not
<alyssa>
RGX has this bit
<alyssa>
TAG_CEMEDGE_DONTFILTER
<alyssa>
Pixel data master. Disable filtering over edges/corners for CEM. When set to 1, HW will be seemfull, ie, always stay in the current map, always honour
<alyssa>
the addressmode. When set to 0, HW will be seemless, ie, ignore addressmode, filter between faces at the edges/corners
i509vcb has joined #asahi-gpu
<alyssa>
That's not what we're looking for but it gives a sense that AGX might have an analogous reg
<alyssa>
lina: rsrc->bo can be corrupted in agx_batch_cleanup
<alyssa>
==53478== at 0x5FBD6F0: ??? (in /home/alyssa/mesa/build/src/gallium/targets/dri/asahi_dri.so)
<alyssa>
==53478== Address 0x7b45180 is 128 bytes inside a block of size 480 free'd
<alyssa>
==53478== at 0x4887B40: free (in /usr/libexec/valgrind/vgpreload_memcheck-arm64-linux.so)
<alyssa>
==53478== by 0x5DDD3DF: ??? (in /home/alyssa/mesa/build/src/gallium/targets/dri/asahi_dri.so)
<alyssa>
==53478== Block was alloc'd at
<alyssa>
==53478== at 0x4889F94: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-arm64-linux.so)
<alyssa>
==53478== by 0x5FC1687: ??? (in /home/alyssa/mesa/build/src/gallium/targets/dri/asahi_dri.so)
<alyssa>
use-after-free, yeet
<alyssa>
oh because we're reference counting BOs, but not resources
<alyssa>
so this is busted
<alyssa>
how did this work before
<lina>
I had enough crazy memory safety issues today... (see MR, you have no idea how long I spent tracking down that state tracker/blitter interaction one and deciding how to fix it...)
<lina>
You get to figure this one out ^^
<alyssa>
haha alright :)
<alyssa>
get some sleep, sis :)
<alyssa>
oh ugh ok this is very broken
<alyssa>
in panfrost
<lina>
Oh yeah the fence stuff is definitely broken in panfrost (at least the missing null checks in fence_reference) FYI
<lina>
Story of explicit sync: cargo cult code from other drivers then realize they're all broken ^^ (first panfrost, then iris/xe)
opticron has quit [Ping timeout: 480 seconds]
<alyssa>
s/explicit//
<alyssa>
I'm just going to remove the hash table
<alyssa>
Avoid a can of worms that way
<alyssa>
Probably faster too
<lina>
alyssa: Make sure to keep the writer_syncobj clear in that path somehow, that's important (without it you could end up importing the wrong fence on export, we don't want to import if there is no current writer)
<lina>
I'm not sure how bad that would be but I can see it being anywhere from a perf issue to a deadlock issue.
<alyssa>
Hm?
<alyssa>
oh
<alyssa>
yeah, I'm copying your existing logic
<alyssa>
just changing the data structures to something with a lot less pointers
<lina>
Got it ^^
<lina>
And I should get some sleep
<alyssa>
to avoid reference count hell
<alyssa>
nini
<lina>
nini~!
<alyssa>
glcts: ../src/gallium/drivers/asahi/agx_state.h:399: agx_writer_add: Assertion `(*value) == 0 && "there should be no existing writer"' failed.
<alyssa>
(pushed mesa branch "splat", run deqp-gles31 with deqp-runner and it crashed in a few minutes)
<alyssa>
(It's a broken Mesa branch but that shouldn't be able to provoke.. that)
<alyssa>
oddly deqp-gles3 results look great
Misthios has quit [Quit: Misthios]
Misthios has joined #asahi-gpu
Misthios has quit []
Misthios has joined #asahi-gpu
<alyssa>
fixed the packing issue
<alyssa>
too much late night code for me :p
nyilas has joined #asahi-gpu
jhan has joined #asahi-gpu
jhan has quit [Ping timeout: 480 seconds]
tertu has joined #asahi-gpu
opticron has joined #asahi-gpu
louisadamian has joined #asahi-gpu
<Tramtrist>
you have gained +5 follows from yesterday in case you havent checked :p
louisadamian has quit [Remote host closed the connection]
louisadamian has joined #asahi-gpu
<alyssa>
:D
<alyssa>
lina: trying to run piglit is crashing the gpu within minutes
<alyssa>
kernel bugs? firmware bugs? both?
<alyssa>
don't think this is related to explicit sync, I think the mesa side regressed a few months back and exposed the kernel bug
<alyssa>
but it's kinda hard to find/fix the mesa regression when the gpu crashes
louisadamian has quit [Remote host closed the connection]
<alyssa>
(and without being able to run piglit I can't tell how we're doing regressionwise since the last release. piglit results were crap then too but not "the gpu crashes in minutes" bad)
<alyssa>
at any rate I have a wip branch that gets deqp{2,3,31} where they should be
<alyssa>
so piglit is the next (and hopefully snafu before we can release
<alyssa>
^last)
louisadamian has joined #asahi-gpu
<alyssa>
there's a LOT of changes this time around.. I don't want regressions :)
<alyssa>
and i keep reading about people saying the gpu support is flawless, I want to be sure it stays that way ^^
jhan has joined #asahi-gpu
louisadamian has quit [Remote host closed the connection]