ChanServ changed the topic of #asahi-gpu to: Asahi Linux GPU development (no user support, NO binary reversing) | Keep things on topic | GitHub: https://alx.sh/g | Wiki: https://alx.sh/w | Logs: https://alx.sh/l/asahi-gpu
<alyssa> fedora is my new love
* alyssa got the valgrind symbols
jlco has joined #asahi-gpu
darkapex has joined #asahi-gpu
nsklaus_ has quit [Remote host closed the connection]
<alyssa> kicking off another run with the leak fixes
<alyssa> lina: jannau: we should coordinate who's doing which hardware
<alyssa> I'll supply classic M1
<alyssa> jannau: said you can do M1 Max/Ultra/M2
<alyssa> lina: that leaves you with M1 Pro, if we even want a separate run for that?
mini_ has quit [Quit: ZNC closing...]
mini_ has joined #asahi-gpu
<alyssa> Oh come on. Failed in the exact same place.
<alyssa> evidently did not fix the leaks hard enough
mini_ has quit []
<alyssa> lina: This is interesting
mini_ has joined #asahi-gpu
<alyssa> If I `./glcts -n dEQP-GLES31.functional.*`, I get "collect_garbage: failed to reserve space" in under 5 minutes
<alyssa> (Well before it gets to the spicy Big Framebuffer test that causes the run to fail.)
<alyssa> and this is despite low, ~constant memory usage according to top.
mini_ has quit []
mini_ has joined #asahi-gpu
cylm has joined #asahi-gpu
thevar1able_ has quit [Remote host closed the connection]
thevar1able_ has joined #asahi-gpu
c10l7 has joined #asahi-gpu
c10l has quit [Ping timeout: 480 seconds]
pyropeter3 has joined #asahi-gpu
pyropeter2 has quit [Ping timeout: 480 seconds]
<lina> I can do M2 and M1 Pro, which given our call earlier sounds like all we need with your M1? Or if jannau runs M2 and M1 Max, that covers it as well, or any other combination.
<lina> (We agreed we only need one each of t8103, t600x, t8112, and eventually t602x)
<lina> Also conclusion about that OOM thing is that it was the 4K kernel... which is not very surprising since it has known bugs.
<lina> The actual driver codepath there is just a GFP_KERNEL alloc failing with lots of memory free/cached, which is definitely not a problem with the driver itself, but I wouldn't be surprised if the 4K/16K shmem thing is breaking memory management...
lina has quit [Quit: Lost terminal]
possiblemeatball has joined #asahi-gpu
possiblemeatball has quit [Quit: Quit]
lina has joined #asahi-gpu
<jannau> I was under the impression that we wanted to test t6000/t6001/t6002 separately under the assumption that we effectively test Apple's firmware as well and probably hit different execution paths
<jannau> I'd be surprised if "there are twice as many clusters" is the only difference on the ultra
<jannau> lina: piglit run deqp_gles3 seems to hang at test 11377, using VK-GL-CTS opengl-es-cts-3.2.9 DEQP_TARGET=wayland on m1 ultra
xcpy0 has quit [Quit: Ping timeout (120 seconds)]
<jannau> kernel includes yesterday's drm sched fix
xcpy0 has joined #asahi-gpu
<jannau> seems related to "asahi 406400000.gpu: QueueJob 651041: Job timed out on the DRM scheduler, things will probably break (ran: true)"
<jannau> are "Unknown event message" interesting? There is one (reproducible) ~100 seconds prior
<jannau> 'deqp-gles3/functional/shaders/builtin_functions/precision/matrixcompmult/highp_fragment/mat3' seems to be the hanging test. I do not want to disturb the cts run so I will try to confirm that later
cylm has quit [Ping timeout: 480 seconds]
<jannau> test passes in the cts run with alyssa's es31 branch and VK-GL-CTS opengl-es-cts-3.2.9.3 with x11_egl
<jannau> so it could be either a mesa driver bug, a wayland wsi bug or a VK-GL-CTS bug
<lina> jannau: "Unknown event" is something we can't handle, usually dynamic TVB allocation or similar, so it's no surprise it would hang after that.
<lina> Is it reproducible if you increase asahi.initial_tvb_size=64 or something like that?
<lina> I think the logic to compute the minimum initial TVB size is not complete
<lina> Do you know if the hanging test uses MSAA? I wouldn't be surprised if I need a *sample_count factor in there or something
<lina> Re CTS runs, I think it's largely the same execution paths for all of a series, plus it's not like we're submitting conformance separately for each firmware version either...
<lina> I think Ultra really is mostly the same execution path when it comes to rendering, just initialization is probably different.
<lina> Also I know the initial TVB size is different with clustering, so it wouldn't surprise me if this breaks on Ultra only (or on the whole Pro/Max/Ultra series)
i509vcb has quit [Quit: Connection closed for inactivity]
mkurz has quit [Ping timeout: 480 seconds]
nsklaus_ has joined #asahi-gpu
lina has quit [Ping timeout: 480 seconds]
<jannau> all precision/matrixcompmult/highp_fragment/mat3* tests fail but mat2* / mat4* don't, I doubt the tests use MSAA
<jannau> tests pass with asahi.initial_tvb_size=64
mkurz has joined #asahi-gpu
chadmed has quit [Remote host closed the connection]
c10l7 has quit []
c10l7 has joined #asahi-gpu
c10l7 has quit []
c10l7 has joined #asahi-gpu
cylm has joined #asahi-gpu
kode54 has left #asahi-gpu [#asahi-gpu]
kujeger has quit [Quit: ZNC 1.8.2 - https://znc.in]
chadmed has joined #asahi-gpu
mkurz has quit [Ping timeout: 480 seconds]
c10l7 has quit []
c10l7 has joined #asahi-gpu
lina has joined #asahi-gpu
thelounge60655 has joined #asahi-gpu
<lina> jannau: Okay, so it's the TVB size thing... what machine is this?
thelounge6065 has quit [Ping timeout: 480 seconds]
thelounge60655 is now known as thelounge6065
kujeger has joined #asahi-gpu
<_jannau__> lina: mac studio m1 ultra, 48 gpu cores
<_jannau__> cts results are worse on t6002 compared to t8103, now testing t6001
<_jannau__> that is with asahi.initial_tvb_size=64 and without unknown event messages
<lina> What is failing?
mkurz has joined #asahi-gpu
<jannau> lina: quite a few atomic_counter tests
jlco has quit [Ping timeout: 480 seconds]
<jannau> results on m1 max looked very similar or identical
jlco has joined #asahi-gpu
<alyssa> spicy.
tirr has joined #asahi-gpu
<jannau> at least these ones: https://paste.debian.net/1284465/ - rerunning the cts atm due to mesa config mishap on my side
<alyssa> jannau: Those tests at the end are interesting
<alyssa> They're all compute shader tests, but tests with multiple invocations/groups
<alyssa> there are also single invocation/group versions of all those, which I assume are passing
<alyssa> All of these tests use atomics.
<jannau> yes, the other tests in the same group fail, also only ~60% of the atomic_counter tests with multiple threads fail
<jannau> err, the other compute shader tests in the same group pass
<alyssa> this really could be either Mesa or kernel, tbh
<jannau> is it expected that dEQP-GLES3.functional.flush_finish.flush_wait takes a long time (several seconds) and makes desktop use slow?
<alyssa> Yes.
<alyssa> flush_finish.* are the stupid tests that we skip in CI
<alyssa> thinking either the kernel is configuring the CDM badly on t600x, or Mesa is doing something silly, or..
<alyssa> Or, geez, I wonder
<alyssa> I wonder if compute kernels with atomics need to be pinned to a single cluster for correct results
<alyssa> or need to have barriers on multi-cluster parts that aren't needed on single-cluster parts
<alyssa> because the atomics aren't coherent across clusters
<alyssa> and so the compiler needs to report whether atomics are used, and if so either
<alyssa> 1. add more barriers?
<alyssa> 2. set a flag in the CDM dispatch structs forcing things to stay together?
<alyssa> 3. pass a flag to the kernel telling the kernel to pass a flag to the hardware to force things together?
<alyssa> the way to go about checking this is to diff the agxdecode on t600x for a compute kernel without atomic + compute kernel with
<alyssa> and see if the shaders diff (other than adding an atomic, obviously) and/or if the structures differ
<alyssa> lina: ^
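The comparison alyssa proposes (decode of a compute kernel without atomics against one with atomics) could be sketched roughly as below. This is only an illustration: `diff_decodes` is a hypothetical helper, it assumes both decode dumps were already saved as plain text files, and it assumes raw addresses are printed as long `0x...` hex values, which are masked so that only shader and structure differences remain. Short hex values such as flag fields are deliberately left untouched.

```shell
# Hypothetical helper: diff two saved decode dumps while masking raw addresses.
# Assumptions (not from the log): dumps are plain text, and addresses appear
# as 0x followed by at least 8 hex digits. Shorter hex values are kept as-is
# so that differing flag fields still show up in the diff.
diff_decodes() {
    a=$(mktemp) || return 2
    b=$(mktemp) || return 2
    sed -E 's/0x[0-9a-fA-F]{8,}/0xADDR/g' "$1" > "$a"
    sed -E 's/0x[0-9a-fA-F]{8,}/0xADDR/g' "$2" > "$b"
    diff -u "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    # 0 = no differences besides addresses, 1 = real differences
    return "$rc"
}
```

Usage would look like `diff_decodes no_atomics.txt with_atomics.txt`, where both file names are placeholders for dumps captured on t600x.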
alyssa has quit [Quit: leaving]
alyssa has joined #asahi-gpu
alyssa_ has joined #asahi-gpu
alyssa_ has left #asahi-gpu [#asahi-gpu]
alyssa_ has joined #asahi-gpu
<alyssa_> foo
<alyssa_> alyssa: bar
<alyssa> ok, that does what I want
<jannau> alyssa: btw "static_assert(len_words == 4, "64-bit pointer");" doesn't compile on arch since gcc can't see that len_words is a compile-time constant
<alyssa> jannau: dammit
<jannau> in the case we need commit accurate mesa version information for cts
<alyssa> alright.
<alyssa> will fix that
<jannau> m2 seems to hang on 'dEQP-EGL.functional.resize.surface_size.shrink'
<alyssa> hang how?
<jannau> cts-runner doesn't progress for at least 3 minutes, with the last printed test dEQP-EGL.functional.resize.surface_size.shrink
alyssa has quit [Quit: alyssa]
alyssa_ has quit [Remote host closed the connection]
alyssa has joined #asahi-gpu
<alyssa> _jannau__: curious
<jannau> piglit deqp_gles31 results on t8112: 8x fail, 7x crash https://paste.debian.net/1284478/
dylanchapell has joined #asahi-gpu
<jannau> piglit deqp_egl results on t8112: 1x fail, 1x crash, 2x hangs (both surface_size "crashes") https://paste.debian.net/1284481/
<alyssa> why piglit btw?
<alyssa> I have `run-deqp31` aliased to
<alyssa> MESA_SHADER_CACHE_DISABLE=1 EGL_PLATFORM=surfaceless deqp-runner run --deqp ~/GLCTS/glcts --output output --jobs 8 --renderer-check G13 --caselist ~/GLCTS/gl_cts/data/mustpass/gles/aosp_mustpass/main/gles31-master.txt --skips ~/mesa/.gitlab-ci/all-skips.txt -- --deqp-surface-type=pbuffer --deqp-gl-config-name=rgba8888d24s8ms0 --deqp-surface-width=256 --deqp-surface-height=256
<alyssa> --deqp-log-images=disable --deqp-log-shader-sources=disable
alyssa_ has joined #asahi-gpu
<alyssa_> alyssa: bloop
<alyssa> bloop
alyssa_ has quit []
<alyssa> jannau: as for the deqp-gles31 results--
<alyssa> that is passing to me
<alyssa> and run-deqp31 would indicate that as passing
<alyssa> the image atomic comp swap return value tests are known broken and not included in the mustpass list, piglit is only running them because piglit is wrong
<alyssa> and the stress tests are also not included in mustpass, only the functional ones contribute to conformance
<alyssa> it'd be great if we passed stress too but it's not part of the CTS currently
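Since only tests in the mustpass list count toward conformance, one way to triage a list of failures is to keep just the names that also appear in the mustpass caselist. The sketch below is hypothetical: `mustpass_failures` is an illustrative name, and it assumes the failures are already written one per line in dotted dEQP form (piglit prints names differently, and that conversion is not covered here).

```shell
# Hypothetical helper: print only the failing tests that are in a mustpass list.
# $1: file with one failing test name per line, in dotted dEQP form (assumed)
# $2: mustpass caselist, one test name per line
mustpass_failures() {
    # -F: fixed strings, -x: match whole lines only, -f: read patterns from file
    grep -Fx -f "$2" "$1"
}
```

A call such as `mustpass_failures failures.txt gles31-master.txt` prints nothing (and `grep` exits non-zero) when none of the failures are mustpass tests.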
<jannau> because I have a bad memory, it's noted somewhere in the developer section on the website and it's at least nicer than running it from vk-gl-cts
<alyssa> fair
<alyssa> anholt's cts runner is a lot newer than piglit tbf
<jannau> i.e. because I don't know better
<alyssa> today's lucky 10,000
<alyssa> or like lucky 1 tbf
dylanchapell_ has joined #asahi-gpu
<jannau> do the configs using "gles/khronos_mustpass*" have much additional test coverage over aosp {egl,gles2,gles3,gles31} or are the latter a good approximation of a passed cts run?
<jannau> with deqp-runner only dEQP-EGL.functional.resize.surface_size.{shrink,stretch_width} remain as timeout/hang
<jannau> on t8112
<alyssa> jannau: IIRC the khronos_mustpass is for KHR-GLES* tests
<alyssa> the aosp_mustpass is for dEQP tests
<alyssa> the CTS run contains both
<alyssa> jannau: and woo
<alyssa> I wonder what that's about
<alyssa> jannau: to confirm that's with x11-egl?
<alyssa> eric_engestrom: ^^ any ideas?
<jannau> yes, I'm just wondering if it makes sense to prepare deqp-runner cmdlines for the KHR-GLES* tests
<alyssa> Oh. Yeah, it is
<alyssa> the glcts binary contains all of them
<alyssa> so if you want you can `cat` together the different mustpass files for a big combined run
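A minimal sketch of the `cat` approach, with one addition that is not from the log: duplicate test names are dropped while keeping the original order. `combine_caselists` is a hypothetical helper name, and the caselist paths passed to it are whatever mustpass files you want to merge.

```shell
# Hypothetical helper: merge several mustpass caselists into one file.
# $1: output file (must not be one of the inputs)
# remaining arguments: input caselists, one test name per line
combine_caselists() {
    out=$1
    shift
    # Concatenate the inputs and keep only the first occurrence of each
    # line, preserving order (the dedup step is an assumption, not required).
    cat "$@" | awk '!seen[$0]++' > "$out"
}
```

The combined file can then be passed to `--caselist` in place of the single `gles31-master.txt` list.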
* alyssa should probably do that
<alyssa> fwiw there are a lot more deqps than khr-gles's, so it's not adding much wall clock time to run the KHR tests too
<jannau> don't they all use different options
<alyssa> no?
<jannau> they are repeated with different --deqp-gl-config-id options
<jannau> they seem to wait on (resize?) events which either are not sent or arrive too early
<jannau> after looking at the code it might be a fractional scaling issue
<jannau> yes
<jannau> I'll start a full CTS run on t8112
<alyssa> ah
<alyssa> 17:43 jannau | they are repeated with different --deqp-gl-config-id options
<alyssa> oh, yeah
<alyssa> Yeah, I don't think that's worth it to replicate with deqp-runner
<alyssa> I just meant running the khr tests at all (with a standard config)
dylanchapell has quit [Ping timeout: 480 seconds]
c10l7 has quit []
c10l7 has joined #asahi-gpu
<alyssa> ruh roh why am I seeing faults trying to open keyboard shortcuts in gnome
<alyssa> one bug at a time. beh
<alyssa> oh that's probably the thing Lina just fixed
c10l7 has quit []
c10l has joined #asahi-gpu
bisko has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
cylm has quit [Ping timeout: 480 seconds]
adryzz has joined #asahi-gpu
adryzz has quit [Quit: .]
possiblemeatball has joined #asahi-gpu
brolin has joined #asahi-gpu
brolin has quit []
alyssa has left #asahi-gpu [#asahi-gpu]
alyssa has joined #asahi-gpu
abd has joined #asahi-gpu
jlco_ has joined #asahi-gpu
i509vcb has joined #asahi-gpu
jlco has quit [Ping timeout: 480 seconds]
abd has quit [Ping timeout: 480 seconds]
<alyssa> jannau: I've just pushed 2c0d4fc71fb ("DONOTMERGE: agx: Don't emit wait_pix multiple times")
<alyssa> With that, I can't seem to repro the Fail, although I can still hit timeouts
<alyssa> ---Fuck, never mind, just got the fail
<alyssa> ignore me.
<alyssa> although seemingly the fail might've been carried over
<alyssa> I do find this flake concerningly difficult to repo.
<alyssa> repro.