#asahi-gpu on 2023-06-29 — irc logs at oftc.irclog.whitequark.org

2022-12-21 00:46 ChanServ changed the topic of #asahi-gpu to: Asahi Linux GPU development (no user support, NO binary reversing) | Keep things on topic | GitHub: https://alx.sh/g | Wiki: https://alx.sh/w | Logs: https://alx.sh/l/asahi-gpu

00:02 <alyssa> fedora is my new love

00:02 * alyssa got the valgrind symbols

00:06 jlco has joined #asahi-gpu

00:10 darkapex has joined #asahi-gpu

00:28 nsklaus_ has quit [Remote host closed the connection]

00:31 <alyssa> kicking off another run with the leak fixes

00:32 <alyssa> lina: jannau: we should coordinate who's doing which hardware

00:32 <alyssa> I'll supply classic M1

00:32 <alyssa> jannau: said you can do M1 Max/Ultra/M2

00:33 <alyssa> lina: that leaves you with M1 Pro, if we even want a separate run for that?

00:47 mini_ has quit [Quit: ZNC closing...]

00:51 mini_ has joined #asahi-gpu

00:51 <alyssa> Oh come on. Failed in the exact same place.

00:51 <alyssa> evidently did not fix the leaks hard enough

00:52 <alyssa> lina: https://rosenzweig.io/hrm.txt

00:54 mini_ has quit []

00:56 <alyssa> lina: This is interesting

00:57 mini_ has joined #asahi-gpu

00:57 <alyssa> If I `./glcts -n dEQP-GLES31.functional.*`, I get "collect_garbage: failed to reserve space" in under 5 minutes

00:57 <alyssa> (Well before it gets to the spicy Big Framebuffer test that causes the run to fail.)

00:57 <alyssa> and this is despite low, ~constant memory usage according to top.

00:59 mini_ has quit []

01:01 mini_ has joined #asahi-gpu

01:11 cylm has joined #asahi-gpu

02:03 thevar1able_ has quit [Remote host closed the connection]

02:03 thevar1able_ has joined #asahi-gpu

03:01 c10l7 has joined #asahi-gpu

03:07 c10l has quit [Ping timeout: 480 seconds]

03:08 pyropeter3 has joined #asahi-gpu

03:10 pyropeter2 has quit [Ping timeout: 480 seconds]

03:26 <lina> I can do M2 and M1 Pro, which given our call earlier sounds like all we need with your M1? Or if jannau runs M2 and M1 Max, that covers it as well, or any other combination.

03:26 <lina> (We agreed we only need one each of t8103, t600x, t8112, and eventually t602x)

03:28 <lina> Also conclusion about that OOM thing is that it was the 4K kernel... which is not very surprising since it has known bugs.

03:29 <lina> The actual driver codepath there is just a GFP_KERNEL alloc failing with lots of memory free/cached, which is definitely not a problem with the driver itself, but I wouldn't be surprised if the 4K/16K shmem thing is breaking memory management...

04:38 lina has quit [Quit: Lost terminal]

04:44 possiblemeatball has joined #asahi-gpu

05:30 possiblemeatball has quit [Quit: Quit]

05:51 lina has joined #asahi-gpu

05:53 <jannau> I was under the impression that we wanted to test t6000/t6001/t6002 separately under the assumption that we effectively test Apple's firmware as well and probably hit different execution paths

05:54 <jannau> I'd be surprised if "there are twice as many clusters" is the only difference on the ultra

06:10 <jannau> lina: piglit run deqp_gles3 seems to hang at test 11377, using VK-GL-CTS opengl-es-cts-3.2.9 DEQP_TARGET=wayland on m1 ultra

06:10 xcpy0 has quit [Quit: Ping timeout (120 seconds)]

06:11 <jannau> kernel includes yesterday's drm sched fix

06:11 xcpy0 has joined #asahi-gpu

06:11 <jannau> seems related to "asahi 406400000.gpu: QueueJob 651041: Job timed out on the DRM scheduler, things will probably break (ran: true)"

06:12 <jannau> are "Unknown event message" interesting? There is one (reproducible) ~100 seconds prior

07:25 <jannau> 'deqp-gles3/functional/shaders/builtin_functions/precision/matrixcompmult/highp_fragment/mat3' seems to be the hanging test. I do not want to disturb the cts run so I will try to confirm that later

07:29 cylm has quit [Ping timeout: 480 seconds]

07:48 <jannau> test passes in the cts run with alyssa's es31 branch and VK-GL-CTS opengl-es-cts-3.2.9.3 with x11_egl

07:49 <jannau> so it could be either a mesa driver bug, a wayland wsi bug or a VK-GL-CTS bug

08:04 <lina> jannau: "Unknown event" is something we can't handle, usually dynamic TVB allocation or similar, so it's no surprise it would hang after that.

08:05 <lina> Is it reproducible if you increase asahi.initial_tvb_size=64 or something like that?

08:06 <lina> I think the logic to compute the minimum initial TVB size is not complete

08:07 <lina> Do you know if the hanging test uses MSAA? I wouldn't be surprised if I need a *sample_count factor in there or something

08:09 <lina> Re CTS runs, I think it's largely the same execution paths for all of a series, plus it's not like we're submitting conformance separately for each firmware version either...

08:09 <lina> I think Ultra really is mostly the same execution path when it comes to rendering, just initialization is probably different.

08:12 <lina> Also I know the initial TVB size is different with clustering, so it wouldn't surprise me if this breaks on Ultra only (or on the whole Pro/Max/Ultra series)

08:16 i509vcb has quit [Quit: Connection closed for inactivity]

08:26 mkurz has quit [Ping timeout: 480 seconds]

08:31 nsklaus_ has joined #asahi-gpu

08:35 lina has quit [Ping timeout: 480 seconds]

08:35 <jannau> all precision/matrixcompmult/highp_fragment/mat3* tests fail but mat2* / mat4* don't, I doubt the tests use MSAA

08:42 <jannau> tests pass with asahi.initial_tvb_size=64

08:49 mkurz has joined #asahi-gpu

08:54 chadmed has quit [Remote host closed the connection]

09:28 c10l7 has quit []

10:28 c10l7 has joined #asahi-gpu

10:29 c10l7 has quit []

10:30 c10l7 has joined #asahi-gpu

10:41 cylm has joined #asahi-gpu

10:57 kode54 has left #asahi-gpu [#asahi-gpu]

10:58 kujeger has quit [Quit: ZNC 1.8.2 - https://znc.in]

11:08 chadmed has joined #asahi-gpu

11:32 mkurz has quit [Ping timeout: 480 seconds]

11:46 c10l7 has quit []

12:00 c10l7 has joined #asahi-gpu

12:43 lina has joined #asahi-gpu

12:56 thelounge60655 has joined #asahi-gpu

13:00 <lina> jannau: Okay, so it's the TVB size thing... what machine is this?

13:02 thelounge6065 has quit [Ping timeout: 480 seconds]

13:02 thelounge60655 is now known as thelounge6065

13:02 kujeger has joined #asahi-gpu

13:03 <_jannau__> lina: mac studio m1 ultra, 48 gpu cores

13:04 <_jannau__> cts results are worse on t6002 compared to t8103, now testing t6001

13:05 <_jannau__> that is with asahi.initial_tvb_size=64 and without unknown event messages

13:41 <lina> What is failing?

13:42 mkurz has joined #asahi-gpu

13:55 <jannau> lina: quite a few atomic_counter tests

13:56 jlco has quit [Ping timeout: 480 seconds]

13:56 <jannau> results on m1 max looked very similar or identical

13:56 jlco has joined #asahi-gpu

13:57 <alyssa> spicy.

14:05 tirr has joined #asahi-gpu

14:07 <jannau> at least these ones: https://paste.debian.net/1284465/ - rerunning the cts atm due to mesa config mishap on my side

14:09 <alyssa> jannau: Those tests at the end are interesting

14:09 <alyssa> They're all compute shader tests, but tests with multiple invocations/groups

14:09 <alyssa> there are also single invocation/group versions of all those, which I assume are passing

14:10 <alyssa> All of these tests use atomics.

14:14 <jannau> yes, the other tests in the same group fail, also only ~60% of the atomic_counter tests with multiple threads fail

14:15 <jannau> err, the other compute shader tests in the same group pass

14:22 <alyssa> this really could be either Mesa or kernel, tbh

14:22 <jannau> is it expected that dEQP-GLES3.functional.flush_finish.flush_wait takes a long time (several seconds) and makes desktop use slow?

14:22 <alyssa> Yes.

14:23 <alyssa> flush_finish.* are the stupid tests that we skip in CI

14:26 <alyssa> thinking either the kernel is configuring the CDM badly on t600x, or Mesa is doing something silly, or..

14:26 <alyssa> Or, geez, I wonder

14:27 <alyssa> I wonder if compute kernels with atomics need to be pinned to a single cluster for correct results

14:27 <alyssa> or need to have barriers on multi-cluster parts that aren't needed on single-cluster parts

14:27 <alyssa> because the atomics aren't coherent across clusters

14:27 <alyssa> and so the compiler needs to report whether atomics are used, and if so either

14:27 <alyssa> 1. add more barriers?

14:28 <alyssa> 2. set a flag in the CDM dispatch structs forcing things to stay together?

14:28 <alyssa> 3. pass a flag to the kernel telling the kernel to pass a flag to the hardware to force things together?

14:28 <alyssa> the way to go about checking this is to diff the agxdecode on t600x for a compute kernel without atomic + compute kernel with

14:28 <alyssa> and see if the shaders diff (other than adding an atomic, obviously) and/or if the structures differ

14:28 <alyssa> lina: ^
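
[Editor's sketch] The comparison alyssa describes (decode of a compute kernel without atomics vs. one with) could be done roughly as below. This is only a sketch: the function name `diff_dumps` and the dump file names are hypothetical, and how the decoded dumps are captured is not covered. It also assumes that hex values of nine or more digits are GPU addresses that differ between runs for uninteresting reasons, so it masks them; shorter values, where a flag difference would show up, are left alone.

```shell
# Compare two decoded command-stream dumps (e.g. a no-atomics kernel vs. an
# atomics kernel). Names are hypothetical; the masking rule is an assumption.
diff_dumps() {
    a=$(mktemp)
    b=$(mktemp)
    # Mask long hex values (assumed to be addresses) so allocation noise
    # does not hide real differences in shaders or dispatch structures.
    sed 's/0x[0-9a-fA-F]\{9,\}/0xADDR/g' "$1" > "$a"
    sed 's/0x[0-9a-fA-F]\{9,\}/0xADDR/g' "$2" > "$b"
    diff -u "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    # 0 = identical after masking, 1 = differences found
    return $status
}

# Usage (file names are placeholders):
# diff_dumps no_atomics.txt with_atomics.txt
```

Anything left in the diff beyond the atomic instruction itself would be a candidate for the "extra barrier" or "dispatch flag" theories above.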

14:36 alyssa has quit [Quit: leaving]

14:43 alyssa has joined #asahi-gpu

14:44 alyssa_ has joined #asahi-gpu

14:46 alyssa_ has left #asahi-gpu [#asahi-gpu]

14:46 alyssa_ has joined #asahi-gpu

14:46 <alyssa_> foo

14:46 <alyssa_> alyssa: bar

14:46 <alyssa> ok, that does what I want

14:49 <jannau> alyssa: btw "static_assert(len_words == 4, "64-bit pointer");" doesn't compile on arch since gcc can't see that len_words is a compile-time constant

14:49 <alyssa> jannau: dammit

14:49 <jannau> in case we need commit-accurate mesa version information for cts

14:49 <alyssa> alright.

14:49 <alyssa> will fix that

14:54 <jannau> m2 seems to hang on 'dEQP-EGL.functional.resize.surface_size.shrink'

15:01 <alyssa> hang how?

15:03 <jannau> cts-runner doesn't progress for at least 3 minutes, with the last printed test dEQP-EGL.functional.resize.surface_size.shrink

15:19 alyssa has quit [Quit: alyssa]

15:19 alyssa_ has quit [Remote host closed the connection]

15:22 alyssa has joined #asahi-gpu

15:23 <alyssa> _jannau__: curious

15:39 <jannau> piglit deqp_gles31 results on t8112: 8x fail, 7x crash https://paste.debian.net/1284478/

15:49 dylanchapell has joined #asahi-gpu

15:59 <jannau> piglit deqp_egl results on t8112: 1x fail, 1x crash, 2x hangs (both surface_size "crashes") https://paste.debian.net/1284481/

16:47 <alyssa> why piglit btw?

16:48 <alyssa> I highly recommend https://gitlab.freedesktop.org/anholt/deqp-runner

16:48 <alyssa> I have `run-deqp31` aliased to

16:48 <alyssa> MESA_SHADER_CACHE_DISABLE=1 EGL_PLATFORM=surfaceless deqp-runner run --deqp ~/GLCTS/glcts --output output --jobs 8 --renderer-check G13 --caselist ~/GLCTS/gl_cts/data/mustpass/gles/aosp_mustpass/main/gles31-master.txt --skips ~/mesa/.gitlab-ci/all-skips.txt -- --deqp-surface-type=pbuffer --deqp-gl-config-name=rgba8888d24s8ms0 --deqp-surface-width=256 --deqp-surface-height=256

16:48 <alyssa> --deqp-log-images=disable --deqp-log-shader-sources=disable
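
[Editor's sketch] The two messages above read as one command split by the IRC line limit (an assumption, but the trailing `--deqp-log-*` flags only make sense that way). Written out as a shell function it might look like the following. All paths and flags are copied from the chat; the `DEQP_RUNNER` variable is a hypothetical addition so the command line can be printed with `echo` instead of executed.

```shell
# alyssa's `run-deqp31` alias as a function. DEQP_RUNNER is a hypothetical
# override for dry runs; everything else is taken from the chat lines.
run_deqp31() {
    MESA_SHADER_CACHE_DISABLE=1 EGL_PLATFORM=surfaceless \
    "${DEQP_RUNNER:-deqp-runner}" run \
        --deqp "$HOME/GLCTS/glcts" \
        --output output \
        --jobs 8 \
        --renderer-check G13 \
        --caselist "$HOME/GLCTS/gl_cts/data/mustpass/gles/aosp_mustpass/main/gles31-master.txt" \
        --skips "$HOME/mesa/.gitlab-ci/all-skips.txt" \
        -- \
        --deqp-surface-type=pbuffer \
        --deqp-gl-config-name=rgba8888d24s8ms0 \
        --deqp-surface-width=256 \
        --deqp-surface-height=256 \
        --deqp-log-images=disable \
        --deqp-log-shader-sources=disable
}

# Dry run: print the command line instead of running it.
# DEQP_RUNNER=echo run_deqp31
```

Everything after the bare `--` is handed to the dEQP binary rather than interpreted by the runner.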

16:49 alyssa_ has joined #asahi-gpu

16:49 <alyssa_> alyssa: bloop

16:49 <alyssa> bloop

16:49 alyssa_ has quit []

16:50 <alyssa> jannau: as for the deqp-gles31 results--

16:50 <alyssa> that is passing to me

16:50 <alyssa> and run-deqp31 would indicate that as passing

16:50 <alyssa> the image atomic comp swap return value tests are known broken and not included in the mustpass list, piglit is only running them because piglit is wrong

16:51 <alyssa> and the stress tests are also not included in mustpass, only the functional ones contribute to conformance

16:51 <alyssa> it'd be great if we passed stress too but it's not part of the CTS currently

16:56 <jannau> because I have a bad memory, it's noted somewhere in the developer section on the website and it's at least nicer than running it from vk-gl-cts

16:57 <alyssa> fair

16:57 <alyssa> anholt's cts runner is a lot newer than piglit tbf

16:57 <jannau> i.e. because I don't know better

16:59 <alyssa> today's lucky 10,000

16:59 <alyssa> or like lucky 1 tbf

17:00 dylanchapell_ has joined #asahi-gpu

17:35 <jannau> do the configs using "gles/khronos_mustpass*" have much additional test coverage over aosp {egl,gles2,gles3,gles31} or are the latter a good approximation of a passed cts run?

17:36 <jannau> with deqp-runner only dEQP-EGL.functional.resize.surface_size.{shrink,stretch_width} remain as timeout/hang

17:36 <jannau> on t8112

17:38 <alyssa> jannau: IIRC the khronos_mustpass is for KHR-GLES* tests

17:38 <alyssa> the aosp_mustpass is for dEQP tests

17:38 <alyssa> the CTS run contains both

17:39 <alyssa> jannau: and woo

17:39 <alyssa> I wonder what that's about

17:40 <alyssa> jannau: to confirm that's with x11-egl?

17:40 <alyssa> eric_engestrom: ^^ any ideas?

17:40 <jannau> yes, I'm just wondering if it makes sense to prepare deqp-runner cmdlines for the KHR-GLES* tests

17:41 <alyssa> Oh. Yeah, it is

17:41 <alyssa> the glcts binary contains all of them

17:41 <alyssa> so if you want you can `cat` together the different mustpass files for a big combined run

17:41 * alyssa should probably do that

17:42 <alyssa> fwiw there are a lot more deqps than khr-gles's, so it's not adding much wall clock time to run the KHR tests too
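
[Editor's sketch] "`cat` together the different mustpass files" could look like the helper below. The function name `combine_caselists` and the output file name are hypothetical, and the exact khronos_mustpass file names are not given in the chat, so they are left as placeholders. Beyond a plain `cat`, this version also drops blank lines and duplicate test names, which is an editorial addition rather than something alyssa suggested.

```shell
# Merge several mustpass caselists into one file for a combined run.
# First argument is the output file, the rest are input caselists.
combine_caselists() {
    out="$1"
    shift
    # Concatenate, skip blank lines, and keep only the first occurrence of
    # each test name while preserving order.
    cat "$@" | awk 'NF && !seen[$0]++' > "$out"
}

# Usage (the second path is a placeholder, not a real file name):
# combine_caselists combined.txt \
#     "$HOME/GLCTS/gl_cts/data/mustpass/gles/aosp_mustpass/main/gles31-master.txt" \
#     path/to/khronos_mustpass/caselist.txt
```

The combined file would then be passed wherever the single `--caselist` file is used today.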

17:42 <jannau> don't they all use different options

17:42 <alyssa> no?

17:43 <jannau> they are repeated with different --deqp-gl-config-id options

17:49 <jannau> they seem to wait on (resize?) events which either are not sent or arrive too early

18:00 <jannau> after looking at the code it might be a fractional scaling issue

18:00 <jannau> yes

18:01 <jannau> I'll start a full CTS run on t8112

18:07 <alyssa> ah

18:07 <alyssa> 17:43 jannau | they are repeated with different --deqp-gl-config-id options

18:07 <alyssa> oh, yeah

18:07 <alyssa> Yeah, I don't think that's worth it to replicate with deqp-runner

18:07 <alyssa> I just meant running the khr tests at all (with a standard config)

18:33 dylanchapell has quit [Ping timeout: 480 seconds]

18:45 c10l7 has quit []

18:50 c10l7 has joined #asahi-gpu

18:55 <alyssa> ruh roh why am I seeing faults trying to open keyboard shortcuts in gnome

18:55 <alyssa> one bug at a time. beh

18:55 <alyssa> oh that's probably the thing Lina just fixed

19:03 c10l7 has quit []

19:04 c10l has joined #asahi-gpu

19:13 bisko has quit [Quit: My Mac has gone to sleep. ZZZzzz…]

19:50 cylm has quit [Ping timeout: 480 seconds]

19:59 adryzz has joined #asahi-gpu

20:40 adryzz has quit [Quit: .]

21:41 possiblemeatball has joined #asahi-gpu

21:49 brolin has joined #asahi-gpu

21:51 brolin has quit []

22:37 alyssa has left #asahi-gpu [#asahi-gpu]

22:37 alyssa has joined #asahi-gpu

22:47 abd has joined #asahi-gpu

22:56 jlco_ has joined #asahi-gpu

22:57 i509vcb has joined #asahi-gpu

22:57 jlco has quit [Ping timeout: 480 seconds]

23:26 abd has quit [Ping timeout: 480 seconds]

23:50 <alyssa> jannau: I've just pushed 2c0d4fc71fb ("DONOTMERGE: agx: Don't emit wait_pix multiple times")

23:50 <alyssa> With that, I can't seem to repro the Fail, although I can still hit timeouts

23:50 <alyssa> ---Fuck, never mind, just got the fail

23:50 <alyssa> ignore me.

23:51 <alyssa> although seemingly the fail might've been carried over

23:53 <alyssa> I do find this flake concerningly difficult to repo.

23:54 <alyssa> repro.