ChanServ changed the topic of #lima to: Development channel for open source lima driver for ARM Mali4** GPUs - Kernel driver has landed in mainline, userspace driver is part of mesa - Logs at https://oftc.irclog.whitequark.org/lima/
Guest1364 has joined #lima
Guest1364 has left #lima [#lima]
segfault_0 has joined #lima
<segfault_0> hey does anyone know how i'd figure out what's causing gnome not to start on lima specifically on my custom os setup? i get the kernel error "task list is full" and see it flicker between two frames but gnome starts fine with software rendering (so gnome works), with the gpu on a panfrost machine using the same rootfs (so mesa works), and with the same kernel on another distro (so the kernel works)
<segfault_0> oh and i forgot to mention lima is working on the device as well because things like sway and glxgears run fine
<enunes> segfault_0: first ensure you are using up to date mesa and kernel, and maybe also up to date gnome stack
<enunes> if it happens with up to date components then it could be a bug, most likely mesa, but you would have to provide more details on how to reproduce
<enunes> also "custom os" is a bit tricky to work with, it would be good if you could figure out what is the difference from your custom os that would make it possible to reproduce on a common distribution
<segfault_0> enunes: i've tried mesa 21.3.5 and the latest git master as of ~6 hours ago, linux 5.16.1 and gnome 41.3, and the same versions of those components in another distro that worked, but the kernel log hasn't given much useful info and neither has a gnome shell log, hence me asking here if anyone knows if there's a good way to get more information on what's going wrong
<enunes> segfault_0: can you post the full dmesg
<enunes> segfault_0: can you get an apitrace of gnome-shell?
<enunes> I tried launching a minimal gnome-shell on my board and cant reproduce that, but maybe you have configured gnome-shell in a different way
<segfault_0> i normally just start it through gdm
<segfault_0> but starting it manually doesn't fix the problem
<enunes> if it also reproduces when run manually then I think an apitrace of that would be helpful
<segfault_0> yep i'm building apitrace now to have a look
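The capture-and-replay workflow discussed here looks roughly like the following sketch (binary names `apitrace`/`eglretrace` are real apitrace tools; the exact gnome-shell invocation and flags are illustrative and depend on the session type):

```shell
# Record gnome-shell's EGL/GL calls to a .trace file.
# apitrace wraps the target process and logs every call it makes.
apitrace trace --api egl --output gnome-shell.trace gnome-shell --wayland

# Replay the trace on the device (or another machine) to see if the
# corruption reproduces without running the full compositor.
eglretrace gnome-shell.trace

# Inspect the recorded calls, e.g. to count what the first frame is doing.
apitrace dump gnome-shell.trace | head -n 50
```

Replaying the same trace on a known-good stack versus the broken one is what narrows the bug down to the driver rather than the application.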
<segfault_0> enunes: the trace is 300kb compressed, do you know of a good service i could use to upload it
<enunes> segfault_0: well, if you dont have some host to upload it to, you could file a gitlab mesa issue and attach it I guess
<enunes> after checking the trace I would suggest to file an issue anyway
<segfault_0> just looking at the trace i see the first frame has over 8000 calls, could that cause this?
<segfault_0> 8836 calls specifically
<enunes> depends on what kind of calls they are
<segfault_0> aha well i have some errors
<segfault_0> i tried playing back the trace on the device
<segfault_0> it spat out a few errors about the shader using "cogl_layer0", "cogl_layer1", "cogl_texel0" and "cogl_texel1" uninitialized
<enunes> thats probably not your issue, its a long standing thing in mutter/gnome-shell
<segfault_0> hmm, well the trace displayed as i'd expect from gnome on another device, but appeared broken in the same way on the device, if that's helpful at all
<enunes> if it reproduces the bug and dmesg report on the device then its helpful
<enunes> then if it is a mesa bug, I might be able to reproduce it locally as well by playing the trace
<enunes> if not, well then it gets a bit more complicated
<segfault_0> yep i see it creates another error in dmesg every time i play it back, it also creates a ton of cfi errors (i have cfi enabled but permissive)
<segfault_0> enunes can you try playing back the trace? (hopefully this is an ok way to send the file?) https://anonfiles.com/jev0eeFaxf/gnome-shell_trace
<segfault_0> i'd rather not put the trace in a public facing issue because it could have identifying information, hard to say
<enunes> well I can play it, but it does not reproduce the issue to me
<segfault_0> ok so that means the issue is most likely in mesa
<enunes> I think the issue now is something that you configured differently in your setup, maybe mesa build, maybe mutter/gnome-shell build
<enunes> probably not a mesa bug as in opengl implementation bug
<enunes> it might still be interesting to figure out what it is, but I think at this point you will have to debug and find out what your custom os is doing differently
<segfault_0> aside from only building panfrost and lima the only thing special about my mesa install is that it's built with clang, i can try building it with gcc to see if that has any impact
<segfault_0> i didn't think that would be an issue since panfrost works fine but who knows at this point
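For reference, switching the compiler for a mesa build is a matter of the `CC`/`CXX` environment at meson configure time; the build directory names and the exact driver option set below are illustrative, not segfault_0's actual configuration:

```shell
# The failing case: a lima/panfrost-only mesa build compiled with clang.
CC=clang CXX=clang++ meson setup build-clang \
    -Dgallium-drivers=lima,panfrost -Dvulkan-drivers=
ninja -C build-clang

# The working case: the identical configuration compiled with gcc.
CC=gcc CXX=g++ meson setup build-gcc \
    -Dgallium-drivers=lima,panfrost -Dvulkan-drivers=
ninja -C build-gcc
```

Keeping the two builds side by side in separate directories makes it cheap to flip between them while bisecting a compiler-dependent bug like this one.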
<enunes> panfrost is a completely different implementation
Daanct12 has joined #lima
<enunes> segfault_0: cool, I rebuilt mesa with clang and now I can reproduce it
<enunes> definitely something to look into
<segfault_0> haha well i'm annoyed that i can't use clang for mesa but at least we know where the issue is
<enunes> yeah
<enunes> do you want to file an issue?
Danct12 has quit [Ping timeout: 480 seconds]
Danct12 has joined #lima
<segfault_0> sure enough gnome starts now that i've rebuilt mesa with gcc
<segfault_0> what an odd issue lol
<segfault_0> i can file an issue although i won't be able to provide much information beyond what we've just confirmed
<enunes> thats fine, you can cc me there and just say its an issue running gnome-shell and we confirmed on IRC that it reproduces by building with clang only
Daanct12 has quit [Ping timeout: 480 seconds]
segfault_0 has quit [Quit: Leaving]
drod has joined #lima
<anarsoul> sounds like some alignment issue
<enunes> I checked with a simpler example and both gpir and ppir give very different results
<anarsoul> can you post it somewhere?
<enunes> looked like some list element add/del/remove thing
<enunes> deqp crashes
<enunes> its very borked with clang
<anarsoul> any warnings from clang?
<anarsoul> it may be worth running it through clang-analyzer
<enunes> constants are zero and it misses some gp uniforms
<enunes> constants again and gp scheduling behaves differently, apparently because of some node that is replaced by another somewhere in the list
<anarsoul> fps is lower, so it's expected to have less gp uniforms updates
<anarsoul> it looks like an issue with ppir constants to me
<enunes> fps is just because it crashes (as seen in dmesg) and then does nothing I suppose
<anarsoul> probably
<enunes> ppir constants is one of them, but also gpir scheduler
<enunes> at least :) then have to rerun everything and see
<enunes> ppir seems to be that super complicated bitcopy function in codegen, not the first time
<anarsoul> :)
<anarsoul> but why does it affect only consts?
<enunes> dont know, but just adding some prints there resolved it
<enunes> pointer casting is not a joke in that part of the code
<anarsoul> could be violating strict aliasing rule?
<enunes> thats what I imagined too, but not sure yet
drod has quit []