ChanServ changed the topic of #lima to: Development channel for open source lima driver for ARM Mali4** GPUs - Kernel driver has landed in mainline, userspace driver is part of mesa - Logs at https://oftc.irclog.whitequark.org/lima/
Guest1364 has joined #lima
Guest1364 has left #lima [#lima]
segfault_0 has joined #lima
<segfault_0> hey does anyone know how i'd figure out what's causing gnome not to start on lima specifically on my custom os setup? i get the kernel error "task list is full" and see it flicker between two frames but gnome starts fine with software rendering (so gnome works), with the gpu on a panfrost machine using the same rootfs (so mesa works), and with the same kernel on another distro (so the kernel works)
<segfault_0> oh and i forgot to mention lima is working on the device as well because things like sway and glxgears run fine
<enunes> segfault_0: first ensure you are using up to date mesa and kernel, and maybe also up to date gnome stack
<enunes> if it happens with up to date components then it could be a bug, most likely mesa, but you would have to provide more details on how to reproduce
<enunes> also "custom os" is a bit tricky to work with, it would be good if you could figure out what is the difference from your custom os that would make it possible to reproduce on a common distribution
<segfault_0> enunes: i've tried mesa 21.3.5 and the latest git master as of ~6 hours ago, linux 5.16.1 and gnome 41.3, and the same versions of those components in another distro that worked, but the kernel log hasn't given much useful info and neither has a gnome shell log, hence me asking here if anyone knows if there's a good way to get more information on what's going wrong
<enunes> segfault_0: can you post the full dmesg
<enunes> segfault_0: can you get an apitrace of gnome-shell?
<enunes> I tried launching a minimal gnome-shell on my board and cant reproduce that, but maybe you have configured gnome-shell in a different way
<segfault_0> i normally just start it through gdm
<segfault_0> but starting it manually doesn't fix the problem
<enunes> if it also reproduces when run manually then I think an apitrace of that would be helpful
<segfault_0> yep i'm building apitrace now to have a look
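The capture-and-replay workflow discussed here looks roughly like the following sketch (binary names `apitrace`/`eglretrace` are real apitrace tools; the exact gnome-shell invocation and flags are illustrative and depend on the session type):

```shell
# Record gnome-shell's EGL/GL calls to a .trace file.
# apitrace wraps the target process and logs every call it makes.
apitrace trace --api egl --output gnome-shell.trace gnome-shell --wayland

# Replay the trace on the device (or another machine) to see if the
# corruption reproduces without running the full compositor.
eglretrace gnome-shell.trace

# Inspect the recorded calls, e.g. to count what the first frame is doing.
apitrace dump gnome-shell.trace | head -n 50
```

Replaying the same trace on a known-good stack versus the broken one is what narrows the bug down to the driver rather than the application.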
<segfault_0> enunes: the trace is 300kb compressed, do you know of a good service i could use to upload it
<enunes> segfault_0: well, if you dont have some host to upload it to, you could file a gitlab mesa issue and attach it I guess
<enunes> after checking the trace I would suggest to file an issue anyway
<segfault_0> just looking at the trace i see the first frame has over 8000 calls, could that cause this?
<segfault_0> 8836 calls specifically
<enunes> depends on what kind of calls they are
<segfault_0> aha well i have some errors
<segfault_0> i tried playing back the trace on the device
<segfault_0> it spat out a few errors about the shader using "cogl_layer0", "cogl_layer1", "cogl_texel0" and "cogl_texel1" uninitialized
<enunes> thats probably not your issue, its a long standing thing in mutter/gnome-shell
<segfault_0> hmm, well the trace displayed as i'd expect from gnome on another device, but appeared broken in the same way on the device, if that's helpful at all
<enunes> if it reproduces the bug and dmesg report on the device then its helpful
<enunes> then if it is a mesa bug, I might be able to reproduce it locally as well by playing the trace
<enunes> if not, well then it gets a bit more complicated
<segfault_0> yep i see it creates another error in dmesg every time i play it back, it also creates a ton of cfi errors (i have cfi enabled but permissive)
<segfault_0> enunes can you try playing back the trace? (hopefully this is an ok way to send the file?) https://anonfiles.com/jev0eeFaxf/gnome-shell_trace
<segfault_0> i'd rather not put the trace in a public facing issue because it could have identifying information, hard to say
<enunes> well I can play it, but it does not reproduce the issue to me
<segfault_0> ok so that means the issue is most likely in mesa
<enunes> I think the issue now is something that you configured differently in your setup, maybe mesa build, maybe mutter/gnome-shell build
<enunes> probably not a mesa bug as in opengl implementation bug
<enunes> it might still be interesting to figure out what it is, but I think at this point you will have to debug and find out what your custom os is doing differently
<segfault_0> aside from only building panfrost and lima the only thing special about my mesa install is that it's built with clang, i can try building it with gcc to see if that has any impact
<segfault_0> i didn't think that would be an issue since panfrost works fine but who knows at this point
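For reference, switching the compiler for a mesa build is a matter of the `CC`/`CXX` environment at meson configure time; the build directory names and the exact driver option set below are illustrative, not segfault_0's actual configuration:

```shell
# The failing case: a lima/panfrost-only mesa build compiled with clang.
CC=clang CXX=clang++ meson setup build-clang \
    -Dgallium-drivers=lima,panfrost -Dvulkan-drivers=
ninja -C build-clang

# The working case: the identical configuration compiled with gcc.
CC=gcc CXX=g++ meson setup build-gcc \
    -Dgallium-drivers=lima,panfrost -Dvulkan-drivers=
ninja -C build-gcc
```

Keeping the two builds side by side in separate directories makes it cheap to flip between them while bisecting a compiler-dependent bug like this one.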
<enunes> panfrost is a completely different implementation
Daanct12 has joined #lima
<enunes> segfault_0: cool, I rebuilt mesa with clang and now I can reproduce it
<enunes> definitely something to look into
<segfault_0> haha well i'm annoyed that i can't use clang for mesa but at least we know where the issue is
<enunes> yeah
<enunes> do you want to file an issue?
Danct12 has quit [Ping timeout: 480 seconds]
Danct12 has joined #lima
<segfault_0> sure enough gnome starts now that i've rebuilt mesa with gcc
<segfault_0> what an odd issue lol
<segfault_0> i can file an issue although i won't be able to provide much information beyond what we've just confirmed
<enunes> thats fine, you can cc me there and just say its an issue running gnome-shell and we confirmed on IRC that it reproduces by building with clang only
Daanct12 has quit [Ping timeout: 480 seconds]
segfault_0 has quit [Quit: Leaving]
drod has joined #lima
<anarsoul> sounds like some alignment issue
<enunes> I checked with a simpler example and both gpir and ppir give very different results
<anarsoul> can you post it somewhere?
<enunes> looked like some list element add/del/remove thing
<enunes> deqp crashes
<enunes> its very borked with clang
<anarsoul> any warnings from clang?
<anarsoul> it may be worth running it through clang-analyzer
<enunes> constants are zero and it misses some gp uniforms
<enunes> constants again and gp scheduling behaves differently, apparently because of some node that is replaced by another somewhere in the list
<anarsoul> fps is lower, so it's expected to have less gp uniforms updates
<anarsoul> it looks like an issue with ppir constants to me
<enunes> fps is just because it crashes (as seen in dmesg) and then does nothing I suppose
<anarsoul> probably
<enunes> ppir constants is one of them, but also gpir scheduler
<enunes> at least :) then have to rerun everything and see
<enunes> ppir seems to be that super complicated bitcopy function in codegen, not the first time
<anarsoul> :)
<anarsoul> but why does it affect only consts?
<enunes> dont know, but just adding some prints there resolved it
<enunes> pointer casting is not a joke in that part of the code
<anarsoul> could be violating strict aliasing rule?
<enunes> thats what I imagined too, but not sure yet
drod has quit []