user982492 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
Hibyehello has quit [Ping timeout: 480 seconds]
Hibyehello has joined #asahi-gpu
Hibyehello has quit [Ping timeout: 480 seconds]
JTL has quit [Remote host closed the connection]
JTL has joined #asahi-gpu
JTL has quit [Remote host closed the connection]
JTL has joined #asahi-gpu
jhan has quit [Remote host closed the connection]
jhan has joined #asahi-gpu
jhan has quit [Ping timeout: 480 seconds]
bisko has joined #asahi-gpu
Hibyehello has joined #asahi-gpu
<lina>
alyssa: I can't reproduce your splat but I think I know what happened. I thought there would be something in drm_sched to wait for in-progress jobs when a scheduler/entity gets destroyed, but it doesn't look like it... so you probably had a GPU job submitted, then the userspace process aborted, then the kernel freed all the scheduler stuff, and when the GPU job completed it crashed because the scheduler was gone.
<lina>
To make things even more confusing, the job completion isn't via a job reference, it's via a fence...
<lina>
I've added a reference from the job to the scheduler and I think that will fix it... since the job should never get destroyed until the scheduler cleans it up from its main loop, which can only happen when the fence gets signaled or fails, so that should mean the scheduler always outlives the job outlives the fence...
<lina>
This ownership/lifetime stuff is so subtle and completely undocumented in C APIs, it's such a mess T_T
<lina>
I'll look at the piglit stuff next though I'm less worried about that since I think we've always had corner case GPU crash bugs (as much as I've tried to eliminate them...)
<lina>
And GPU crashes are better behaved now, at least it doesn't just hang your system
pjakobsson has joined #asahi-gpu
<lina>
Okay, reproduced the sched thing with something deliberate (8Kx8K glmarks getting killed in a loop) ^^
<lina>
Let's see if my fix worked...
<lina>
Okay, I fixed the splat but I have another issue... I'm leaking slots somewhere with this workload, it runs out
<lina>
... and I can't reproduce it now? ;;
<lina>
Why do I get the feeling this is drm_sched again... something like killing the entity stops jobs from being run, but doesn't cancel/free pending jobs...
<lina>
Ohhh wait, I think I'm never calling the entity cleanup function. Okay, that one's on me then...
<lina>
Wait no I do
jhan has joined #asahi-gpu
<lina>
Ah, this could be a bad firmware interaction... I do know I invalidate context before waiting for jobs to complete, which is probably a bad idea. Maybe that just kills things and leaves the jobs dangling, never to complete.
<lina>
Nope, this is getting signaled... so why is the scheduler not cleaning this up?
DarkShadow44 has joined #asahi-gpu
stickytoffee has quit [Quit: brb]
<lina>
And this time I crashed RTKit ^^;;
<lina>
That could just be the context issue I mentioned though... but first I want to find out why I'm leaking jobs...
stickytoffee has joined #asahi-gpu
<lina>
This is weeeird... the job cleanup callback gets called but the job doesn't get dropped sometimes?
nyilas has joined #asahi-gpu
<lina>
Ohh... am I deadlocking by any chance?
<lina>
Yeeeah...
<lina>
Okay, I can't put a reference to the scheduler in the job, because if it is the last reference dropping the scheduler from the job cleanup callback deadlocks ;;
<lina>
Maybe drm_sched_stop() before killing the scheduler will do what I want...?
<lina>
No, that only cleans up completed jobs and detaches the callbacks; it doesn't actually free pending jobs because it assumes you want to restart the queue later...
<lina>
I think I need to modify the C side for this, this is just broken, I have no idea how to safely wrap this API without duplicating job tracking...
<lina>
Reproduced the RTKit crash... now did my GpuContext thing fix it?
<lina>
[ 83.025087] asahi 206400000.gpu: Allocator: Corruption after object of type asahi::fw::fragment::RunFragmentG13V12_3 at 0xffffffa00009be00:0x928 + 0x0..0x5
<lina>
Ooooo that's a new one
MajorBiscuit has joined #asahi-gpu
jhan has quit [Ping timeout: 480 seconds]
kode54 has quit [Quit: Ping timeout (120 seconds)]
kode54 has joined #asahi-gpu
hightower2 has joined #asahi-gpu
<lina>
streaming-texture (or something) is OOMing for me...
<lina>
Excluding that though, I got through a piglit run ^^
<lina>
Trying again with higher GL...
<lina>
Still works ^^
<lina>
Let me run with a fix for that corruption warning and see how that goes...
<lina>
*Wild guess* those fields might have something to do with preemption, that sounds like the kind of thing piglit would end up triggering...
<lina>
I think it's fixed! ^^
<lina>
alyssa: Please uprev your kernel, I think I fixed both issues ^^
jhan has joined #asahi-gpu
possiblemeatball has joined #asahi-gpu
<lina>
alyssa: Also this is now rebased on 6.2 with DCP changes, so you might need a m1n1 update too (I did)