#zink on 2022-03-24 — irc logs at oftc.irclog.whitequark.org

2022-03-22 11:52 ChanServ changed the topic of #zink to: official development channel for the mesa3d zink driver || https://docs.mesa3d.org/drivers/zink.html

09:59 LexSfX has quit [Ping timeout: 480 seconds]

10:09 LexSfX has joined #zink

10:51 LexSfX has quit [Ping timeout: 480 seconds]

11:30 LexSfX has joined #zink

12:21 LexSfX has quit [Read error: Connection reset by peer]

12:21 LexSfX has joined #zink

14:12 <zmike> ajax: is error checking necessary?

14:12 <zmike> or I guess this is just so that we can avoid crashing?

14:13 <ajax> it... seems to be

14:13 <zmike> can you give me a backtrace of it real quick?

14:13 <ajax> the hang is when the window is destroyed from under us because you clicked the Close button

14:13 <ajax> so you have to make all your tc stuff fizzle

14:13 <ajax> yeah one sec

14:16 <ajax> tc not totally guilty here, pretty sure you'd need to wait for draw batches to fizzle too. anyway

14:16 <zmike> right, that's the case I'm considering in the "is error checking necessary?" question

14:18 <ajax> also... i think you're unlikely to hit this unless you have swapinterval working, since that way you have routine 16ms intervals where the window can get destroyed

14:18 <zmike> explains why I haven't seen it

14:19 <ajax> https://gitlab.freedesktop.org/ajax/mesa/-/tree/kopper-dc5-swapinterval is what i'm working from

14:20 <ajax> backtrace at the commit that adds swapinterval looks like https://paste.centos.org/view/raw/b9d2d78e

14:21 <zmike> hm and it's just looping and getting out of date

14:21 <zmike> ?

14:21 <zmike> makes sense

14:21 <ajax> it's looping getting something

14:21 <zmike> yeah

14:22 <ajax> didn't check in detail because that ??? comment looked juicy

14:22 <zmike> yeah I'm not really sure how to handle this tbh

14:22 <ajax> and the top patch in that branch starts threading a return code out from zink_kopper_acquire

14:22 <zmike> as you've seen, it's basically everywhere in the driver

14:22 <zmike> that's what I'm looking at now

14:23 <zmike> I don't think it's going to end up being particularly sane to try and handle that function failing

14:23 <ajax> iunno. i just kept changing return types from void to bool and i kept getting further.

14:23 <zmike> well yes, you will get further haha

14:24 <zmike> but I don't know that it's necessarily further in a winning direction

14:24 <ajax> it might not be sane but i don't think we have the option. you could hit this by trying to close like a firefox window.

14:24 <zmike> does the swapchain just stay permanently in out of date status?

14:24 <ajax> oh no, it's destroyed.

14:24 <zmike> sure, I'm saying I think there's a simpler solution

14:24 <zmike> what's the swapchain status?

14:24 <ajax> like it's gone gone. that thread done exited.

14:25 <zmike> ok but the function still has to return something

14:25 <zmike> in kopper_GetSwapchainImages I guess

14:27 <ajax> oh, yeah, sorry. yes, ERROR_OUT_OF_DATE sticks on the swapchain, and x11_manage_fifo_queues for that swapchain has exited

14:27 <zmike> hm

14:27 <zmike> but out-of-date can happen just from needing to be called again

14:28 <zmike> I think it should return VK_ERROR_SURFACE_LOST_KHR in this case

14:28 <zmike> (talking about AcquireNextImageKHR)

14:28 <zmike> that would avoid the loop

14:29 <ajax> hm. let me double check exactly how that thread exits, we should be able to set that from there

14:30 <zmike> the //????? comment was basically I needed to figure out some kind of tolerance threshold where I would need to consider the swapchain dead and probably set device_lost or something

14:30 <zmike> because I had to assume I'd never get a good swapchain

14:30 <zmike> but this codepath should, in practice, never be hit on a well-functioning driver
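[editor's sketch, not part of the log] The distinction zmike is drawing can be shown in a few lines of C: an out-of-date result is worth polling on, a surface-lost result is terminal, and a tolerance threshold catches the case where a good swapchain never arrives. Everything named here is hypothetical: `enum acquire_result` stands in for the VkResult values, and `acquire_with_retry`, the limit of 8 and the two fake backends are illustrative assumptions, not zink or WSI code.

```c
#include <stdbool.h>

/* Stand-ins for VK_SUCCESS, VK_ERROR_OUT_OF_DATE_KHR and
 * VK_ERROR_SURFACE_LOST_KHR; these are not the real Vulkan enums. */
enum acquire_result { ACQ_OK, ACQ_OUT_OF_DATE, ACQ_SURFACE_LOST };

/* Hypothetical hook: recreate the swapchain if needed, then acquire. */
typedef enum acquire_result (*acquire_fn)(void *data);

/* Arbitrary tolerance threshold (assumption, not a value from the log). */
#define MAX_OUT_OF_DATE_RETRIES 8

/* Returns true when an image was acquired. Returns false and sets
 * *swapchain_dead when the error is terminal or the threshold is hit. */
static bool
acquire_with_retry(acquire_fn acquire, void *data, bool *swapchain_dead)
{
   for (int tries = 0; tries < MAX_OUT_OF_DATE_RETRIES; tries++) {
      enum acquire_result res = acquire(data);
      if (res == ACQ_OK)
         return true;
      if (res == ACQ_SURFACE_LOST)
         break; /* terminal: retrying can never succeed */
      /* ACQ_OUT_OF_DATE: recoverable in principle, so poll again */
   }
   *swapchain_dead = true;
   return false;
}

/* Fake backend for illustration: out of date twice, then surface lost,
 * roughly what a closed window would look like with the proposed change. */
static enum acquire_result
fake_window_closed(void *data)
{
   int *calls = data;
   return ++(*calls) < 3 ? ACQ_OUT_OF_DATE : ACQ_SURFACE_LOST;
}

/* Fake backend that never recovers: only the threshold ends the loop. */
static enum acquire_result
fake_stuck_out_of_date(void *data)
{
   int *calls = data;
   ++(*calls);
   return ACQ_OUT_OF_DATE;
}
```

With `fake_window_closed` the loop stops on the third call; with `fake_stuck_out_of_date` only the threshold ends it, which is the situation the ????? comment was meant to cover.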

14:35 <ajax> what's the vulkan version of GL_KHR_debug

14:36 <zmike> what's that one do again?

14:36 <zmike> debug markers?

14:37 <ajax> callbacks

14:37 <ajax> select facility and error level, get notified

14:37 <zmike> maybe EXT_debug_utils?

14:38 <ajax> i find myself adding a lot of fprintf to x11_manage_fifo_queues and i feel like i should make those happen in-band if that's a thing

14:38 <turol> GL_KHR_debug also has object private data and debug markers

14:38 <zmike> sounds like a VK_EXT_debug_utils type thing

14:38 <ajax> yeah there we are

14:38 * ajax opens tab for later

15:01 <zmike> kusma: the tex-miplevel-selection cases seem like the lod is off by more than just 1.0

15:02 <zmike> though that did fix the basic cubemap case

15:11 <ajax> woo think i got it

15:12 <zmike> \o/

16:04 <ajax> zmike: kopper-dc5-swapinterval updated with kopper_acquire error handling

16:05 <ajax> i would be unsurprised if it has memory leaks, but

16:05 <zmike> ajax: will check in a bit

16:05 <zmike> sidebar: anholt: I think it should now be possible to do a full glcts asan run in ci

17:14 <zmike> ajax: I don't think the error handling is necessary beyond what you've done in zink_kopper.c

17:14 <zmike> well

17:15 <zmike> I think in theory VK_ERROR_OUT_OF_DATE_KHR is allowed

17:15 <zmike> and that should trigger polling like it's doing

17:16 <zmike> and the issue is more of whatever vkerror is coming out in update_swapchain back to kopper_acquire() and then determining what to do

17:16 <zmike> if the error is terminal then just set screen->device_lost and do nothing further

17:16 <zmike> execution in zink can proceed normally and it'll just noop everything automatically and be fine

17:17 <ajax> this isn't device lost though, it's surface lost

17:22 <zmike> sure, but I think they're functionally the same here?

17:24 <ajax> here, sure. i don't see why i should be limited to one window of firefox.

17:25 <zmike> hm this is true
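[editor's sketch, not part of the log] A toy model of why "just set device_lost" is too coarse, assuming the flag lives on a screen object shared by every context in the process. The `sketch_*` types and functions are hypothetical simplifications, not zink's real structures.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-ins for a zink screen/context. */
struct sketch_screen {
   bool device_lost; /* one flag shared by every context on the screen */
};

struct sketch_context {
   struct sketch_screen *screen;
   unsigned draws_recorded;
};

/* Every entry point checks the flag first and becomes a no-op once set. */
static bool
sketch_draw(struct sketch_context *ctx)
{
   if (ctx->screen->device_lost)
      return false;
   ctx->draws_recorded++;
   return true;
}

/* Terminal swapchain error handled the "device lost" way. */
static void
sketch_handle_terminal_error(struct sketch_context *ctx)
{
   ctx->screen->device_lost = true;
}
```

If two windows share one `sketch_screen`, closing the first and calling `sketch_handle_terminal_error` turns drawing into a no-op for the second as well, which is ajax's objection about a second firefox window.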

17:27 <zmike> surface lost = context lost, yes?

17:27 <ajax> nope

17:27 <zmike> fuck me today's not going well at all

17:28 <ajax> context can be attached to surfaces but they have independent lifetimes

17:28 <zmike> I think this type of error handling is prone to blowing up as execution continues

17:29 <zmike> so I think probably what needs to happen is that the swapchain, on exploding, then pulls in a dummy image

17:29 <zmike> and zink continues chugging along on that until things get fixed and propagate back to the driver

17:29 <ajax> or, you mutex around access to the swapchain, and take/release it as it comes in and out of batch reference?

17:29 <ajax> either works i guess

17:30 <zmike> uhhhhh

17:30 <zmike> hold on I gotta brain-up to understand that one right now

17:30 <ajax> gotta take the dog out for a bit anyway, bbiab

17:31 <zmike> yeah still not seeing how that would help exactly

17:31 LexSfX has quit []

17:32 <zmike> acquire can be called in a number of places, and if it fails, then there's effectively a garbage image in place, which is invalid usage and may crash

17:32 <zmike> frontend will catch up to that, but until then...

17:32 * zmike takes out the hacksaw

17:38 LexSfX has joined #zink

17:47 <ajax> i think i like the dummy image idea anyway

17:47 <zmike> I'm almost done I think

17:48 <ajax> because: this may only be surface lost, but it's lost, we should make progress. refcounting the surface means you have to wait for rendering to advance to it for destroy to return
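[editor's sketch, not part of the log] One way to picture the dummy image idea: once the swapchain is marked dead, acquire keeps handing back a valid placeholder instead of failing, so callers never see a garbage image and rendering can keep advancing. The `sketch_*` names and the struct layout are hypothetical; this is not the patch zmike pushed.

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_image { int id; };

/* Hypothetical swapchain wrapper with a fallback image. */
struct sketch_swapchain {
   struct sketch_image *images; /* real presentable images */
   size_t num_images;
   size_t next;                 /* index of the next image to hand out */
   bool dead;                   /* set once the surface is lost */
   struct sketch_image dummy;   /* always valid, never presented */
};

/* Never fails: returns a real image while the swapchain is alive and
 * the dummy image once it has died. */
static struct sketch_image *
sketch_acquire(struct sketch_swapchain *sc)
{
   if (sc->dead || sc->num_images == 0)
      return &sc->dummy;
   struct sketch_image *img = &sc->images[sc->next];
   sc->next = (sc->next + 1) % sc->num_images;
   return img;
}
```

Because `sketch_acquire` has no failure path, none of its callers need a return code threaded through them, which is the property that made this approach attractive over changing every void to bool.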

17:52 <zmike> think this should do it

17:52 <zmike> lemme pull in your changes and test

17:52 <zmike> you just run glxgears and click the close button?

17:53 <ajax> yep

17:57 <zmike> hm

17:58 <zmike> you on xwayland or xserver?

17:58 <ajax> xwayland

17:59 <zmike> yea ok it's different in xwl than xserver

18:04 <zmike> ajax: I'm still not seeing any kind of deadlock here, at least on anv

18:04 <zmike> it just crashes

18:05 <zmike> though I can certainly imagine a deadlock would exist

18:14 <ajax> alright, i have good news

18:14 <ajax> !15558 is all i need to fix the hang

18:14 <ajax> don't actually need the rest of the kopper_acquire stuff

18:15 <ajax> or, maybe we do in some more difficult-to-provoke race, but gears is behaving for me now

18:17 <ajax> maybe just change that ??? to assert(!"how did i get here") and move on

18:17 <zmike> well I've got the code to handle it now so I may as well jam it in

18:17 <ajax> wfm

18:22 <zmike> yea ok this works perfectly

18:26 <ajax> excellent

18:26 <ajax> what else can i break

18:27 <zmike> ajax: pushed to zmike/copper

18:27 <zmike> top patch

18:31 <ajax> not so great

18:31 <zmike> hm

18:32 <ajax> ugh wait.

18:32 <ajax> that still doesn't have the radv color write patch in

18:32 <zmike> oh

18:33 <zmike> lemme rebase

18:33 <zmike> rebased

18:35 <ajax> there we go. zombie swapchain patch works for me

18:36 <zmike> hooray

18:38 <ajax> kicking off a ci run for !14541

18:55 <ajax> kablooie

18:58 <zmike> updated again to add rbs...

18:58 * zmike checks pipeline results

18:59 <zmike> devastating

18:59 <zmike> oh that's the same one from https://gitlab.freedesktop.org/zmike/mesa/-/issues/109

19:04 <ajax> yeah, working my way through them

19:05 <zmike> noice

19:05 * zmike tries removing the ANV EXT_color_write_enable driver workaround

19:13 <zmike> ajax: I also accidentally tried running piglit without that wsi deadlock patch and there's a TON of deadlocked tests

19:15 <zmike> somewhat weird because I don't recall there being this many before, but I guess that's how progress works

19:16 <ajax> :/

19:17 <zmike> definitely a :/ situation

19:19 <ajax> any pattern to the tests?

19:19 <zmike> uhhh

19:20 <zmike> bin/gl-1.0-drawbuffer-modes -auto

19:20 <zmike> bin/viewport-clamp -auto -fbo

19:20 <zmike> bin/gl-1.0-front-invalidate-back -auto

19:20 <zmike> bin/gl-1.0-swapbuffers-behavior -auto

19:20 <zmike> bin/masked-clear -auto

19:20 <ajax> mmm

19:20 <zmike> bin/read-front -auto

19:22 <ajax> so, not really a pattern. and none of them failing for me.

19:22 <zmike> :/

19:22 <ajax> yeah

19:22 <zmike> I'm even running them in xwayland

19:23 * ajax tries a real piglit run

19:23 <zmike> might be the kind of thing where the xserver just gets into a bad state from being spammed with too much bullshit

19:23 <zmike> but I didn't try running them again manually since I need to actually finish some regression tests

19:24 <ajax> no worries, i'm sure i'll provoke something

19:49 <ajax> nothing yet though

19:50 <zmike> weird

20:09 LexSfX has quit [Remote host closed the connection]

20:10 quantum5 has quit [Quit: ZNC - https://znc.in]

20:10 quantum5 has joined #zink

20:11 LexSfX has joined #zink

20:11 <ajax> i do have a few reproducible regressions in piglit i think

20:13 <zmike> I was getting a stupid amount of crashes during that run

20:14 <ajax> https://paste.centos.org/view/raw/8bf73825

20:15 daniels has quit [Write error: connection closed]

20:15 jabashque has quit [Write error: connection closed]

20:15 <zmike> which piglit profile are you using?

20:15 jabashque has joined #zink

20:15 daniels has joined #zink

20:16 <ajax> which what now? i'm just running the one test binary directly

20:18 zmike has quit [Read error: No route to host]

20:19 zmike has joined #zink

20:23 jabashque has quit [Read error: No route to host]

20:24 jabashque has joined #zink

20:32 <ajax> that seems like the only non-flaky regression here though

20:32 <zmike> think i missed whatever was said due to irccloud flakiness

20:33 <ajax> 16:16 < ajax> which what now? i'm just running the one test binary directly

20:33 <ajax> HOWEVER

20:33 <ajax> once again !15558 cures that assert

20:33 <zmike> I missed more than one line

20:34 <ajax> https://paste.centos.org/view/raw/5060fc89

20:34 <zmike> oh

20:35 * zmike clicks through this pastebin to another pastebin

20:35 <zmike> huh

20:35 <zmike> weird

21:40 <ajax> so i'm working through the build errors

21:40 <ajax> and i think i have found two distinct compiler bugs

21:41 <ajax> fedora's gcc hates one of the gtests that tries to do clever shit to figure out if stack growth works, but it says a variable is maybe used uninitialized when in fact only its address is ever considered

21:42 <ajax> and the debian-gallium build seems to have confused a label with a symbol: https://gitlab.freedesktop.org/ajax/mesa/-/jobs/20189239#L3345