ChanServ changed the topic of #zink to: official development channel for the mesa3d zink driver || https://docs.mesa3d.org/drivers/zink.html
LexSfX has quit [Ping timeout: 480 seconds]
LexSfX has joined #zink
LexSfX has quit [Ping timeout: 480 seconds]
LexSfX has joined #zink
LexSfX has quit [Read error: Connection reset by peer]
LexSfX has joined #zink
<zmike> ajax: is error checking necessary?
<zmike> or I guess this is just so that we can avoid crashing?
<ajax> it... seems to be
<zmike> can you give me a backtrace of it real quick?
<ajax> the hang is when the window is destroyed from under us because you clicked the Close button
<ajax> so you have to make all your tc stuff fizzle
<ajax> yeah one sec
<ajax> tc not totally guilty here, pretty sure you'd need to wait for draw batches to fizzle too. anyway
<zmike> right, that's the case I'm considering in the "is error checking necessary?" question
<ajax> also... i think you're unlikely to hit this unless you have swapinterval working, since that way you have routine 16ms intervals where the window can get destroyed
<zmike> explains why I haven't seen it
<ajax> backtrace at the commit that adds swapinterval looks like https://paste.centos.org/view/raw/b9d2d78e
<zmike> hm and it's just looping and getting out of date
<zmike> ?
<zmike> makes sense
<ajax> it's looping getting something
<zmike> yeah
<ajax> didn't check in detail because that ??? comment looked juicy
<zmike> yeah I'm not really sure how to handle this tbh
<ajax> and the top patch in that branch starts threading a return code out from zink_kopper_acquire
<zmike> as you've seen, it's basically everywhere in the driver
<zmike> that's what I'm looking at now
<zmike> I don't think it's going to end up being particularly sane to try and handle that function failing
<ajax> iunno. i just kept changing return types from void to bool and i kept getting further.
<zmike> well yes, you will get further haha
<zmike> but I don't know that it's necessarily further in a winning direction
<ajax> it might not be sane but i don't think we have the option. you could hit this by trying to close like a firefox window.
<zmike> does the swapchain just stay permanently in out of date status?
<ajax> oh no, it's destroyed.
<zmike> sure, I'm saying I think there's a simpler solution
<zmike> what's the swapchain status?
<ajax> like it's gone gone. that thread done exited.
<zmike> ok but the function still has to return something
<zmike> in kopper_GetSwapchainImages I guess
<ajax> oh, yeah, sorry. yes, ERROR_OUT_OF_DATE sticks on the swapchain, and x11_manage_fifo_queues for that swapchain has exited
<zmike> hm
<zmike> but out-of-date can happen just from needing to be called again
<zmike> I think it should return VK_ERROR_SURFACE_LOST_KHR in this case
<zmike> (talking about AcquireNextImageKHR)
<zmike> that would avoid the loop
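A minimal sketch of the distinction zmike is drawing, not the actual zink or WSI code: out-of-date is treated as "try again", while surface-lost is terminal and ends the loop immediately. Everything here (`enum acquire_result`, `struct fake_swapchain`, `fake_acquire`, `acquire_with_retry`, and the retry limit) is a hypothetical stand-in; the enum values only mirror `VK_ERROR_OUT_OF_DATE_KHR` and `VK_ERROR_SURFACE_LOST_KHR` so the example builds without Vulkan headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the VkResult values discussed above. */
enum acquire_result {
   ACQUIRE_SUCCESS,
   ACQUIRE_OUT_OF_DATE,  /* mirrors VK_ERROR_OUT_OF_DATE_KHR: retry */
   ACQUIRE_SURFACE_LOST, /* mirrors VK_ERROR_SURFACE_LOST_KHR: terminal */
};

/* Hypothetical swapchain model used only for this illustration. */
struct fake_swapchain {
   int stale_acquires_left; /* acquires that will still report out-of-date */
   bool window_destroyed;   /* the window was closed from under us */
   bool dead;               /* set once we give up on this swapchain */
};

static enum acquire_result
fake_acquire(struct fake_swapchain *sc)
{
   if (sc->window_destroyed)
      return ACQUIRE_SURFACE_LOST;
   if (sc->stale_acquires_left > 0) {
      sc->stale_acquires_left--;
      return ACQUIRE_OUT_OF_DATE;
   }
   return ACQUIRE_SUCCESS;
}

/* Returns true if an image was acquired; marks the swapchain dead otherwise. */
static bool
acquire_with_retry(struct fake_swapchain *sc, int max_retries)
{
   assert(max_retries >= 0);
   for (int i = 0; i <= max_retries; i++) {
      switch (fake_acquire(sc)) {
      case ACQUIRE_SUCCESS:
         return true;
      case ACQUIRE_OUT_OF_DATE:
         /* A real driver would recreate the swapchain before retrying. */
         continue;
      case ACQUIRE_SURFACE_LOST:
         /* Terminal: looping again can never succeed. */
         sc->dead = true;
         return false;
      }
   }
   /* Tolerance threshold exceeded: assume we never get a good swapchain. */
   sc->dead = true;
   return false;
}
```

With a sticky out-of-date result and no retry limit, this loop would spin forever, which matches the hang described earlier; a distinct terminal result lets the caller stop on the first failed acquire.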
<ajax> hm. let me double check exactly how that thread exits, we should be able to set that from there
<zmike> the //????? comment was basically I needed to figure out some kind of tolerance threshold where I would need to consider the swapchain dead and probably set device_lost or something
<zmike> because I had to assume I'd never get a good swapchain
<zmike> but this codepath should, in practice, never be hit on a well-functioning driver
<ajax> what's the vulkan version of GL_KHR_debug
<zmike> what's that one do again?
<zmike> debug markers?
<ajax> callbacks
<ajax> select facility and error level, get notified
<zmike> maybe EXT_debug_utils?
<ajax> i find myself adding a lot of fprintf to x11_manage_fifo_queues and i feel like i should make those happen in-band if that's a thing
<turol> GL_KHR_debug also has object private data and debug markers
<zmike> sounds like a VK_EXT_debug_utils type thing
<ajax> yeah there we are
* ajax opens tab for later
<zmike> kusma: the tex-miplevel-selection cases seem like the lod is off by more than just 1.0
<zmike> though that did fix the basic cubemap case
<ajax> woo think i got it
<zmike> \o/
<ajax> zmike: kopper-dc5-swapinterval updated with kopper_acquire error handling
<ajax> i would be unsurprised if it has memory leaks, but
<zmike> ajax: will check in a bit
<zmike> sidebar: anholt: I think it should now be possible to do a full glcts asan run in ci
<zmike> ajax: I don't think the error handling is necessary beyond what you've done in zink_kopper.c
<zmike> well
<zmike> I think in theory VK_ERROR_OUT_OF_DATE_KHR is allowed
<zmike> and that should trigger polling like it's doing
<zmike> and the issue is more of whatever vkerror is coming out in update_swapchain back to kopper_acquire() and then determining what to do
<zmike> if the error is terminal then just set screen->device_lost and do nothing further
<zmike> execution in zink can proceed normally and it'll just noop everything automatically and be fine
<ajax> this isn't device lost though, it's surface lost
<zmike> sure, but I think they're functionally the same here?
<ajax> here, sure. i don't see why i should be limited to one window of firefox.
<zmike> hm this is true
<zmike> surface lost = context lost, yes?
<ajax> nope
<zmike> fuck me today's not going well at all
<ajax> context can be attached to surfaces but they have independent lifetimes
<zmike> I think this type of error handling is prone to blowing up as execution continues
<zmike> so I think probably what needs to happen is that the swapchain, on exploding, then pulls in a dummy image
<zmike> and zink continues chugging along on that until things get fixed and propagate back to the driver
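The dummy-image idea can be sketched roughly as follows. This is an illustration under assumptions, not the patch that landed: `struct image`, `struct swapchain_state`, and `current_target` are hypothetical names, and a real driver would be handing out Vulkan image handles rather than small structs.

```c
#include <stdbool.h>

/* Hypothetical image handle; a real driver would hold a VkImage here. */
struct image {
   int id;
};

/* Hypothetical swapchain state with a fallback render target. */
struct swapchain_state {
   struct image images[2]; /* real presentable images */
   struct image dummy;     /* always-valid placeholder target */
   unsigned next;          /* index of the next real image to hand out */
   bool lost;              /* set when the surface has gone away */
};

/* Always returns a valid image, so callers never see a garbage target. */
static struct image *
current_target(struct swapchain_state *sc)
{
   if (sc->lost)
      return &sc->dummy;

   struct image *img = &sc->images[sc->next];
   sc->next = (sc->next + 1) % 2;
   return img;
}
```

The point of the design is that acquire never has to report failure up through every caller: rendering lands in the placeholder until the frontend notices the surface is gone.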
<ajax> or, you mutex around access to the swapchain, and take/release it as it comes in and out of batch reference?
<ajax> either works i guess
<zmike> uhhhhh
<zmike> hold on I gotta brain-up to understand that one right now
<ajax> gotta take the dog out for a bit anyway, bbiab
<zmike> yeah still not seeing how that would help exactly
LexSfX has quit []
<zmike> acquire can be called in a number of places, and if it fails, then there's effectively a garbage image in place, which is invalid usage and may crash
<zmike> frontend will catch up to that, but until then...
* zmike takes out the hacksaw
LexSfX has joined #zink
<ajax> i think i like the dummy image idea anyway
<zmike> I'm almost done I think
<ajax> because: this may only be surface lost, but it's lost, we should make progress. refcounting the surface means you have to wait for rendering to advance to it for destroy to return
<zmike> think this should do it
<zmike> lemme pull in your changes and test
<zmike> you just run glxgears and click the close button?
<ajax> yep
<zmike> hm
<zmike> you on xwayland or xserver?
<ajax> xwayland
<zmike> yea ok it's different in xwl than xserver
<zmike> ajax: I'm still not seeing any kind of deadlock here, at least on anv
<zmike> it just crashes
<zmike> though I can certainly imagine a deadlock would exist
<ajax> alright, i have good news
<ajax> !15558 is all i need to fix the hang
<ajax> don't actually need the rest of the kopper_acquire stuff
<ajax> or, maybe we do in some more difficult-to-provoke race, but gears is behaving for me now
<ajax> maybe just change that ??? to assert(!"how did i get here") and move on
<zmike> well I've got the code to handle it now so I may as well jam it in
<ajax> wfm
<zmike> yea ok this works perfectly
<ajax> excellent
<ajax> what else can i break
<zmike> ajax: pushed to zmike/copper
<zmike> top patch
<ajax> not so great
<zmike> hm
<ajax> ugh wait.
<ajax> that still doesn't have the radv color write patch in
<zmike> oh
<zmike> lemme rebase
<zmike> rebased
<ajax> there we go. zombie swapchain patch works for me
<zmike> hooray
<ajax> kicking off a ci run for !14541
<ajax> kablooie
<zmike> updated again to add rbs...
* zmike checks pipeline results
<zmike> devastating
<ajax> yeah, working my way through them
<zmike> noice
* zmike tries removing the ANV EXT_color_write_enable driver workaround
<zmike> ajax: I also accidentally tried running piglit without that wsi deadlock patch and there's a TON of deadlocked tests
<zmike> somewhat weird because I don't recall there being this many before, but I guess that's how progress works
<ajax> :/
<zmike> definitely a :/ situation
<ajax> any pattern to the tests?
<zmike> uhhh
<zmike> bin/gl-1.0-drawbuffer-modes -auto
<zmike> bin/viewport-clamp -auto -fbo
<zmike> bin/gl-1.0-front-invalidate-back -auto
<zmike> bin/gl-1.0-swapbuffers-behavior -auto
<zmike> bin/masked-clear -auto
<ajax> mmm
<zmike> bin/read-front -auto
<ajax> so, not really a pattern. and none of them failing for me.
<zmike> :/
<ajax> yeah
<zmike> I'm even running them in xwayland
* ajax tries a real piglit run
<zmike> might be the kind of thing where the xserver just gets into a bad state from being spammed with too much bullshit
<zmike> but I didn't try running them again manually since I need to actually finish some regression tests
<ajax> no worries, i'm sure i'll provoke something
<ajax> nothing yet though
<zmike> weird
LexSfX has quit [Remote host closed the connection]
quantum5 has quit [Quit: ZNC - https://znc.in]
quantum5 has joined #zink
LexSfX has joined #zink
<ajax> i do have a few reproducible regressions in piglit i think
<zmike> I was getting a stupid amount of crashes during that run
daniels has quit [Write error: connection closed]
jabashque has quit [Write error: connection closed]
<zmike> which piglit profile are you using?
jabashque has joined #zink
daniels has joined #zink
<ajax> which what now? i'm just running the one test binary directly
zmike has quit [Read error: No route to host]
zmike has joined #zink
jabashque has quit [Read error: No route to host]
jabashque has joined #zink
<ajax> that seems like the only non-flaky regression here though
<zmike> think i missed whatever was said due to irccloud flakiness
<ajax> 16:16 < ajax> which what now? i'm just running the one test binary directly
<ajax> HOWEVER
<ajax> once again !15558 cures that assert
<zmike> I missed more than one line
<zmike> oh
* zmike clicks through this pastebin to another pastebin
<zmike> huh
<zmike> weird
<ajax> so i'm working through the build errors
<ajax> and i think i have found two distinct compiler bugs
<ajax> fedora's gcc hates one of the gtests that tries to do clever shit to figure out if stack growth works, but it says a variable is maybe used uninitialized when in fact only its address is ever considered
<ajax> and the debian-gallium build seems to have confused a label with a symbol: https://gitlab.freedesktop.org/ajax/mesa/-/jobs/20189239#L3345