ChanServ changed the topic of #zink to: official development channel for the mesa3d zink driver || https://docs.mesa3d.org/drivers/zink.html
Sachiel_ has joined #zink
Sachiel has quit [Ping timeout: 480 seconds]
Sachiel_ has quit [Ping timeout: 480 seconds]
Sachiel_ has joined #zink
avane has quit [Ping timeout: 480 seconds]
avane has joined #zink
sinanmoh- has joined #zink
sinanmohd has quit [Read error: Connection reset by peer]
Sachiel_ is now known as Sachiel
<dj-death> anholt: wondering if you can help me make sense of a zink ci failure
<dj-death> anholt: here is the failure : https://gitlab.freedesktop.org/mesa/mesa/-/jobs/39468117
<dj-death> anholt: and that's a passing run : https://gitlab.freedesktop.org/llandwerlin/mesa/-/jobs/39463706
<dj-death> anholt: apparently both run into a hang
<dj-death> anholt: that sounds odd, is that expected?
<dj-death> another interesting failure : https://gitlab.freedesktop.org/mesa/mesa/-/jobs/39480051
<dj-death> 2023-04-06 16:40:05.459149: ERROR - Piglit error: glx-make-current: ../src/vulkan/runtime/vk_image.c:57: vk_image_init: Assertion `pCreateInfo->extent.width > 0' failed.
<anholt> dj-death: the obvious line to me in the failing job log is 2023-04-06 14:44:50.293979: [ 650.772497] Fence expiration time out i915-0000:00:02.0:glcts[2102]:18d32!
<anholt> dj-death: and right before that the gpu was throttled to 150 mhz
<anholt> retry still hangs, but doesn't have the throttling, so that's probably a red herring
<anholt> your "passing" run is still clearly bad news, you've got new flakes.
<anholt> highly recommend being in #intel-ci and #zink-ci to see the stream of those
<anholt> I would grab a tgl, and c27.r1.caselist.txt from that first job, and see if you can run that caselist stably.
<anholt> (also, validation layer. note that I've got a WIP to do validation on zink on tgl so you don't have to worry about that as a dev)
<anholt> interesting that KHR-GL46.compute_shader.conditional-dispatching didn't fail in c27's first run, but did in the second. given its presence in the "passing" run, that feels very likely to be an important test to be looking at.
<anholt> glx-make-current just got kinda rewritten in piglit and we just merged that piglit to mesa. so perhaps it's now flaky on zink? haven't seen it in #zink-ci or issue #8759
<anholt> anyway, I'd start with sorting out the deqp fail before looking at glx-make-current
<dj-death> yep
<dj-death> so the thing is I grabbed 2 Gfx12 machines here, ran the entire GL46 CTS
<dj-death> no hang
<dj-death> one was TGL, the other ADL
<dj-death> tried the failing tests on simulation too, no issue
<anholt> did you download the caselist from mesa ci?
<dj-death> the main difference for me is that my TGL machine only has 8 threads rather than 9 on the CI
<anholt> and run specifically that caselist?
<dj-death> I'm using the one in tree
<dj-death> ah no, just the entire thing with deqp-runner
<dj-death> same command line
<anholt> it's really important to use the specific caselist when debugging.
<dj-death> yeah I know
<anholt> we save it for you for a reason :)
<dj-death> thanks
<dj-death> will retry
<anholt> Test case 'KHR-GL46.compute_shader.conditional-dispatching'..
<anholt> VUID-vkCmdCopyQueryPoolResults-dstBuffer-00824(ERROR / SPEC): msgNum: -632402239 - Validation Error: [ VUID-vkCmdCopyQueryPoolResults-dstBuffer-00824 ] Object 0: handle = 0x55c638e49860, name = zink cmdbuf, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0xda4e4ec1 | vkCmdCopyQueryPoolResults() storage required (0x10) equal to dstOffset + (queryCount * stride) is greater than the size (0x8) of buffer (VkBuffer 0x3f5e210000007121[]). The Vulkan spec
<anholt> states: dstBuffer must have enough storage, from dstOffset, to contain the result of each query, as described here (https://www.khronos.org/registry/vulkan/specs/1.3-extensions/html/vkspec.html#VUID-vkCmdCopyQueryPoolResults-dstBuffer-00824)
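The size check quoted in the validation message above can be sketched in a few lines of C. This is only an illustration of the arithmetic the message describes (dstOffset + queryCount * stride compared against the buffer size). The helper names are hypothetical and are not taken from Mesa, zink, or the validation layers. The log gives only the totals (0x10 required, 0x8 available), so the split into offset, count, and stride used in the usage note is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: bytes of dstBuffer needed, using the formula
 * printed in the validation message:
 *   dstOffset + (queryCount * stride)
 */
uint64_t copy_query_required_bytes(uint64_t dst_offset,
                                   uint32_t query_count,
                                   uint64_t stride)
{
    return dst_offset + (uint64_t)query_count * stride;
}

/* Hypothetical helper: true when the destination buffer is large
 * enough for the copy described by the arguments.
 */
bool copy_query_fits(uint64_t buffer_size,
                     uint64_t dst_offset,
                     uint32_t query_count,
                     uint64_t stride)
{
    return copy_query_required_bytes(dst_offset, query_count, stride)
           <= buffer_size;
}
```

For instance, assuming two queries with an 8-byte stride at offset 0, the copy needs 0x10 bytes, which does not fit in an 0x8-byte buffer, matching the totals in the message.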
<dj-death> oh..
<dj-death> I don't see that on my side
<anholt> that was on adl
<anholt> only appeared with the caselist, not a single testcase
<dj-death> right
<dj-death> nice.
<anholt> https://gitlab.freedesktop.org/anholt/mesa/-/jobs/39482058#L2845 <-- how my "let's validate zink-on-anv" in ci is going :(
<dj-death> hang reproduced
<anholt> \o/
<dj-death> well fence expiration
<dj-death> yeah got it
<dj-death> booo
<dj-death> yeah I introduced a command that actually waits on the completion in CmdCopyQueryResults
<dj-death> and it seems some queries are maybe not written to?
<dj-death> yep, copying a query that was never written to
<dj-death> that is very likely my issue
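The failure mode described above (a copy that waits for completion on a query that was never written) can be illustrated with a toy model in C. This is not the anv or zink implementation. The struct, the function name, and the bounded poll count are all assumptions made for illustration, with the poll limit standing in for the fence expiration seen in the CI log.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a query slot: a result plus an availability flag
 * that is only set once something has written the query.
 */
struct toy_query {
    bool available;
    uint64_t result;
};

/* Toy model of a copy that waits for each query before copying it.
 * Returns true if every result was copied, false if a query never
 * became available within max_polls (the "timeout" in this model).
 */
bool toy_copy_results_wait(const struct toy_query *queries,
                           size_t count,
                           uint64_t *dst,
                           unsigned max_polls)
{
    for (size_t i = 0; i < count; i++) {
        unsigned polls = 0;

        /* A query that was never written stays unavailable, so the
         * wait can only end by hitting the poll limit.
         */
        while (!queries[i].available) {
            if (++polls >= max_polls)
                return false;
        }
        dst[i] = queries[i].result;
    }
    return true;
}
```

In this model, one unwritten query in the range is enough to make the whole copy time out, even though every written query before it is copied successfully.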