ChanServ changed the topic of #zink to: official development channel for the mesa3d zink driver ||
<dj-death> anholt: wondering if you can help me make sense of a zink ci failure
<dj-death> anholt: here is the failure :
<dj-death> anholt: and that's a passing run :
<dj-death> anholt: apparently both run into a hang
<dj-death> anholt: that sounds odd, is that expected?
<dj-death> another interesting failure :
<dj-death> 2023-04-06 16:40:05.459149: ERROR - Piglit error: glx-make-current: ../src/vulkan/runtime/vk_image.c:57: vk_image_init: Assertion `pCreateInfo->extent.width > 0' failed.
<anholt> dj-death: the obvious line to me in the failing job log is 2023-04-06 14:44:50.293979: [ 650.772497] Fence expiration time out i915-0000:00:02.0:glcts[2102]:18d32!
<anholt> dj-death: and right before that the gpu was throttled to 150 mhz
<anholt> retry still hangs, but doesn't have the throttling, so that's probably a red herring
<anholt> your "passing" run is still clearly bad news, you've got new flakes.
<anholt> highly recommend being in #intel-ci and #zink-ci to see the stream of those
<anholt> I would grab a tgl, and c27.r1.caselist.txt from that first job, and see if you can run that caselist stably.
<anholt> (also, validation layer. note that I've got a WIP to do validation on zink on tgl so you don't have to worry about that as a dev)
<anholt> interesting that KHR-GL46.compute_shader.conditional-dispatching didn't fail in c27's first run, but did in the second. given its presence in the "passing" run, that feels very likely to be an important test to be looking at.
<anholt> glx-make-current just got kinda rewritten in piglit and we just merged that piglit to mesa. so perhaps it's now flaky on zink? haven't seen it in #zink-ci or issue #8759
<anholt> anyway, I'd start with sorting out the deqp fail before looking at glx-make-current
<dj-death> yep
<dj-death> so the thing is I grabbed 2 Gfx12 machines here, ran the entire GL46 CTS
<dj-death> no hang
<dj-death> one was TGL, the other ADL
<dj-death> tried the failing tests on simulation too, no issue
<anholt> did you download the caselist from mesa ci?
<dj-death> the main difference for me is that my TGL machine only has 8 threads rather than the 9 on the CI
<anholt> and run specifically that caselist?
<dj-death> I'm using the one in tree
<dj-death> ah no, just the entire thing with deqp-runner
<dj-death> same command line
<anholt> it's really important to use the specific caselist when debugging.
<dj-death> yeah I know
<anholt> we save it for you for a reason :)
<dj-death> thanks
<dj-death> will retry
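[For reference, re-running the exact caselist saved by CI might look roughly like the sketch below. The flag names follow deqp-runner's usual CLI, but the binary path, output directory, and job count are placeholders, not values taken from this log.]

```shell
# Sketch only, assumptions marked: /path/to/glcts is a placeholder for the
# CTS runner binary; c27.r1.caselist.txt is the caselist downloaded from the
# failing CI job, as suggested above.
deqp-runner run \
    --deqp /path/to/glcts \
    --caselist c27.r1.caselist.txt \
    --output results/ \
    --jobs 9 \
    -- --deqp-surface-width=256 --deqp-surface-height=256
```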
<anholt> Test case 'KHR-GL46.compute_shader.conditional-dispatching'..
<anholt> VUID-vkCmdCopyQueryPoolResults-dstBuffer-00824(ERROR / SPEC): msgNum: -632402239 - Validation Error: [ VUID-vkCmdCopyQueryPoolResults-dstBuffer-00824 ] Object 0: handle = 0x55c638e49860, name = zink cmdbuf, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0xda4e4ec1 | vkCmdCopyQueryPoolResults() storage required (0x10) equal to dstOffset + (queryCount * stride) is greater than the size (0x8) of buffer (VkBuffer 0x3f5e210000007121[]). The Vulkan spec
<anholt> states: dstBuffer must have enough storage, from dstOffset, to contain the result of each query, as described here (
<dj-death> oh..
<dj-death> I don't see that on my side
<anholt> that was on adl
<anholt> only appeared with the caselist, not a single testcase
<dj-death> right
<dj-death> nice.
<anholt> <-- how my "let's validate zink-on-anv" in ci is going :(
<dj-death> hang reproduced
<anholt> \o/
<dj-death> well fence expiration
<dj-death> yeah got it
<dj-death> booo
<dj-death> yeah I introduced a command that actually waits on the completion in vkCmdCopyQueryPoolResults
<dj-death> and it seems some queries are maybe not written to?
<dj-death> yep, copying a query that was never written to
<dj-death> that is very likely my issue