#zink on 2022-03-18 — irc logs at oftc.irclog.whitequark.org

2021-07-26 22:56 ChanServ changed the topic of #zink to: official development channel for the mesa3d zink driver || https://docs.mesa3d.org/drivers/zink.html

03:14 LexSfX has quit []

03:22 LexSfX has joined #zink

06:43 LexSfX has quit []

12:55 <zmike> ajax: I ran the thing several times yesterday with kopper and didn't hit that issue, so I'm of the opinion that it's not worth looking into further unless you're able to reproduce it with the full series

14:48 * ajax reupps

14:51 <ajax> [xcb] Unknown sequence number while processing queue

14:51 <ajax> [xcb] Most likely this is a multi-threaded client and XInitThreads has not been called

14:51 <ajax> [xcb] Aborting, sorry about that.

14:51 <ajax> glxgears: xcb_io.c:269: poll_for_event: Assertion `!xcb_xlib_threads_sequence_lost' failed.

14:52 <ajax> >_<

14:52 <zmike> yeah that's still the drisw xlib thing

14:52 <zmike> or at least it looks similar

14:53 <ajax> lovely

14:53 <zmike> this is glxgears?

14:54 <ajax> yeah, just hold down an arrow key to rotate them and it'll die eventually

14:54 <ajax> clearly something in event handling is Wrong

14:54 <zmike> TIL you can rotate them

14:55 <zmike> I just hit the same error in glmark2 though

14:55 <ajax> and, glmark2 still dies with a disconnect after the first [build] scene

14:55 <zmike> hm I got the exact same error you posted in mine

14:55 <zmike> it's weird that it only happens in these types of apps

14:56 <ajax> they're both xcb/xlib internal asserts, they're probably the same root cause just happening in different libraries

14:56 <zmike> yea

14:58 <ajax> so this machine is an intel coffeelake on the motherboard that is polite enough to stay active even if you plug in an rx480

14:59 <zmike> wow

14:59 <ajax> i'm using the radeon for display but i can DRI_PRIME= and test through anv too

14:59 <zmike> handy

14:59 <ajax> and that kinda runs glmark2 okay, like it's not crashing

14:59 <ajax> but it is doing this, which makes me very suspicious:

14:59 <ajax> [texture] texture-filter=nearest: FPS: 1006 FrameTime: 0.994 ms

14:59 <ajax> Error: glXCreateNewContext failed

14:59 <ajax> Error: CanvasGeneric: Invalid EGL state

14:59 <zmike> wat

14:59 <ajax> srs

15:00 <ajax> like, nothing we did touches that path

15:01 * ajax wields valgrind

15:01 <zmike> I'm assuming it's just the mismatch of using xlib+drisw in frontend and xcb in wsi

15:01 <zmike> somehow

16:11 <kusma> zmike: Seems !15429 fixed the crashes in dEQP-EGL.functional.color_clears.single_context.gles2.rgb888_window, but Valgrind is still not quite happy: https://gitlab.freedesktop.org/-/snippets/5071

16:11 <kusma> Seems we're still reading free'd memory...

16:13 <zmike> I still don't know where you're getting that test from

16:13 <zmike> it doesn't exist for me

16:15 <zmike> wondering if I should just put in a bo walk on context free to avoid having more rube goldberg code that impacts perf

16:19 <anholt> you have deqp-egl, right?

16:31 <kusma> I get it from 501679ad2d24cbfbd70c35ec459034a7cde41a82 (HEAD, tag: opengl-cts-4.6.2.0)

16:32 <kusma> It's listed in the egl-master.txt files

16:32 <kusma> zmike: ^

17:08 <zmike> do I need to build with some special flag or something?

18:09 <anholt> check build-deqp.sh for how it's built.

18:23 <kusma> zmike: just fell out of my build without anything special...

18:24 <ajax> zmike: context release is not a performance path, i don't think

18:36 <zmike> checking build scripts...

18:36 <zmike> ajax: right, that's why I was considering skipping any further and likely more complex "fixes" for scenarios that can only occur in the presence of context destruction

18:41 <zmike> I have deqp-egl

18:42 <zmike> hm I see now

18:43 <zmike> I've somehow broken my cts build

18:43 <zmike> there we go

18:52 <ajax> so i think i'm having a stroke

18:52 <zmike> what's your address so I can call emergency services

18:53 <Sachiel> sure thing, next you'll be asking for his SSN and mother's maiden name

18:54 <zmike> whoa whoa I'm not a US marshal or anything

18:58 <zmike> kusma: I've now run that test on main and kopper, in valgrind and asan, and I have no issues

19:00 <ajax> so i've got Xwayland and glmark both under gdb

19:01 <ajax> and i've hacked libGL to actually throw an error when CreateNewContext fails instead of silently absorbing it

19:02 <ajax> and on the Xwayland side i have a breakpoint set on the GLXIsDirect handler that ought to be generating the GLXBadContext on GLXIsDirect that i'm seeing, in the xcb error

19:02 <ajax> but: i hit the error handler in xlib before/without hitting __glXDisp_IsDirect

19:03 <ajax> in fact, without even hitting the __glXDisp_CreateNewContext corresponding to the request immediately above

19:04 <zmike> this feels like deja vu

19:04 <zmike> I've had similar things happen before

19:05 <ajax> i think there's something subtly wrong about how libglx is switching back and forth between xlib and xcb

19:05 <zmike> are you trying to tell me that not even glx is safe from zink?

19:06 <ajax> glx isn't really safe full stop

19:06 <zmike> hahahah

19:06 <anholt> I think they hadn't invented the idea of safe back then.

19:08 <ajax> XIO: fatal IO error 62 (Timer expired) on X server ":0"

19:08 <ajax> now that's a new one

19:24 <kusma> zmike: Hmm, odd... I'm seeing this on AMD, BTW.

19:34 <ajax> fixed it

19:34 <ajax> absolutely despise the fix, but fixed it

19:35 <ajax> like: if this is the kind of thing we need to do, hoo boy do we need to do a lot more of it

19:35 <zmike> uh oh

19:37 <ajax> https://paste.centos.org/view/raw/9765a49a

19:37 <zmike> seems bad

19:37 <zmike> you sure this can't be fixed by just not using xlib in glx?

19:38 <zmike> I thought we'd agreed that was likely to be the root of all evil previously

19:38 <ajax> there's a load-bearing "just" in that sentence

19:38 <ajax> i mean yeah i'd love to, it's just,

19:39 <zmike> well sure

19:40 <zmike> anyway, that seems awful and I'm glad you're the one who found it and not me

19:40 <zmike> not sure I can take another one of those this week

19:41 <ajax> i need to figure out why exactly that helps because it's clearly something to do with the xlib and xcb halves of the brain not talking to each other

19:42 <ajax> and, there's a lot of that

19:42 <zmike> :/

19:45 <ajax> otoh. there's only like 60 calls to GetReq in libglx.

19:45 <ajax> maybe this is less odious than i think

19:48 <ajax> but. libGL bakes in libX11 to its abi, because everything takes a Display * not an xcb_connection_t *

19:49 <ajax> so i'm still at the risk of needing to do whatever this handshake dance is on the way in to every GLX function, since something else in the process is probably using xlib

19:49 <ajax> which, fine, that's __glXSetupForCommand anyway, but ugh this is all horrible.

19:50 <zmike> not ideal at all

19:50 <ajax> can i have a beer yet

19:50 <zmike> rb

20:06 <ajax> so one insight here is, the reason i don't hit __glXDisp_IsDirect is because that's not where the GLXBadContext is generated

20:07 <ajax> because i have glvnd enabled in my Xwayland build for whatever reason, which means vnd is trying to look up the glx provider for the screen based on the incoming context's xid

20:08 <ajax> and not finding it, because i have indeed also not seen a GLXCreateNewContext request, it's still languishing in xlib's send buffer

20:09 <ajax> hence the flush fixing things: after XFlush xlib's write queue is empty, so when xcb sends the GLXIsDirect it's after the GLXCreateNewContext from the xlib side

20:10 <ajax> but you would have hoped that merely calling into xcb at all would have triggered the whole release callback bit of xcb_take_socket
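
[editor's note] The ordering problem ajax describes above can be sketched as a toy model. This is only an illustration, not how libX11 or libxcb are implemented: the classes `Wire`, `BufferedClient`, and `DirectClient` and the function `run` are hypothetical names, and the model assumes the xlib side holds its request in a local buffer while the xcb side's request reaches the server immediately. Real xcb buffers too and has the xcb_take_socket handoff, which is exactly the part that did not behave as hoped here.

```python
class Wire:
    """Stand-in for the single connection to the X server.

    It only records the order in which requests arrive.
    """

    def __init__(self):
        self.received = []


class BufferedClient:
    """Hypothetical model of the xlib side: requests wait until flushed."""

    def __init__(self, wire):
        self.wire = wire
        self.pending = []

    def request(self, name):
        # Queued locally; the server has not seen it yet.
        self.pending.append(name)

    def flush(self):
        # Analogous in spirit to XFlush: push everything queued onto the wire.
        self.wire.received.extend(self.pending)
        self.pending.clear()


class DirectClient:
    """Hypothetical model of the xcb side: the request goes out at once."""

    def __init__(self, wire):
        self.wire = wire

    def request(self, name):
        self.wire.received.append(name)


def run(flush_first):
    """Return the order in which the server sees the two GLX requests."""
    wire = Wire()
    xlib = BufferedClient(wire)
    xcb = DirectClient(wire)

    xlib.request("GLXCreateNewContext")
    if flush_first:
        xlib.flush()  # the workaround: drain the xlib queue first
    xcb.request("GLXIsDirect")
    xlib.flush()  # whatever is still queued goes out eventually
    return wire.received


if __name__ == "__main__":
    print("without flush:", run(flush_first=False))
    print("with flush:   ", run(flush_first=True))
```

Without the flush, the model's server sees GLXIsDirect for a context it has not been told about yet, which matches the GLXBadContext ajax observed; with the flush, the create request arrives first.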

20:11 <ajax> so... i hate it

20:22 <ajax> i have congealed the hate into a merge request

21:48 <zmike> my brain

23:21 <kusma> Uuuh, is 3472fed4da18d99622517af5aa5c32b1f797c299 correct? What happens if the vertex buffer is reused across multiple batches without re-binding?

23:51 xroumegue has quit [Ping timeout: 480 seconds]