#dri-devel on 2022-01-05 — irc logs at oftc.irclog.whitequark.org

2021-07-26 22:56 ChanServ changed the topic of #dri-devel to: <ajax> nothing involved with X should ever be unable to find a bar

00:01 <daniels> step 1: rebase it

00:01 <daniels> everything from there is line noise

00:02 <daniels> (there is an update to our most recent release in Insiders I believe, but some fixes still need cherry-picking)

00:02 * jekstrand isn't going to install an insider build

00:02 * jekstrand can wait for the eventual update

00:07 <daniels> even if it got you sweet tessellation shaders?

00:07 Company has joined #dri-devel

00:11 LexSfX has joined #dri-devel

00:34 rasterman has quit [Quit: Gettin' stinky!]

00:36 <alyssa> zmike: I was mimicking the style of the surrounding functions in that file

00:37 <alyssa> he started it, I swear!

00:38 pnowack has quit [Quit: pnowack]

00:45 <alyssa> u_box_clip_2d in the same file

01:05 tursulin has quit [Read error: Connection reset by peer]

01:06 <alyssa> idr: At risk of nerdsniping...

01:06 <alyssa> Do we optimize floating point booleans?

01:07 <alyssa> b2f32(inot(f2b1(x))) -> 1.0 - x

01:07 <alyssa> b2f32(iand(f2b1(x), f2b1(y))) -> fmul(x, y)

01:07 <alyssa> b2f32(ior(f2b1(x), f2b1(y))) -> fadd.sat(x, y)

01:08 <alyssa> ixor(f2b1(x), f2b1(y)) -> fneu(x, y)

01:08 <alyssa> etc

01:08 <alyssa> I have no idea if these patterns show up in any real shaders

01:09 <alyssa> But I'm revisiting boolean representations in the Bifrost/Valhall backend

01:09 camus1 has joined #dri-devel

01:10 <alyssa> The hardware supports 0/1, 0/~0, and 0/1.0f representations.

01:10 ahajda has quit []

01:10 <alyssa> Mapping the mess of conversions we get from NIR to efficient such code is nontrivial.

01:11 <alyssa> I am not interested in bringing up NOLTIS for this :-p

01:11 camus has quit [Ping timeout: 480 seconds]

01:12 <alyssa> At a local level, it's easy enough to fuse conversions with comparisons, i.e.

01:12 <alyssa> b2f32(flt(x, y)) -> flt.f(x, y)

01:13 <alyssa> That's not enough for optimal handling globally

01:13 <alyssa> But probably any such global optimization belongs in nir_opt_algebraic and not a backend pass that's too clever for its own good

01:14 <alyssa> Of course, there are even more obscure/obnoxious patterns our hardware allows for booleans..

01:14 <gawin> anholt: can you try to run glsl-fs-reflect from "gpu" on last MR? (branch r300-fixes) maybe this also helps with test I'm debugging now

01:15 <alyssa> Here's a delightfully obscure one...

01:15 <alyssa> Suppose X and Y are fp16 vec2.

01:15 <anholt> gawin: sorry, time for me to be done for today. I was just trying to get a stable baseline so I could test the NTT RA MR on r300

01:15 <alyssa> Recall on Bifrost, fp16 vec2 is packed into a 32-bit register and provides vectorized ALU, like GCN

01:16 <alyssa> `all(greaterThan(X, Y)) ? t : f`

01:17 <alyssa> At first blush this involves a lot of boolean representation conversions -- iand of 16-bit bools, zero extend that to a 32-bit bool, and feed that into the MUX.i32 instruction

01:17 <alyssa> but actually, I claim that can be just two instructions

01:17 <alyssa> FCMP.v2f16.i1 temp, X, Y

01:18 <alyssa> MUX.v2i16 f, t, temp

01:18 <alyssa> [Correction: `all` should have been `any`, and iand should have been ior]

01:18 <alyssa> [Correction 2: MUX.i32]

01:18 <alyssa> [MUX.i32 = roughly NIR bcsel]

01:19 columbarius has joined #dri-devel

01:19 <alyssa> What's happening? The comparison results in a vector of 16-bit booleans packed into a 32-bit register

01:19 <alyssa> That register is truthy if /either/ boolean is true, and falsey (0) if /both/ booleans are false.

01:20 <alyssa> So we don't need an actual ior+zero-extend, we just reinterpret the v2b16 as a b32... but that b32 isn't even valid (0x10001 as a truth value, possibly)

01:20 <alyssa> but it's still ... correct

01:20 co1umbarius has quit [Ping timeout: 480 seconds]

01:21 <alyssa> Anyway. These sorts of pathological tricks are making me give up on trying to do optimal boolean handling since that's looking to be AI-hard :-p

01:25 <alyssa> At least the old version I have of the Arm compiler doesn't do that silliness.

01:26 <alyssa> ...Cute. They implement b2f32 with i1 as U8_TO_F32, one instruction

01:27 <alyssa> and I assume b2f16 as V2U8_TO_V2F16 which is vectorized, wow! :-p

01:29 <alyssa> So maybe the real answer is "do as much optimization in NIR as we can, then do as much peephole optimization as we can to fuse stuff, and then just fallback to 0/1 booleans which are usually fast."

01:29 ngcortes has quit [Remote host closed the connection]

01:30 <alyssa> b2b32 i suppose is just ineg

01:30 <alyssa> b2i* is just zero extend

01:36 <gawin> anholt: also helps, noice

01:38 <alyssa> dschuermann: All of this depends on being able to vectorize comparisons of 16-bit floats, which nir_opt_vectorize can't do in packed vec2 mode...

01:41 Company has quit [Quit: Leaving]

01:47 alatiera has joined #dri-devel

01:55 <jekstrand> daniels: I like tessellation shaders and all but no.

01:55 <jekstrand> daniels: Also, pretty sure that I wouldn't need to build anything super-funky to get tessellation shaders. I remember the days of building wayland/wayland-protocols/weston/xwayland/whatever and don't want to go back.

01:55 <jekstrand> Why do you think I don't do WSI anymore?

02:03 nchery has quit [Ping timeout: 480 seconds]

02:17 <airlied> WSI will come and find you :-P

02:30 heat_ has quit []

02:30 heat has joined #dri-devel

02:30 heat has quit []

02:31 heat has joined #dri-devel

02:34 camus has joined #dri-devel

02:35 camus1 has quit [Remote host closed the connection]

02:45 alanc has joined #dri-devel

02:51 gawin has quit [Ping timeout: 480 seconds]

03:14 camus1 has joined #dri-devel

03:15 <ccr> sounds like a premise for a horror movie/story

03:18 fxkamd has quit []

03:19 camus has quit [Ping timeout: 480 seconds]

03:19 JohnnyonFlame has quit [Ping timeout: 480 seconds]

03:22 sagar_ has quit [Remote host closed the connection]

03:23 sagar_ has joined #dri-devel

03:27 remexre has quit [Remote host closed the connection]

03:28 remexre has joined #dri-devel

03:40 YuGiOhJCJ has joined #dri-devel

03:59 aravind has joined #dri-devel

04:02 YuGiOhJCJ has quit [Remote host closed the connection]

04:03 YuGiOhJCJ has joined #dri-devel

04:13 NiksDev has joined #dri-devel

05:27 Duke`` has joined #dri-devel

05:33 <jekstrand> airlied: Yeah, I know. "You can run but you can't hide" and all that. But I can run really fast. :-P

05:33 <jekstrand> Maybe I should to work on wifi... No window-systems there...

06:02 sdutt has quit [Remote host closed the connection]

06:10 <HdkR> jekstrand: Aperture windows are a type of window system.

06:24 mbrost has quit [Read error: Connection reset by peer]

06:40 itoral has joined #dri-devel

06:52 <idr> alyssa: I think the Intel compiler does some optimizations like that in the backend. In the b2f(inot(x)) case, we can use an integer addition that writes a floating point source.

06:52 <idr> It becomes a single instruction.

06:53 <idr> Like, 'add r16F, -r12D, 1D'

06:53 thellstrom1 has quit [Remote host closed the connection]

06:54 <idr> I might have some branches that try to do some of those other things, but I don't think any of them produced any interesting results.

06:55 thellstrom has joined #dri-devel

06:56 idr has quit [Quit: Leaving]

07:04 Wally has joined #dri-devel

07:05 Duke`` has quit [Ping timeout: 480 seconds]

07:06 lemonzest has joined #dri-devel

07:29 <krh> daniels: thanks for the compliment! on the other hand, I think I missed an opportunity for doing decorations in the style of weston-flower

07:30 <cworth> Hey, it's krh and idr (or at least it was idr...). Anyway, long time no see!

07:31 <krh> cworth: hi!

07:44 gouchi has joined #dri-devel

07:44 gouchi has quit []

07:55 tango_ is now known as Guest10331

07:55 tango_ has joined #dri-devel

07:55 Wally has quit [Remote host closed the connection]

08:00 heat has quit [Ping timeout: 480 seconds]

08:01 Guest10331 has quit [Ping timeout: 480 seconds]

08:13 mlankhorst has joined #dri-devel

08:18 <dschuermann> alyssa: amd doesn't have packed comparisons (but real single bit bools). feel free to add it to the vectorizer, though :P

08:20 <dschuermann> for the algebraic optimizations, try if you find any affected application. I could also have a look if you send me a branch :)

08:25 Major_Biscuit has joined #dri-devel

08:29 Major_Biscuit has quit []

08:29 MajorBiscuit has joined #dri-devel

08:32 ahajda has joined #dri-devel

08:53 ppascher has quit [Ping timeout: 480 seconds]

08:55 tursulin has joined #dri-devel

09:12 pcercuei has joined #dri-devel

09:22 <bbrezillon> I'm reading https://www.khronos.org/registry/vulkan/specs/1.2/html/chap6.html#commandbuffers-submission-progress and I'm not sure I understand one specific aspect correctly. When the spec says "Command buffers in the submission can include vkCmdWaitEvents commands that wait on events that will not be signaled by earlier commands in the queue. Such events must be signaled by the

09:22 <bbrezillon> application using vkSetEvent, and the vkCmdWaitEvents commands that wait upon them must not be inside a render pass instance. The event must be set before the vkCmdWaitEvents command is executed.", does that mean any CPU-signaled event should be signaled before the command buffers waiting on it are submitted to the queue?

09:23 <bbrezillon> if it does, wouldn't it imply that this deqp test https://github.com/KhronosGroup/VK-GL-CTS/blob/master/external/vulkancts/modules/vulkan/api/vktApiCommandBuffersTests.cpp#L3679 is wrong

09:23 <bbrezillon> ?

09:25 <bbrezillon> for the record, the panvk implementation currently assumes all CPU-signaled events are signaled before the submission of command buffers waiting on these events, and this test fails on panvk, but I'm not sure if it's the driver that's at fault or the test

09:37 mvlad has joined #dri-devel

09:46 tango_ is now known as Guest10340

09:46 tango_ has joined #dri-devel

09:46 camus has joined #dri-devel

09:48 thellstrom has quit [Quit: thellstrom]

09:49 thellstrom has joined #dri-devel

09:49 pnowack has joined #dri-devel

09:51 camus1 has quit [Ping timeout: 480 seconds]

09:53 Guest10340 has quit [Ping timeout: 480 seconds]

10:01 gawin has joined #dri-devel

10:21 camus1 has joined #dri-devel

10:21 pnowack has quit [Quit: pnowack]

10:22 pnowack has joined #dri-devel

10:25 camus has quit [Ping timeout: 480 seconds]

10:27 YuGiOhJCJ has quit [Quit: YuGiOhJCJ]

10:37 rasterman has joined #dri-devel

10:56 Major_Biscuit has joined #dri-devel

11:02 MajorBiscuit has quit [Ping timeout: 480 seconds]

11:10 <bnieuwenhuizen> bbrezillon: the existence of this VU suggests that assumption is wrong: "If pEvents includes one or more events that will be signaled by vkSetEvent after commandBuffer has been submitted to a queue, then vkCmdWaitEvents must not be called inside a render pass instance"

11:11 <bnieuwenhuizen> (on vkCmdWaitEvents)

11:13 <bnieuwenhuizen> I do vaguely remember the behavior being tightened a bit at one point to avoid enqueuing multiple cmdbuffers before setting an event for the first cmdbuffer, but not sure offhand how it was tightened

11:14 agd5f has quit [Ping timeout: 480 seconds]

11:39 <bbrezillon> bnieuwenhuizen: ok, I guess I mistaken 'executed' for 'queued' in the "The event must be set before vkCmdWaitEvents command is executed" sentence

11:40 <bbrezillon> but that also greatly complexifies the implementation since there's no native support for GPU-side memory-value polling in Midgard/Bifrost GPUs

11:41 <bnieuwenhuizen> hmm, missed that one

11:42 <bnieuwenhuizen> kinda hard to see how an app can reliably trigger it before execution but after queueing

11:42 <bnieuwenhuizen> bbrezillon: where is that in the spec?

11:42 <bbrezillon> I thought we could use syncobjs (what the current implementation does), but that doesn't work unless we instantiate a background thread to delay queuing until the event is signaled/set

11:43 <bbrezillon> bnieuwenhuizen: 3rd paragraph in https://www.khronos.org/registry/vulkan/specs/1.2/html/chap6.html#commandbuffers-submission-progress

11:43 <bbrezillon> last sentence

11:44 <bnieuwenhuizen> bbrezillon: ok, then I suspect the test is wrong now

11:47 flacks has quit [Quit: Quitter]

11:48 flacks has joined #dri-devel

11:51 <bbrezillon> bnieuwenhuizen: ok. The wording is a bit ambiguous though, and the fact that the deqp test delays the event signaling on purpose led me to think that this was a valid case, but maybe this part of the spec was added afterwards

12:02 <bbrezillon> apparently it's been added in 1.1.104 https://github.com/KhronosGroup/Vulkan-Docs/commit/476e3f422dc251065c535c1d8a5cfc58f1cff3c9

12:03 <bbrezillon> but it seems to be backported to all spec revisions

12:21 <cwabbott> bbrezillon: aren't syncobjs a kernel-level sort of thing? so are you splitting up submissions with events in them? that would be bad

12:22 <bbrezillon> yep

12:22 <cwabbott> events are supposed to be used as a more flexible alternative to pipeline barriers

12:22 <cwabbott> so inta-command-buffer signalling

12:22 <bbrezillon> and that's certainly inefficient

12:23 <bnieuwenhuizen> ah my guess would've been that you track which events are set in the same cmdbuffer and only use syncobj corss-cmdbuffer

12:23 <cwabbott> I think on mali events are "supposed" to map to an empty job

12:23 <bbrezillon> empty job?

12:24 <cwabbott> just a job that does nothing but adds dependencies

12:24 <bbrezillon> I mean, yes, it can serve as a dependency

12:24 <bbrezillon> but then you need to live patch the job to CPU-signal the event

12:25 <cwabbott> but by the time the job "executes" it's already been signalled

12:25 <bbrezillon> I was considering adding a compute job that polls a value in memory

12:26 <cwabbott> not 100% sure that works

12:26 <bbrezillon> oh, ok, so that doesn't really address the CPU-signaling issue I was reporting, but it make things more efficient by avoid a job chain split

12:26 <bbrezillon> *by avoiding

12:26 <cwabbott> but at least for simple things where the signalling and waiting happens in the same command buffer, I think letting the JS handle the dependencies directly would be best

12:27 <cwabbott> you might need to fall back to the compute shader if it's not signalled earlier in the same command buffer

12:28 <bbrezillon> absolutely, and we'll probably revise the implementation when we get a more or less functional 1.0 vulkan driver, but we wanted to keep things simple at first

12:28 <dj-death> bbrezillon: you have only one VkQueue in your implementation?

12:28 <bbrezillon> yes

12:29 <cwabbott> just be aware that you'll probably have a bunch of infrastructure for splitting up submissions that you're going to need to nuke

12:29 <cwabbott> and any time spent fixing bugs etc. there is going to be wasted time

12:29 <dj-death> bbrezillon: cool, so you don't have to worry about vkCmdSetEvent() ;)

12:30 <cwabbott> I'd rather go with the compute shader thing, since that's simpler and you're going to need it as a backup even with the more-optimal solution so it's not throw-away work

12:32 <dj-death> if your HW supports preemption, you can rely on that to have some kind of progression of other work

12:33 <bbrezillon> cwabbott: yeah, I'm well aware of that, on the other hand, we already need to split batches when there's too many jobs in the cmd buffer (16k jobs IIRC), so part of the batch splitting logic is needed anyway

12:33 <bbrezillon> and I suspect we need to keep syncobjs for event wait/set happening at the batch boundary anyway

12:34 <cwabbott> events explicitly aren't supposed to map to syncobjs

12:34 <bbrezillon> dj-death: it does, it's just not hooked-up kernel side

12:35 <cwabbott> semaphores are the vulkan equivalent for syncobjs

12:35 <dj-death> and fences

12:35 <bbrezillon> right, events are supposed to be signaled before the job is scheduled

12:35 <cwabbott> ah yeah, fences map to syncobjs too

12:36 <dj-death> bbrezillon: I don't think the events are supposed to be signaled before the job runs

12:36 <dj-death> can be later

12:36 <cwabbott> also, I thought you'd only submit 1 job per command buffer

12:37 <cwabbott> not sure why you'd need to split it up

12:37 <dj-death> bbrezillon: which is why you would need to spin in a compute shader

12:37 <bbrezillon> well, yes, one vertex/fragment job pair, that's the case most of the time, but the amount of jobs per chain is limited

12:37 <cwabbott> a command buffer should compile into a linked list of jobs

12:37 <bbrezillon> by the job id field

12:38 <cwabbott> oh, right, ok

12:38 <cwabbott> too bad :(

12:38 dreda has quit [Quit: Reconnecting]

12:38 <bbrezillon> and there's already one deqp test that reaches the limit

12:38 dreda has joined #dri-devel

12:38 <cwabbott> fun fun

12:38 dreda has quit []

12:39 dreda has joined #dri-devel

12:39 <bbrezillon> dj-death: yeah, I meant, for anything that's GPU-signaled, it should happen before the job is scheduled

12:39 dreda is now known as Guest10360

12:40 <dj-death> bbrezillon: yeah, depends on how many level of scheduling you have in HW ;)

12:40 <bbrezillon> 1 :)

12:40 <dj-death> bbrezillon: if it's just the kernel, yep

12:40 <bbrezillon> so, yes, in my case it's simple

12:40 <dj-death> ish ;)

12:40 <cwabbott> anyway, on every other HW impl an event is just a piece of memory which is written to and polled, so if you can do something like that you'll probably have way less breaking other people's expectations

12:40 Guest10360 is now known as dreda

12:41 <cwabbott> if you can figure out what command signals the event, you can just add it as a dependency instead of polling, of course

12:41 <dj-death> on our HW the command streamer read the value in memory and if it's not one, just raises an interrupt to tell i915 (now GuC) to say it's blocked

12:41 <bbrezillon> that's what I wanted to do, but then I realized it could monopolize a GPU core for an unbounded amount of time (which can be bounded if we add some sort of timeout in the compute shader), and I thought we could avoid that

12:42 <bbrezillon> tomeu: ^

12:42 <dj-death> then that scheduler can do something else

12:43 <bbrezillon> cwabbott: I can write a value with a WRITE_VALUE job (so no compute shader needed for CmdSetEvent())

12:43 <cwabbott> right

12:43 <bbrezillon> it's the polling that's problematic

12:43 <dj-death> HSW is the HW were we would need to spin, except we can't really preempt (iirc) so it sucks :)

12:43 <cwabbott> I think the best you can do there is the compute shader thing

12:45 <cwabbott> the point of an event is to act like a "separated barrier" where the waiting for earlier commands and signalling later commands is split up, so you can run other things in between

12:45 <cwabbott> but I don't think you'd run more than one or two things in between

12:45 <cwabbott> so ping-ponging to the kernel might really suck

12:46 <bbrezillon> yeah, I think that's part of the debate we had with tomeu back when he added event support to panvk

12:48 <bbrezillon> none of us were entirely happy with the syncobj approach, but it seemed simple enough to give it a try and get this feature implemented, knowing we'd have to rework it at some point

12:51 <bbrezillon> oh and I was worried about the compute job solution monopolizing a core, I admit. But given we already have a timeout for the entire job chain, maybe it's not such a big deal

12:51 <cwabbott> dj-death: is the scheduler really capable of running other stuff in parallel with the command buffer with the event? otherwise that would be pointless

12:52 <dj-death> cwabbott: yeah, on gen8+ the scheduler just swaps the current context for something else (if available)

12:55 <cwabbott> so you can have multiple contexts active at the same time?

12:56 <cwabbott> you can only wait after you signal events, so the current context is always doing something while you wait for the event

12:56 <dj-death> we only have a single context active at a time atm

12:56 <cwabbott> so then it's pointless

12:56 <dj-death> because there is only one engine

12:57 <dj-death> you can always put a different application on the GPU while the current one is blocked on an unsignaled event

12:59 <cwabbott> you can't though, because by the time the current context is done and you can switch to the next context your event is already signalled and its pointless

12:59 <cwabbott> you're not actually saving anything

13:00 <dj-death> I don't quite follow :(

13:00 <dj-death> if the VkEvent is going to be signaled from the CPU right after a sleep(10); call

13:01 <dj-death> you've got 10s of GPU time that can be used for anything else

13:01 <cwabbott> ah, but the event has to be signalled before the command executes

13:01 <cwabbott> if it's signalled on the CPU

13:01 lemonzest has quit [Quit: WeeChat 3.3]

13:01 <cwabbott> as per the text bbrezillon pointed out

13:01 <cwabbott> because it explicitly isn't designed for that use-case

13:03 lanodan_ has joined #dri-devel

13:04 <dj-death> cwabbott: I'm not sure this is right

13:04 <dj-death> cwabbott: just because of this VU :

13:04 <dj-death> "If pEvents includes one or more events that will be signaled by vkSetEvent after commandBuffer has been submitted to a queue, then vkCmdWaitEvents must not be called inside a render pass instance

13:04 <dj-death> "

13:05 <bnieuwenhuizen> dj-death: might be stale, because as pointed out the requirement about being signalled on the CPu before execuation was added later

13:05 lanodan has quit [Ping timeout: 480 seconds]

13:05 lemonzest has joined #dri-devel

13:06 <dj-death> I thought that only applied to GPU signaled events

13:06 <bnieuwenhuizen> "Such events must be signaled by the application using vkSetEvent, and the vkCmdWaitEvents commands that wait upon them must not be inside a render pass instance. The event must be set before the vkCmdWaitEvents command is executed."

13:06 <bnieuwenhuizen> it talking about vkSetEvent makes it about CPU signaled in my mind

13:07 nchery has joined #dri-devel

13:07 <bnieuwenhuizen> (especially with the earlier sentence "Command buffers in the submission can include vkCmdWaitEvents commands that wait on events that will not be signaled by earlier commands in the queue.")

13:10 <dj-death> I don't know :)

13:11 <dj-death> the test bbrezillon pointed to seems to do exactly that

13:11 <dj-death> (signal from host after submission)

13:12 <cwabbott> the test author maybe made the same mistake you did, or the test was added before that spec text was added

13:14 <cwabbott> I think "split pipeline barrier" is what events were always intended to be, but I guess at the beginning the spec wasn't super-clear on what was allowed

13:16 * dj-death files a khronos issue

13:17 <bnieuwenhuizen> yeah I think the spec was changed because before the spec-change the spec allowed submitting an unlimited number of cmdbuffers before signalling the event

13:27 <bbrezillon> cwabbott: correction, the maximum number of jobs per chain is 64k-1 (16bit field), not 16k, given a draw is formed by a vertex+tiler job pair, it's limited to 32k draws

13:28 <bbrezillon> anyway, I'll add this vkevent rework to my TODO-list

13:28 <bbrezillon> bnieuwenhuizen, cwabbott, dj-death: thanks for chiming in

13:30 <dj-death> 5~sure

13:31 <dj-death> issue 2971 in khronos if anybody's interested

13:36 Peste_Bubonica has joined #dri-devel

13:37 itoral has quit [Remote host closed the connection]

13:37 mwalle has joined #dri-devel

13:44 mlankhorst has quit [Remote host closed the connection]

14:23 <jenatali> What's even the point of having CPU signals for VkEvents if they need to be signaled before the command buffer is submitted?

14:24 <jenatali> Or, I guess if it's before the command is executed, but I don't see how you could CPU interop such that the command buffer starts executing, the CPU follows along, and signals the event before the wait is reached... but still what's the point, the wait will always be a no-op if it's for a CPU-signaled event

14:29 <dj-death> hopefully that'll be clarified next week ;)

14:29 <dj-death> I don't find much point either :(

14:30 sdutt has joined #dri-devel

14:37 <bnieuwenhuizen> jenatali: I guess the thing that is still enabled by this is having an event that can either be set by the GPU or the CPU

14:38 <bnieuwenhuizen> the GPU case obviously being the useful one, but if you don't want to do e.g. predecssor work at some point you can set it from the CPU

14:38 <bnieuwenhuizen> pretty contrived of course

14:38 <jenatali> Oh, I see. The ability to record a command buffer with a wait without having to care whether there will be a signaler submitted before it. If not, the CPU can just signal it to make it a no-op. I see

14:39 <alyssa> dschuermann: Alright, I assumed fp16vec2 meant packed comparisons. Probably trivial to add to the vectorizer but that means /more/ flags. So maybe should wait until the generic scalarization callbacks are a thing..

14:39 <bnieuwenhuizen> yeah. Pretty sure it is not what was originally envisioned though :)

14:42 Peste_Bubonica has quit [Quit: Leaving]

14:57 nchery has quit [Ping timeout: 480 seconds]

15:01 aravind has quit []

15:16 nchery has joined #dri-devel

15:16 <dschuermann> alyssa: with the callback function, it would be trivial. I'll see to update the MR

15:17 gawin has quit [Ping timeout: 480 seconds]

15:18 <alyssa> dschuermann: ack

15:19 <alyssa> in theory Mali also wants int8vec4

15:19 Daanct12 has joined #dri-devel

15:19 <alyssa> which again trivial with callback but not otherwise

15:19 <alyssa> (and that has all the same issues with OpenCL -- 8-bit vec16 should be partially scalarized to 4 vec4 ops)

15:20 fxkamd has joined #dri-devel

15:21 ella-0 has joined #dri-devel

15:25 ella-0_ has quit [Remote host closed the connection]

15:25 Danct12 has quit [Ping timeout: 480 seconds]

15:26 Daaanct12 has joined #dri-devel

15:28 Daaanct12 has quit [Remote host closed the connection]

15:29 Daaanct12 has joined #dri-devel

15:32 sdutt has quit []

15:32 sdutt has joined #dri-devel

15:33 Daanct12 has quit [Ping timeout: 480 seconds]

15:45 nchery has quit [Ping timeout: 480 seconds]

15:48 nchery has joined #dri-devel

15:51 Haaninjo has joined #dri-devel

16:06 remexre has quit [Ping timeout: 480 seconds]

16:11 bluebugs has quit [Read error: Connection reset by peer]

16:12 bluebugs has joined #dri-devel

16:12 <dschuermann> alyssa: regarding swizzles outside of vec width (e.g. iadd .xz, .xz): I'm not entirely sure if I should just avoid creating these or also care about lowering if they already exist

16:14 <dschuermann> I did encounter some iadd .xzyy thing. so, when lowering I could either create them temporarily, and lower further (which has the advantage of also catching the case if it already exists in source)

16:15 mbrost has joined #dri-devel

16:16 <dschuermann> or I could just avoid to create them which has the advantage of potentially better vectorization (e.g. iadd .xzzx could keep the middle channels, but that probably never happens)

16:21 jewins has joined #dri-devel

16:21 Duke`` has joined #dri-devel

16:22 bluebugs has quit [Quit: Leaving]

16:25 remexre has joined #dri-devel

16:25 Daanct12 has joined #dri-devel

16:29 bluebugs has joined #dri-devel

16:32 Daaanct12 has quit [Ping timeout: 480 seconds]

16:33 remexre has quit [Ping timeout: 480 seconds]

16:34 remexre has joined #dri-devel

16:37 ahajda has quit []

16:43 karolherbst_ has joined #dri-devel

16:43 <MrCooper> ishitatsuyuki: LatencyFleX looks nice, おめでとう

16:44 karolherbst has quit [Remote host closed the connection]

16:52 Danct12 has joined #dri-devel

16:52 gawin has joined #dri-devel

16:54 <jekstrand> bbrezillon: That's an espeically thorny corner.

16:55 <jekstrand> Early on, there were CTS tests that would vkQueueSubmit(), wait 1s or so, and then vkSetEvent(). Those have been deleted from the CTS.

16:56 <jekstrand> These days, I believe it's the responsibility of the client to call vkSetEvent() before the vkCmdWaitEvents() executes. So you can vkQueueSubmit() with a wait on a timeline semaphore and then vkSetEvent() followed by vkSignalSemaphore() to kick it off but that's the only sort of wait-before-signal you can get.

16:56 <jekstrand> Sorry if that was already covered. Didn't read the full scroll-back.

16:56 * jekstrand heads out to pick up his groceries.

16:59 <bnieuwenhuizen> jekstrand: the spec change was made but CTS test not removed AFAIU

16:59 Daanct12 has quit [Ping timeout: 480 seconds]

17:05 <airlied> bnieuwenhuizen: b45f4268074897cb4ee7da4b81a17b310301d77b removed some

17:07 <airlied> https://github.com/KhronosGroup/VK-GL-CTS/commit/35f2e29869ebdb643f7976971c89509a6647b656

17:07 <bnieuwenhuizen> ah we just have more tests to delete then

17:07 gawin has quit [Ping timeout: 480 seconds]

17:07 <airlied> I think the argument might be that they've put a timeout on the waits

17:08 <airlied> so it those tests might satisify the forward progress requirements

17:08 <bnieuwenhuizen> well, as stated it is still not valid behavior

17:08 dliviu has quit [Read error: No route to host]

17:08 <bnieuwenhuizen> "The event must be set before the vkCmdWaitEvents command is executed." is pretty unambiguous

17:09 dliviu has joined #dri-devel

17:14 dliviu has quit []

17:16 dliviu has joined #dri-devel

17:16 <alyssa> dschuermann: 🤷

17:25 loki_val has joined #dri-devel

17:25 dliviu has quit []

17:27 crabbedhaloablut has quit [Ping timeout: 480 seconds]

17:31 <dschuermann> alyssa I updated the MR, have a look :)

17:31 <alyssa> dschuermann: Looking now :)

17:31 <alyssa> "afterwards checking if all swizzles are in bounds. If that is not the case, the targeted width gets cut in half"

17:31 <alyssa> I'm not 100% sure what you mean by this

17:32 dliviu has joined #dri-devel

17:33 <alyssa> case nir_op_ishl: /* TODO: in NIR, these have 32bit shift operands */

17:33 <alyssa> case nir_op_ishr: /* while Radeon needs 16bit operands when vectorized */

17:33 <alyssa> Oh you also have the "NIR shifts can't be vectorized" problem like us :(

17:35 rpigott has quit [Remote host closed the connection]

17:36 <dschuermann> yes :/

17:37 rpigott has joined #dri-devel

17:39 <anholt> if anyone has some time for helping nir-to-tgsi along, https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14404 is the thing we need for r600 and nv30.

17:40 gouchi has joined #dri-devel

17:42 Major_Biscuit has quit [Ping timeout: 480 seconds]

17:43 iive has joined #dri-devel

17:44 <dschuermann> alyssa the lowering passes revisit the lowered instructions before they continue

17:45 <alyssa> anholt: I don't understand TGSI enough to feel comfortable reviewing, sorry :(

17:46 <alyssa> I think I agree doing RA with tgsi regs and not NIR defs is the right approach, so R-b on the NIR delete if nothing else 😅

18:02 ybogdano has joined #dri-devel

18:04 Major_Biscuit has joined #dri-devel

18:09 rpigott has quit [Read error: Connection reset by peer]

18:10 rpigott has joined #dri-devel

18:16 Major_Biscuit has quit [Ping timeout: 480 seconds]

18:24 idr has joined #dri-devel

18:33 <austriancoder> anholt: I have updated my piglit MR. The plans is land it, update piglit in mesa and rebase my arb_shadow stuff on top of it to update the CI passes.

18:37 gawin has joined #dri-devel

18:38 <anholt> austriancoder: \o/

18:50 xlei_ has joined #dri-devel

18:52 mattrope has joined #dri-devel

18:53 xlei_ has left #dri-devel [#dri-devel]

18:53 <alyssa> dschuermann: thanks for the patch, here's more work to do :-p

18:53 xlei_ has joined #dri-devel

18:54 xlei is now known as Guest10400

18:54 xlei_ is now known as xlei

18:54 Guest10400 has quit [Ping timeout: 480 seconds]

18:57 camus has joined #dri-devel

18:59 camus1 has quit [Read error: Connection reset by peer]

19:00 nchery has quit [Ping timeout: 480 seconds]

19:11 nchery has joined #dri-devel

19:19 camus1 has joined #dri-devel

19:20 camus has quit [Read error: Connection reset by peer]

19:37 sagar_ has quit [Ping timeout: 480 seconds]

19:44 pnowack has quit [Quit: pnowack]

19:50 sagar_ has joined #dri-devel

19:52 <alyssa> anholt: dEQP-GLES31.functional.ssbo.layout.random.all_shared_buffer.36 is so cursed :v

19:53 <anholt> would sure be nice if we could improve its perf some more

19:53 <alyssa> IIRC that hammers RA

19:57 <jenatali> Ugh I'm getting what I think is coordinate space mismatches with tess between GL and D3D but I can't quite figure it out

19:58 <jenatali> For some reason in the quad tessellation shader test I'm getting irregular-sized quads

20:01 heat has joined #dri-devel

20:02 <anholt> alyssa: wait. I'm not getting slow results on running it singly on radeonsi, crocus, or virgl right now.

20:02 <alyssa> Hm

20:03 <anholt> ah, 0f11aa915a86dd9fb86eedc1d54a90537693701b

20:03 <anholt> in the CTS

20:04 <anholt> which is blocked on uprev due to llvmpipe

20:04 <anholt> (https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13779)

20:04 <alyssa> what commit is 0f11aa915a86dd9fb86eedc1d54a90537693701b?

20:04 <anholt> in deq

20:04 <anholt> p

20:06 <alyssa> ah https://github.com/KhronosGroup/VK-GL-CTS/commit/0f11aa915a86dd9fb86eedc1d54a90537693701b?

20:06 <alyssa> "Limit dimensionality of arrays-of-arrays in random SSBO tests"

20:06 <alyssa> Yeah that'll do it

20:06 <alyssa> anholt: I am conflicted on that commit

20:06 <alyssa> On one hand, that should shave some time off our deqp-gles31 runs in CI

20:07 <alyssa> On the other hand, that test case is the best RA torture test in deqp-gles

20:07 <anholt> spirv_ids_abuse is pretty good too.

20:08 <alyssa> that's vk

20:08 <anholt> well, get on it.

20:08 <anholt> :D

20:08 * anholt says the person who's been yakshaving fixing ancient gl drivers for a few weeks.

20:20 pnowack has joined #dri-devel

20:21 flto has quit [Quit: Leaving]

20:21 ngcortes has joined #dri-devel

20:23 LexSfX has quit []

20:26 <alyssa> remind me what the goal of the delete tgsi effort was again?

20:26 <alyssa> "deleting tgsi"

20:26 <alyssa> ah right of course

20:29 LexSfX has joined #dri-devel

20:30 ppascher has joined #dri-devel

20:32 mvlad has quit [Remote host closed the connection]

20:37 flto has joined #dri-devel

20:49 <gawin> anholt: if some opcodes don't like unused channels (for example "if tmp[0].x___" ), then it's best to handle this at first step (blacklist at pass which is generating "unused") or last step when feeding to hardware?

20:49 <anholt> gawin: I don't follow what you mean there

20:50 <gawin> "if tmp[0].xxxx" is ok, "if tmp[0].x___" is not

20:51 <anholt> I would think that one would want to have if tmp[0].x___ after TGSI generation, and then expand that first channel out when emitting the actual shader code.

20:51 <alyssa> ^^

20:51 <alyssa> if (vector) doesn't make any sense

21:09 mbrost has quit [Ping timeout: 480 seconds]

21:09 <alyssa> idr: much less ridiculous, I wonder if ineg(b2iN) -> b2bN helps anything

21:12 <idr> alyssa: That is another rabbit hole. :)

21:13 <idr> I don't think I actually tried that, but I believe the lack of algebraic optimizations for b32 would be problematic.

21:14 <idr> I suggested some things along those lines at the last f2f XDC, but I was told it was a stupid idea.

21:15 <cwabbott> alyssa: I just tried, and apparently that test takes 0.8 seconds on a release build of freedreno

21:15 <cwabbott> perks of SSA-based register allocation?

21:15 <anholt> cwabbott: which cts are you on?

21:15 <cwabbott> uhh

21:15 <cwabbott> 1.2.8

21:16 <anholt> that's got the cts fix that makes it fast :)

21:16 <cwabbott> whoops :9

21:16 <cwabbott> :(

21:16 <anholt> it was slow in the middle of nir, last I remember looking at it

21:17 <alyssa> idr: Ah well yes most of these are stupid ideas I'm just trying to get a feel for what our hw can do

21:17 <cwabbott> was that the one of the ones that hit quadraticness in variable copy propagation?

21:17 <anholt> I don't have an issue filed for it, so I'm thinking it was more of a papercuts

21:17 <alyssa> ..isn't copy prop linear in SSA?

21:17 <anholt> but I sure would love to see the copy propagation fixed somehow

21:17 <alyssa> oh, /variable/. right

21:20 mbrost has joined #dri-devel

21:35 pendingchaos has quit [Ping timeout: 480 seconds]

21:42 pendingchaos has joined #dri-devel

21:43 gouchi has quit [Remote host closed the connection]

21:45 jewins has quit [Ping timeout: 480 seconds]

21:47 LaserEyess has quit [Quit: fugg]

21:47 LaserEyess has joined #dri-devel

21:49 <alyssa> dschuermann: phi(ult(u16vec2, u16vec2), ult(u8vec2, u8vec2))

21:49 <alyssa> this expression spells doom for any of my "do booleans good" plans for mali

21:50 <alyssa> there does not exist a canonical representation for that boolean

21:50 <alyssa> vector

21:51 <alyssa> well, the real nail in the coffin is

21:51 <alyssa> phi(ult(u16vec2, u16vec2), ult(u8vec2, u8vec2)).y

21:51 <alyssa> is that byte 1 or byte 2?

21:51 <alyssa> well.. it depends which side of the branch you were on...

21:53 <alyssa> Maybe the real answer here is swallow my pride and use lower_bools_to_bitsize

21:53 <alyssa> instead of trying to do this in the backend

21:55 <alyssa> Incidentally, flt8/fneu8/fge8/feq8 are nonsensical.

21:55 <alyssa> I mean they're well defined but. Seem to be there mostly by accident

21:55 <alyssa> likewise b8all_fequal

21:59 <alyssa> ugh am I really going to rewrite my entire series, again?

21:59 <alyssa> Maybe. Maybe I am.

22:01 <alyssa> I think it's the sane choice after everything. Sigh

22:01 <alyssa> Oh, and I'm even one of the Reviewed-by tags on the pass! So I can't claim I didn't know better :-p

22:03 rasterman has quit [Quit: Gettin' stinky!]

22:06 Haaninjo has quit [Quit: Ex-Chat]

22:08 Duke`` has quit [Ping timeout: 480 seconds]

22:09 heat has quit [Remote host closed the connection]

22:10 <austriancoder> alyssa: that's normal.. from time to time I am not sure how to do things the right way and/or I can not remember that I worked on a topic weeks/months ago until I see my name on the change

22:12 <alyssa> austriancoder: I guess I forgot about lower_bool_to_bitsize because nobody uses it.

22:12 <austriancoder> It feels like being a complete newbee (to everything)

22:12 <alyssa> Ok, technically etna uses it, but I'm pretty sure that was a tpyo and you meant to use lower_bool_to_int32 :-p

22:12 <austriancoder> alyssa: etnaviv uses it.. I think

22:12 <alyssa> does etna support fp16?

22:13 <austriancoder> no

22:13 <alyssa> then you meant to type int32

22:13 <alyssa> though the two passes are identical if you don't support fp16 (yet)

22:14 <dschuermann> alyssa with 1-bit bools that would be entirely aweful... accessing every second bit? or accessing the upper half? we already have to emit a bunch or AND/OR to do the normal boolean phis

22:20 <austriancoder> alyssa: maybe .. it fixed a problem so I used it.

22:20 NiksDev has quit [Ping timeout: 480 seconds]

22:37 <alyssa> dschuermann: Nod. lower_bool_to_bitsize chews through it just fine, and has code to handle that case.

22:38 <alyssa> critically, lower_bool_to_bitsize runs before we go out of SSA, whereas anything we do in the backend runs after going out of SSA

22:38 <alyssa> after going out of SSA all bets are off to handle those cases

22:38 <alyssa> (I realize ACO doesn't have this problem :-p)

22:40 <dschuermann> :P

22:40 <alyssa> dschuermann: rude :p

22:40 <dschuermann> you'll want to cover all instructions which NIR lowers eventually in the vectorize_cb as well

22:40 <alyssa> cover?

22:41 <dschuermann> take care that they are kept vectorized

22:41 <alyssa> bifrost IR is 100% vectorized for 8/16-bit

22:41 <alyssa> (scalar instructions in NIR are forcibly replicated)

22:42 <alyssa> it's.. kind of annoying actually

22:42 <dschuermann> yeah, you can remove thazlt stuff after the series is merged

22:42 <dschuermann> that*

22:42 <alyssa> Hmm?

22:42 <alyssa> I'm so lost

22:44 <dschuermann> ok slowly: you will be able to scalarize these instructions in NIR, so you don't have to do that in the backend

22:44 <alyssa> other way around

22:44 <alyssa> Bifrost /does not/ support scalar 16-bit operation

22:44 <dschuermann> ohhh :D

22:44 Mooncairn has joined #dri-devel

22:44 <alyssa> 16-bit instructions are /always/ vec2

22:44 <alyssa> and if NIR asks for a scalar we just throw away the upper half

22:45 <dschuermann> even transcendentals?

22:45 soreau has quit [Remote host closed the connection]

22:45 <alyssa> uhhh

22:45 <alyssa> except transcendentals

22:45 * alyssa whistles

22:45 <dschuermann> yeah, so you can scalarize these ;)

22:45 <alyssa> we already can

22:45 soreau has joined #dri-devel

22:46 <dschuermann> not without breaking the vectorization chain

22:46 <alyssa> lower_alu_to_scalar + vectorize filter does it for us

22:46 <alyssa> Ah, well, sure

22:46 <alyssa> so I can expect moar fps with the series then?

22:46 <alyssa> but not any different backend code :)

22:46 <dschuermann> oh, didn't think about that

22:46 <dschuermann> so, you vectorize everything and lower some instructions afterwards?

22:47 <alyssa> scalarize everything and vectorize some afterwards

22:47 <alyssa> (in NIR)

22:47 <alyssa> this is stupid but it's the best we can do until the smart lower_to_scalar is merged

22:47 <dschuermann> yeah, so my experience was that it's better to keep the instructions vectorized

22:47 <alyssa> yes

22:48 <alyssa> hence why I nerdsniped you into the smart lower_to_scalar patches

22:48 <alyssa> and then made you think they were your idea?

22:48 <alyssa> :_p

22:48 <alyssa> ("I don't remember that part." "Hmmm...")

22:48 <dschuermann> haha :P

22:48 <dschuermann> you should see the AMD FSR shaders with that series

22:48 <dschuermann> pure gold

22:48 Mooncairn has left #dri-devel [Leaving]

22:49 <alyssa> fsr?

22:50 <dschuermann> fidelityFX(tm) upscaling stuff

22:57 iive has quit []

23:24 <mareko> does anybody want GL_OVR_multiview?

23:31 <HdkR> According to my logs there has been sparse interest periodically. With the idea that most users have moved to Vulkan

23:33 mbrost has quit [Ping timeout: 480 seconds]

23:37 pcercuei has quit [Quit: dodo]

23:45 stuartsummers has quit []

23:47 karolherbst_ has quit [Quit: Konversation terminated!]

23:48 <alyssa> why is fcsel_gt a thing

23:49 <alyssa> Added for r600 I guess

23:49 <anholt> nir_to_tgsi would love to have it, too. but it needs it without icsel_gt :/

23:49 <alyssa> not sure why you wouldn't do that fuse in the backend, though?

23:49 <alyssa> What's the benefit of doing it in NIR?

23:50 <alyssa> I guess opt_algebraic is nice

23:50 <anholt> probably the instruction selection ease. having written comparison-walking code like that several times, it's no fun.

23:50 <alyssa> yeah, I get that

23:51 rasterman has joined #dri-devel

23:51 <alyssa> my views on this have shifted over time ... it's inevitable that mature backends write that code themselves since every hw has slightly different instructions available

23:51 <alyssa> (whether manually walking or something like noltis)

23:51 <idr> anholt, alyssa: I have a branch somewhere the splits the float and integer csel cases and adds the missing comparisons.

23:52 <alyssa> but for young backends (or backends for extremely simple hardware), it is quite tempting to just map 1:1 to NIR

23:52 <alyssa> (if "only" NIR supported this one extra instruction.. and this one.. and..)

23:53 <anholt> I'm hoping the ntt RA branch takes some pressure off of the "make nir ops to match all tgsi ops"

23:55 <alyssa> to be fair with the ntt RA branch, ntt becomes a compiler in its own right (as opposed to a NIR translator)

23:56 <alyssa> so if ntt grows an optimizer in the mean time... meh

23:58 <alyssa> dschuermann: convinced me good-enough linear time backend isel doesn't have to be hard or require augmenting the IR if you have SSA

23:58 <alyssa> not saying it's the way to go for ntt, I'm just a lot less scared of not doing everything in NIR

23:58 <alyssa> ^than I was when bringing up my first NIR backend