<alyssa>
I have no idea if these patterns show up in any real shaders
<alyssa>
But I'm revisiting boolean representations in the Bifrost/Valhall backend
camus1 has joined #dri-devel
<alyssa>
The hardware supports 0/1, 0/~0, and 0/1.0f representations.
ahajda has quit []
<alyssa>
Mapping the mess of conversions we get from NIR to efficient such code is nontrivial.
<alyssa>
I am not interested in bringing up NOLTIS for this :-p
camus has quit [Ping timeout: 480 seconds]
<alyssa>
At a local level, it's easy enough to fuse conversions with comparisons, i.e.
<alyssa>
b2f32(flt(x, y)) -> flt.f(x, y)
<alyssa>
That's not enough for optimal handling globally
<alyssa>
But probably any such global optimization belongs in nir_opt_algebraic and not a backend pass that's too clever for its own good
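[Sketch: a minimal example of the local fusion mentioned above, written as a backend-side check over NIR. The emit_fcmp_f32() call in the comment is hypothetical scaffolding for whatever the backend's emitter looks like; only the pattern matching uses real NIR API, and it assumes the shader is still in SSA form.]
    #include "nir.h"

    /* Returns true if this b2f32 can be fused into its source comparison,
     * i.e. b2f32(flt(x, y)) -> a single compare with a 0.0/1.0 result. */
    static bool
    try_fuse_b2f32_of_fcmp(nir_alu_instr *alu)
    {
       if (alu->op != nir_op_b2f32)
          return false;

       /* Assumes SSA, so the source has a defining instruction to peek at. */
       nir_instr *parent = alu->src[0].src.ssa->parent_instr;
       if (parent->type != nir_instr_type_alu)
          return false;

       nir_alu_instr *cmp = nir_instr_as_alu(parent);
       switch (cmp->op) {
       case nir_op_flt:
       case nir_op_fge:
       case nir_op_feq:
       case nir_op_fneu:
          /* e.g. emit_fcmp_f32(cmp->op, cmp->src[0], cmp->src[1]); */
          return true;
       default:
          return false;
       }
    }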
<alyssa>
Of course, there are even more obscure/obnoxious patterns our hardware allows for booleans..
<gawin>
anholt: can you try to run glsl-fs-reflect from "gpu" on the last MR? (branch r300-fixes) Maybe this also helps with the test I'm debugging now
<alyssa>
Here's a delightfully obscure one...
<alyssa>
Suppose X and Y are fp16 vec2.
<anholt>
gawin: sorry, time for me to be done for today. I was just trying to get a stable baseline so I could test the NTT RA MR on r300
<alyssa>
Recall on Bifrost, fp16 vec2 is packed into a 32-bit register and provides vectorized ALU, like GCN
<alyssa>
`all(greaterThan(X, Y)) ? t : f`
<alyssa>
At first blush this involves a lot of boolean representation conversions -- iand of 16-bit bools, zero extend that to a 32-bit bool, and feed that into the MUX.i32 instruction
<alyssa>
but actually, I claim that can be just two instructions
<alyssa>
FCMP.v2f16.i1 temp, X, Y
<alyssa>
MUX.v2i16 f, t, temp
<alyssa>
[Correction: `all` should have been `any`, and iand should have been ior]
<alyssa>
[Correction 2: MUX.i32]
<alyssa>
[MUX.i32 = roughly NIR bcsel]
columbarius has joined #dri-devel
<alyssa>
What's happening? The comparison results in a vector of 16-bit booleans packed into a 32-bit register
<alyssa>
That register is truthy if /either/ boolean is true, and falsey (0) if /both/ booleans are false.
<alyssa>
So we don't need an actual ior+zero-extend, we just reinterpret the v2b16 as a b32... but that b32 isn't even valid (0x10001 as a truth value, possibly)
<alyssa>
but it's still ... correct
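[Sketch: a standalone C check of the reinterpretation argument above -- a packed pair of per-lane booleans is nonzero as a 32-bit word exactly when at least one lane is true, which is all a truthy MUX.i32/bcsel condition needs. Lanes are shown as 0/1, giving values like the 0x10001 mentioned above; the same holds for 0/0xFFFF lanes.]
    #include <assert.h>
    #include <stdint.h>

    int
    main(void)
    {
       for (int lo = 0; lo <= 1; lo++) {
          for (int hi = 0; hi <= 1; hi++) {
             /* The vec2 comparison writes one boolean per 16-bit lane. */
             uint32_t packed = (lo ? 0x00000001u : 0u) |
                               (hi ? 0x00010000u : 0u);
             /* any(lanes) is equivalent to the packed word being nonzero. */
             assert((packed != 0) == (lo || hi));
          }
       }
       return 0;
    }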
co1umbarius has quit [Ping timeout: 480 seconds]
<alyssa>
Anyway. These sorts of pathological tricks are making me give up on trying to do optimal boolean handling since that's looking to be AI-hard :-p
<alyssa>
At least the old version I have of the Arm compiler doesn't do that silliness.
<alyssa>
...Cute. They implement b2f32 with i1 as U8_TO_F32, one instruction
<alyssa>
and I assume b2f16 as V2U8_TO_V2F16 which is vectorized, wow! :-p
<alyssa>
So maybe the real answer is "do as much optimization in NIR as we can, then do as much peephole optimization as we can to fuse stuff, and then just fallback to 0/1 booleans which are usually fast."
ngcortes has quit [Remote host closed the connection]
<alyssa>
b2b32 i suppose is just ineg
<alyssa>
b2i* is just zero extend
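[Sketch: a tiny C sanity check of the two claims above, assuming the 1-bit boolean is held as 0/1: zero-extension gives the 0/1 integer (b2i32), and integer negation of that gives the 0/~0 32-bit boolean (b2b32).]
    #include <assert.h>
    #include <stdint.h>

    int
    main(void)
    {
       for (uint32_t b = 0; b <= 1; b++) {
          uint32_t b2i32 = b;       /* zero extend: 0 or 1  */
          uint32_t b2b32 = -b2i32;  /* ineg:        0 or ~0 */
          assert(b2i32 == (b ? 1u : 0u));
          assert(b2b32 == (b ? ~0u : 0u));
       }
       return 0;
    }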
<gawin>
anholt: also helps, noice
<alyssa>
dschuermann: All of this depends on being able to vectorize comparisons of 16-bit floats, which nir_opt_vectorize can't do in packed vec2 mode...
Company has quit [Quit: Leaving]
alatiera has joined #dri-devel
<jekstrand>
daniels: I like tessellation shaders and all but no.
<jekstrand>
daniels: Also, pretty sure that I wouldn't need to build anything super-funky to get tessellation shaders. I remember the days of building wayland/wayland-protocols/weston/xwayland/whatever and don't want to go back.
<jekstrand>
Why do you think I don't do WSI anymore?
nchery has quit [Ping timeout: 480 seconds]
<airlied>
WSI will come and find you :-P
heat_ has quit []
heat has joined #dri-devel
heat has quit []
heat has joined #dri-devel
camus has joined #dri-devel
camus1 has quit [Remote host closed the connection]
alanc has joined #dri-devel
gawin has quit [Ping timeout: 480 seconds]
camus1 has joined #dri-devel
<ccr>
sounds like a premise for a horror movie/story
fxkamd has quit []
camus has quit [Ping timeout: 480 seconds]
JohnnyonFlame has quit [Ping timeout: 480 seconds]
sagar_ has quit [Remote host closed the connection]
sagar_ has joined #dri-devel
remexre has quit [Remote host closed the connection]
remexre has joined #dri-devel
YuGiOhJCJ has joined #dri-devel
aravind has joined #dri-devel
YuGiOhJCJ has quit [Remote host closed the connection]
YuGiOhJCJ has joined #dri-devel
NiksDev has joined #dri-devel
Duke`` has joined #dri-devel
<jekstrand>
airlied: Yeah, I know. "You can run but you can't hide" and all that. But I can run really fast. :-P
<jekstrand>
Maybe I should go work on wifi... No window systems there...
sdutt has quit [Remote host closed the connection]
<HdkR>
jekstrand: Aperture windows are a type of window system.
mbrost has quit [Read error: Connection reset by peer]
itoral has joined #dri-devel
<idr>
alyssa: I think the Intel compiler does some optimizations like that in the backend. In the b2f(inot(x)) case, we can use an integer addition that writes a floating-point destination.
<idr>
It becomes a single instruction.
<idr>
Like, 'add r16F, -r12D, 1D'
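[Sketch: the arithmetic identity behind that single-add trick, checked in plain C under the assumption that the boolean source is held as 0/1 (matching the negated D-type source in the example); with 0/~0 booleans the same add works without the source negation. This is just the math, not Intel backend code.]
    #include <assert.h>
    #include <stdint.h>

    int
    main(void)
    {
       for (int32_t x = 0; x <= 1; x++) {
          float b2f_inot = (x == 0) ? 1.0f : 0.0f; /* b2f(inot(x))          */
          float fused    = (float)(-x + 1);        /* add dst.F, -src.D, 1D */
          assert(b2f_inot == fused);
       }
       return 0;
    }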
thellstrom1 has quit [Remote host closed the connection]
<idr>
I might have some branches that try to do some of those other things, but I don't think any of them produced any interesting results.
thellstrom has joined #dri-devel
idr has quit [Quit: Leaving]
Wally has joined #dri-devel
Duke`` has quit [Ping timeout: 480 seconds]
lemonzest has joined #dri-devel
<krh>
daniels: thanks for the compliment! on the other hand, I think I missed an opportunity for doing decorations in the style of weston-flower
<cworth>
Hey, it's krh and idr (or at least it was idr...). Anyway, long time no see!
<krh>
cworth: hi!
gouchi has joined #dri-devel
gouchi has quit []
tango_ is now known as Guest10331
tango_ has joined #dri-devel
Wally has quit [Remote host closed the connection]
heat has quit [Ping timeout: 480 seconds]
Guest10331 has quit [Ping timeout: 480 seconds]
mlankhorst has joined #dri-devel
<dschuermann>
alyssa: amd doesn't have packed comparisons (but real single bit bools). feel free to add it to the vectorizer, though :P
<dschuermann>
for the algebraic optimizations, see if you can find any affected application. I could also have a look if you send me a branch :)
<bbrezillon>
"Such events must be signaled by the application using vkSetEvent, and the vkCmdWaitEvents commands that wait upon them must not be inside a render pass instance. The event must be set before the vkCmdWaitEvents command is executed.", does that mean any CPU-signaled event should be signaled before the command buffers waiting on it are submitted to the queue?
<bbrezillon>
for the record, the panvk implementation currently assumes all CPU-signaled events are signaled before the submission of command buffers waiting on these events, and this test fails on panvk, but I'm not sure if it's the driver that's at fault or the test
mvlad has joined #dri-devel
tango_ is now known as Guest10340
tango_ has joined #dri-devel
camus has joined #dri-devel
thellstrom has quit [Quit: thellstrom]
thellstrom has joined #dri-devel
pnowack has joined #dri-devel
camus1 has quit [Ping timeout: 480 seconds]
Guest10340 has quit [Ping timeout: 480 seconds]
gawin has joined #dri-devel
camus1 has joined #dri-devel
pnowack has quit [Quit: pnowack]
pnowack has joined #dri-devel
camus has quit [Ping timeout: 480 seconds]
YuGiOhJCJ has quit [Quit: YuGiOhJCJ]
rasterman has joined #dri-devel
Major_Biscuit has joined #dri-devel
MajorBiscuit has quit [Ping timeout: 480 seconds]
<bnieuwenhuizen>
bbrezillon: the existence of this VU suggests that assumption is wrong: "If pEvents includes one or more events that will be signaled by vkSetEvent after commandBuffer has been submitted to a queue, then vkCmdWaitEvents must not be called inside a render pass instance"
<bnieuwenhuizen>
(on vkCmdWaitEvents)
<bnieuwenhuizen>
I do vaguely remember the behavior being tightened a bit at one point to avoid enqueuing multiple cmdbuffers before setting an event for the first cmdbuffer, but not sure offhand how it was tightened
agd5f has quit [Ping timeout: 480 seconds]
<bbrezillon>
bnieuwenhuizen: ok, I guess I mistook 'executed' for 'queued' in the "The event must be set before the vkCmdWaitEvents command is executed" sentence
<bbrezillon>
but that also greatly complicates the implementation, since there's no native support for GPU-side memory-value polling in Midgard/Bifrost GPUs
<bnieuwenhuizen>
hmm, missed that one
<bnieuwenhuizen>
kinda hard to see how an app can reliably trigger it before execution but after queueing
<bnieuwenhuizen>
bbrezillon: where is that in the spec?
<bbrezillon>
I thought we could use syncobjs (what the current implementation does), but that doesn't work unless we instantiate a background thread to delay queuing until the event is signaled/set
<bnieuwenhuizen>
bbrezillon: ok, then I suspect the test is wrong now
flacks has quit [Quit: Quitter]
flacks has joined #dri-devel
<bbrezillon>
bnieuwenhuizen: ok. The wording is a bit ambiguous though, and the fact that the deqp test delays the event signaling on purpose led me to think that this was a valid case, but maybe this part of the spec was added afterwards
<bbrezillon>
but it seems to be backported to all spec revisions
<cwabbott>
bbrezillon: aren't syncobjs a kernel-level sort of thing? so are you splitting up submissions with events in them? that would be bad
<bbrezillon>
yep
<cwabbott>
events are supposed to be used as a more flexible alternative to pipeline barriers
<cwabbott>
so intra-command-buffer signalling
<bbrezillon>
and that's certainly inefficient
<bnieuwenhuizen>
ah my guess would've been that you track which events are set in the same cmdbuffer and only use syncobjs cross-cmdbuffer
<cwabbott>
I think on mali events are "supposed" to map to an empty job
<bbrezillon>
empty job?
<cwabbott>
just a job that does nothing but adds dependencies
<bbrezillon>
I mean, yes, it can serve as a dependency
<bbrezillon>
but then you need to live patch the job to CPU-signal the event
<cwabbott>
but by the time the job "executes" it's already been signalled
<bbrezillon>
I was considering adding a compute job that polls a value in memory
<cwabbott>
not 100% sure that works
<bbrezillon>
oh, ok, so that doesn't really address the CPU-signaling issue I was reporting, but it makes things more efficient by avoiding a job chain split
<cwabbott>
but at least for simple things where the signalling and waiting happens in the same command buffer, I think letting the JS handle the dependencies directly would be best
<cwabbott>
you might need to fall back to the compute shader if it's not signalled earlier in the same command buffer
<bbrezillon>
absolutely, and we'll probably revise the implementation when we get a more or less functional 1.0 vulkan driver, but we wanted to keep things simple at first
<dj-death>
bbrezillon: you have only one VkQueue in your implementation?
<bbrezillon>
yes
<cwabbott>
just be aware that you'll probably have a bunch of infrastructure for splitting up submissions that you're going to need to nuke
<cwabbott>
and any time spent fixing bugs etc. there is going to be wasted time
<dj-death>
bbrezillon: cool, so you don't have to worry about vkCmdSetEvent() ;)
<cwabbott>
I'd rather go with the compute shader thing, since that's simpler and you're going to need it as a backup even with the more-optimal solution so it's not throw-away work
<dj-death>
if your HW supports preemption, you can rely on that to get some kind of forward progress for other work
<bbrezillon>
cwabbott: yeah, I'm well aware of that, on the other hand, we already need to split batches when there's too many jobs in the cmd buffer (16k jobs IIRC), so part of the batch splitting logic is needed anyway
<bbrezillon>
and I suspect we need to keep syncobjs for event wait/set happening at the batch boundary anyway
<cwabbott>
events explicitly aren't supposed to map to syncobjs
<bbrezillon>
dj-death: it does, it's just not hooked-up kernel side
<cwabbott>
semaphores are the vulkan equivalent for syncobjs
<dj-death>
and fences
<bbrezillon>
right, events are supposed to be signaled before the job is scheduled
<cwabbott>
ah yeah, fences map to syncobjs too
<dj-death>
bbrezillon: I don't think the events are supposed to be signaled before the job runs
<dj-death>
can be later
<cwabbott>
also, I thought you'd only submit 1 job per command buffer
<cwabbott>
not sure why you'd need to split it up
<dj-death>
bbrezillon: which is why you would need to spin in a compute shader
<bbrezillon>
well, yes, one vertex/fragment job pair, that's the case most of the time, but the number of jobs per chain is limited
<cwabbott>
a command buffer should compile into a linked list of jobs
<bbrezillon>
by the job id field
<cwabbott>
oh, right, ok
<cwabbott>
too bad :(
dreda has quit [Quit: Reconnecting]
<bbrezillon>
and there's already one deqp test that reaches the limit
dreda has joined #dri-devel
<cwabbott>
fun fun
dreda has quit []
dreda has joined #dri-devel
<bbrezillon>
dj-death: yeah, I meant, for anything that's GPU-signaled, it should happen before the job is scheduled
dreda is now known as Guest10360
<dj-death>
bbrezillon: yeah, depends on how many levels of scheduling you have in HW ;)
<bbrezillon>
1 :)
<dj-death>
bbrezillon: if it's just the kernel, yep
<bbrezillon>
so, yes, in my case it's simple
<dj-death>
ish ;)
<cwabbott>
anyway, on every other HW impl an event is just a piece of memory which is written to and polled, so if you can do something like that you'll be much less likely to break other people's expectations
Guest10360 is now known as dreda
<cwabbott>
if you can figure out what command signals the event, you can just add it as a dependency instead of polling, of course
<dj-death>
on our HW the command streamer reads the value in memory and, if it's not one, raises an interrupt to tell i915 (now the GuC) that it's blocked
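[Sketch: a host-side C11 analogy of the model described above -- the event is just a word in memory, the setter writes it, the waiter polls it. The names are made up; this is not GPU or driver code.]
    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint32_t event_word;

    static void
    set_event(void)   /* what a WRITE_VALUE job / CmdSetEvent would do */
    {
       atomic_store_explicit(&event_word, 1, memory_order_release);
    }

    static void
    wait_event(void)  /* what a polling compute job would do */
    {
       while (atomic_load_explicit(&event_word, memory_order_acquire) != 1)
          ; /* spin -- the "monopolize a core" concern raised just below */
    }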
<bbrezillon>
that's what I wanted to do, but then I realized it could monopolize a GPU core for an unbounded amount of time (which can be bounded if we add some sort of timeout in the compute shader), and I thought we could avoid that
<bbrezillon>
tomeu: ^
<dj-death>
then that scheduler can do something else
<bbrezillon>
cwabbott: I can write a value with a WRITE_VALUE job (so no compute shader needed for CmdSetEvent())
<cwabbott>
right
<bbrezillon>
it's the polling that's problematic
<dj-death>
HSW is the HW where we would need to spin, except we can't really preempt (iirc) so it sucks :)
<cwabbott>
I think the best you can do there is the compute shader thing
<cwabbott>
the point of an event is to act like a "separated barrier" where the waiting for earlier commands and signalling later commands is split up, so you can run other things in between
<cwabbott>
but I don't think you'd run more than one or two things in between
<cwabbott>
so ping-ponging to the kernel might really suck
<bbrezillon>
yeah, I think that's part of the debate we had with tomeu back when he added event support to panvk
<bbrezillon>
none of us were entirely happy with the syncobj approach, but it seemed simple enough to give it a try and get this feature implemented, knowing we'd have to rework it at some point
<bbrezillon>
oh and I was worried about the compute job solution monopolizing a core, I admit. But given we already have a timeout for the entire job chain, maybe it's not such a big deal
<cwabbott>
dj-death: is the scheduler really capable of running other stuff in parallel with the command buffer with the event? otherwise that would be pointless
<dj-death>
cwabbott: yeah, on gen8+ the scheduler just swaps the current context for something else (if available)
<cwabbott>
so you can have multiple contexts active at the same time?
<cwabbott>
you can only wait after you signal events, so the current context is always doing something while you wait for the event
<dj-death>
we only have a single context active at a time atm
<cwabbott>
so then it's pointless
<dj-death>
because there is only one engine
<dj-death>
you can always put a different application on the GPU while the current one is blocked on an unsignaled event
<cwabbott>
you can't though, because by the time the current context is done and you can switch to the next context, your event is already signalled and it's pointless
<cwabbott>
you're not actually saving anything
<dj-death>
I don't quite follow :(
<dj-death>
if the VkEvent is going to be signaled from the CPU right after a sleep(10); call
<dj-death>
you've got 10s of GPU time that can be used for anything else
<cwabbott>
ah, but the event has to be signalled before the command executes
<cwabbott>
if it's signalled on the CPU
lemonzest has quit [Quit: WeeChat 3.3]
<cwabbott>
as per the text bbrezillon pointed out
<cwabbott>
because it explicitly isn't designed for that use-case
lanodan_ has joined #dri-devel
<dj-death>
cwabbott: I'm not sure this is right
<dj-death>
cwabbott: just because of this VU :
<dj-death>
"If pEvents includes one or more events that will be signaled by vkSetEvent after commandBuffer has been submitted to a queue, then vkCmdWaitEvents must not be called inside a render pass instance
<dj-death>
"
<bnieuwenhuizen>
dj-death: might be stale, because as pointed out the requirement about being signalled on the CPU before execution was added later
lanodan has quit [Ping timeout: 480 seconds]
lemonzest has joined #dri-devel
<dj-death>
I thought that only applied to GPU signaled events
<bnieuwenhuizen>
"Such events must be signaled by the application using vkSetEvent, and the vkCmdWaitEvents commands that wait upon them must not be inside a render pass instance. The event must be set before the vkCmdWaitEvents command is executed."
<bnieuwenhuizen>
it talking about vkSetEvent makes it about CPU signaled in my mind
nchery has joined #dri-devel
<bnieuwenhuizen>
(especially with the earlier sentence "Command buffers in the submission can include vkCmdWaitEvents commands that wait on events that will not be signaled by earlier commands in the queue.")
<dj-death>
I don't know :)
<dj-death>
the test bbrezillon pointed to seems to do exactly that
<dj-death>
(signal from host after submission)
<cwabbott>
the test author maybe made the same mistake you did, or the test was added before that spec text was added
<cwabbott>
I think "split pipeline barrier" is what events were always intended to be, but I guess at the beginning the spec wasn't super-clear on what was allowed
* dj-death
files a khronos issue
<bnieuwenhuizen>
yeah I think the spec was changed because before the spec-change the spec allowed submitting an unlimited number of cmdbuffers before signalling the event
<bbrezillon>
cwabbott: correction, the maximum number of jobs per chain is 64k-1 (16bit field), not 16k, given a draw is formed by a vertex+tiler job pair, it's limited to 32k draws
<bbrezillon>
anyway, I'll add this vkevent rework to my TODO-list
<bbrezillon>
bnieuwenhuizen, cwabbott, dj-death: thanks for chiming in
<dj-death>
sure
<dj-death>
issue 2971 in khronos if anybody's interested
Peste_Bubonica has joined #dri-devel
itoral has quit [Remote host closed the connection]
mwalle has joined #dri-devel
mlankhorst has quit [Remote host closed the connection]
<jenatali>
What's even the point of having CPU signals for VkEvents if they need to be signaled before the command buffer is submitted?
<jenatali>
Or, I guess, if it's before the command is executed, but I don't see how you could do CPU interop such that the command buffer starts executing, the CPU follows along, and signals the event before the wait is reached... but still, what's the point? The wait will always be a no-op if it's for a CPU-signaled event
<dj-death>
hopefully that'll be clarified next week ;)
<dj-death>
I don't find much point either :(
sdutt has joined #dri-devel
<bnieuwenhuizen>
jenatali: I guess the thing that is still enabled by this is having an event that can either be set by the GPU or the CPU
<bnieuwenhuizen>
the GPU case obviously being the useful one, but if you don't want to do e.g. predecessor work at some point you can set it from the CPU
<bnieuwenhuizen>
pretty contrived of course
<jenatali>
Oh, I see. The ability to record a command buffer with a wait without having to care whether there will be a signaler submitted before it. If not, the CPU can just signal it to make it a no-op. I see
<alyssa>
dschuermann: Alright, I assumed fp16vec2 meant packed comparisons. Probably trivial to add to the vectorizer but that means /more/ flags. So maybe should wait until the generic scalarization callbacks are a thing..
<bnieuwenhuizen>
yeah. Pretty sure it is not what was originally envisioned though :)
Peste_Bubonica has quit [Quit: Leaving]
nchery has quit [Ping timeout: 480 seconds]
aravind has quit []
nchery has joined #dri-devel
<dschuermann>
alyssa: with the callback function, it would be trivial. I'll see about updating the MR
gawin has quit [Ping timeout: 480 seconds]
<alyssa>
dschuermann: ack
<alyssa>
in theory Mali also wants int8vec4
Daanct12 has joined #dri-devel
<alyssa>
which again trivial with callback but not otherwise
<alyssa>
(and that has all the same issues with OpenCL -- 8-bit vec16 should be partially scalarized to 4 vec4 ops)
fxkamd has joined #dri-devel
ella-0 has joined #dri-devel
ella-0_ has quit [Remote host closed the connection]
Danct12 has quit [Ping timeout: 480 seconds]
Daaanct12 has joined #dri-devel
Daaanct12 has quit [Remote host closed the connection]
Daaanct12 has joined #dri-devel
sdutt has quit []
sdutt has joined #dri-devel
Daanct12 has quit [Ping timeout: 480 seconds]
nchery has quit [Ping timeout: 480 seconds]
nchery has joined #dri-devel
Haaninjo has joined #dri-devel
remexre has quit [Ping timeout: 480 seconds]
bluebugs has quit [Read error: Connection reset by peer]
bluebugs has joined #dri-devel
<dschuermann>
alyssa: regarding swizzles outside of vec width (e.g. iadd .xz, .xz): I'm not entirely sure if I should just avoid creating these or also care about lowering if they already exist
<dschuermann>
I did encounter some iadd .xzyy thing. So, when lowering, I could either create them temporarily and lower further (which has the advantage of also catching the case where it already exists in the source)
mbrost has joined #dri-devel
<dschuermann>
or I could just avoid to create them which has the advantage of potentially better vectorization (e.g. iadd .xzzx could keep the middle channels, but that probably never happens)
karolherbst has quit [Remote host closed the connection]
Danct12 has joined #dri-devel
gawin has joined #dri-devel
<jekstrand>
bbrezillon: That's an especially thorny corner.
<jekstrand>
Early on, there were CTS tests that would vkQueueSubmit(), wait 1s or so, and then vkSetEvent(). Those have been deleted from the CTS.
<jekstrand>
These days, I believe it's the responsibility of the client to call vkSetEvent() before the vkCmdWaitEvents() executes. So you can vkQueueSubmit() with a wait on a timeline semaphore and then vkSetEvent() followed by vkSignalSemaphore() to kick it off but that's the only sort of wait-before-signal you can get.
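[Sketch: roughly what that wait-before-signal pattern could look like at the API level, assuming device/queue/cmdbuf/event/timeline are valid handles created elsewhere and that cmdbuf contains the vkCmdWaitEvents. It illustrates the ordering only; it is not a complete program.]
    #include <vulkan/vulkan.h>

    static void
    submit_wait_before_signal(VkDevice device, VkQueue queue,
                              VkCommandBuffer cmdbuf, VkEvent event,
                              VkSemaphore timeline)
    {
       const uint64_t wait_value = 1;
       const VkPipelineStageFlags wait_stage =
          VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;

       const VkTimelineSemaphoreSubmitInfo timeline_info = {
          .sType = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO,
          .waitSemaphoreValueCount = 1,
          .pWaitSemaphoreValues = &wait_value,
       };
       const VkSubmitInfo submit = {
          .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
          .pNext = &timeline_info,
          .waitSemaphoreCount = 1,
          .pWaitSemaphores = &timeline,
          .pWaitDstStageMask = &wait_stage,
          .commandBufferCount = 1,
          .pCommandBuffers = &cmdbuf,
       };
       vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);

       /* The command buffer can't start executing until the timeline hits
        * 1, so the event is set before vkCmdWaitEvents executes. */
       vkSetEvent(device, event);

       const VkSemaphoreSignalInfo signal_info = {
          .sType = VK_STRUCTURE_TYPE_SEMAPHORE_SIGNAL_INFO,
          .semaphore = timeline,
          .value = wait_value,
       };
       vkSignalSemaphore(device, &signal_info);
    }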
<jekstrand>
Sorry if that was already covered. Didn't read the full scroll-back.
* jekstrand
heads out to pick up his groceries.
<bnieuwenhuizen>
jekstrand: the spec change was made but CTS test not removed AFAIU
Daanct12 has quit [Ping timeout: 480 seconds]
<airlied>
bnieuwenhuizen: b45f4268074897cb4ee7da4b81a17b310301d77b removed some
Major_Biscuit has quit [Ping timeout: 480 seconds]
iive has joined #dri-devel
<dschuermann>
alyssa: the lowering passes revisit the lowered instructions before they continue
<alyssa>
anholt: I don't understand TGSI enough to feel comfortable reviewing, sorry :(
<alyssa>
I think I agree doing RA with tgsi regs and not NIR defs is the right approach, so R-b on the NIR delete if nothing else 😅
ybogdano has joined #dri-devel
Major_Biscuit has joined #dri-devel
rpigott has quit [Read error: Connection reset by peer]
rpigott has joined #dri-devel
Major_Biscuit has quit [Ping timeout: 480 seconds]
idr has joined #dri-devel
<austriancoder>
anholt: I have updated my piglit MR. The plan is to land it, update piglit in mesa, and rebase my arb_shadow stuff on top of it to update the CI passes.
gawin has joined #dri-devel
<anholt>
austriancoder: \o/
xlei_ has joined #dri-devel
mattrope has joined #dri-devel
xlei_ has left #dri-devel [#dri-devel]
<alyssa>
dschuermann: thanks for the patch, here's more work to do :-p
xlei_ has joined #dri-devel
xlei is now known as Guest10400
xlei_ is now known as xlei
Guest10400 has quit [Ping timeout: 480 seconds]
camus has joined #dri-devel
camus1 has quit [Read error: Connection reset by peer]
nchery has quit [Ping timeout: 480 seconds]
nchery has joined #dri-devel
camus1 has joined #dri-devel
camus has quit [Read error: Connection reset by peer]
sagar_ has quit [Ping timeout: 480 seconds]
pnowack has quit [Quit: pnowack]
sagar_ has joined #dri-devel
<alyssa>
anholt: dEQP-GLES31.functional.ssbo.layout.random.all_shared_buffer.36 is so cursed :v
<anholt>
would sure be nice if we could improve its perf some more
<alyssa>
IIRC that hammers RA
<jenatali>
Ugh, I'm getting what I think are coordinate space mismatches with tess between GL and D3D, but I can't quite figure it out
<jenatali>
For some reason in the quad tessellation shader test I'm getting irregular-sized quads
heat has joined #dri-devel
<anholt>
alyssa: wait. I'm not getting slow results on running it singly on radeonsi, crocus, or virgl right now.
<alyssa>
"Limit dimensionality of arrays-of-arrays in random SSBO tests"
<alyssa>
Yeah that'll do it
<alyssa>
anholt: I am conflicted on that commit
<alyssa>
On one hand, that should shave some time off our deqp-gles31 runs in CI
<alyssa>
On the other hand, that test case is the best RA torture test in deqp-gles
<anholt>
spirv_ids_abuse is pretty good too.
<alyssa>
that's vk
<anholt>
well, get on it.
<anholt>
:D
* anholt
says the person who's been yakshaving fixing ancient gl drivers for a few weeks.
pnowack has joined #dri-devel
flto has quit [Quit: Leaving]
ngcortes has joined #dri-devel
LexSfX has quit []
<alyssa>
remind me what the goal of the delete tgsi effort was again?
<alyssa>
"deleting tgsi"
<alyssa>
ah right of course
LexSfX has joined #dri-devel
ppascher has joined #dri-devel
mvlad has quit [Remote host closed the connection]
flto has joined #dri-devel
<gawin>
anholt: if some opcodes don't like unused channels (for example "if tmp[0].x___"), then is it best to handle this at the first step (a blacklist in the pass which is generating the "unused" channels) or at the last step, when feeding to hardware?
<anholt>
gawin: I don't follow what you mean there
<gawin>
"if tmp[0].xxxx" is ok, "if tmp[0].x___" is not
<anholt>
I would think that one would want to have if tmp[0].x___ after TGSI generation, and then expand that first channel out when emitting the actual shader code.
<alyssa>
^^
<alyssa>
if (vector) doesn't make any sense
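[Sketch: one way to do the "expand the first channel out when emitting" step anholt describes, so that something like "if tmp[0].x___" reaches hardware as "if tmp[0].xxxx". The helper name and the used-mask encoding are made up for illustration.]
    #include <stdint.h>

    /* swizzle[] holds a TGSI-style channel selector per destination
     * channel; used_mask bit i means channel i is actually read
     * (e.g. 0x1 for ".x___"). */
    static inline void
    replicate_first_used_channel(uint8_t swizzle[4], uint8_t used_mask)
    {
       uint8_t first = 0;
       for (unsigned i = 0; i < 4; i++) {
          if (used_mask & (1u << i)) {
             first = swizzle[i];
             break;
          }
       }
       for (unsigned i = 0; i < 4; i++) {
          if (!(used_mask & (1u << i)))
             swizzle[i] = first;
       }
    }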
mbrost has quit [Ping timeout: 480 seconds]
<alyssa>
idr: much less ridiculous, I wonder if ineg(b2iN) -> b2bN helps anything
<idr>
alyssa: That is another rabbit hole. :)
<idr>
I don't think I actually tried that, but I believe the lack of algebraic optimizations for b32 would be problematic.
<idr>
I suggested some things along those lines at the last f2f XDC, but I was told it was a stupid idea.
<cwabbott>
alyssa: I just tried, and apparently that test takes 0.8 seconds on a release build of freedreno
<cwabbott>
perks of SSA-based register allocation?
<anholt>
cwabbott: which cts are you on?
<cwabbott>
uhh
<cwabbott>
1.2.8
<anholt>
that's got the cts fix that makes it fast :)
<cwabbott>
whoops :9
<cwabbott>
:(
<anholt>
it was slow in the middle of nir, last I remember looking at it
<alyssa>
idr: Ah well yes most of these are stupid ideas I'm just trying to get a feel for what our hw can do
<cwabbott>
was that one of the ones that hit quadraticness in variable copy propagation?
<anholt>
I don't have an issue filed for it, so I'm thinking it was more a case of papercuts
<alyssa>
..isn't copy prop linear in SSA?
<anholt>
but I sure would love to see the copy propagation fixed somehow
<alyssa>
oh, /variable/. right
mbrost has joined #dri-devel
pendingchaos has quit [Ping timeout: 480 seconds]
pendingchaos has joined #dri-devel
gouchi has quit [Remote host closed the connection]
<alyssa>
well.. it depends which side of the branch you were on...
<alyssa>
Maybe the real answer here is to swallow my pride and use lower_bool_to_bitsize
<alyssa>
instead of trying to do this in the backend
<alyssa>
Incidentally, flt8/fneu8/fge8/feq8 are nonsensical.
<alyssa>
I mean they're well defined but. Seem to be there mostly by accident
<alyssa>
likewise b8all_fequal
<alyssa>
ugh am I really going to rewrite my entire series, again?
<alyssa>
Maybe. Maybe I am.
<alyssa>
I think it's the sane choice after everything. Sigh
<alyssa>
Oh, and I'm even one of the Reviewed-by tags on the pass! So I can't claim I didn't know better :-p
rasterman has quit [Quit: Gettin' stinky!]
Haaninjo has quit [Quit: Ex-Chat]
Duke`` has quit [Ping timeout: 480 seconds]
heat has quit [Remote host closed the connection]
<austriancoder>
alyssa: that's normal.. from time to time I am not sure how to do things the right way and/or I cannot remember that I worked on a topic weeks/months ago until I see my name on the change
<alyssa>
austriancoder: I guess I forgot about lower_bool_to_bitsize because nobody uses it.
<austriancoder>
It feels like being a complete newbie (to everything)
<alyssa>
Ok, technically etna uses it, but I'm pretty sure that was a typo and you meant to use lower_bool_to_int32 :-p
<austriancoder>
alyssa: etnaviv uses it.. I think
<alyssa>
does etna support fp16?
<austriancoder>
no
<alyssa>
then you meant to type int32
<alyssa>
though the two passes are identical if you don't support fp16 (yet)
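[Sketch: the choice being discussed, written as a hypothetical per-driver helper. Both passes are real NIR passes taking just the shader; which one makes sense depends on whether the hardware has native 16-bit comparisons.]
    #include "nir.h"

    static void
    lower_bools_for_backend(nir_shader *nir, bool has_native_fp16)
    {
       if (has_native_fp16) {
          /* Booleans take the bit size of the comparison that produced
           * them, so 16-bit compares stay 16-bit. */
          NIR_PASS_V(nir, nir_lower_bool_to_bitsize);
       } else {
          /* Everything becomes a 0/~0 32-bit boolean. */
          NIR_PASS_V(nir, nir_lower_bool_to_int32);
       }
    }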
<dschuermann>
alyssa: with 1-bit bools that would be entirely awful... accessing every second bit? or accessing the upper half? we already have to emit a bunch of AND/OR to do the normal boolean phis
<austriancoder>
alyssa: maybe .. it fixed a problem so I used it.
NiksDev has quit [Ping timeout: 480 seconds]
<alyssa>
dschuermann: Nod. lower_bool_to_bitsize chews through it just fine, and has code to handle that case.
<alyssa>
critically, lower_bool_to_bitsize runs before we go out of SSA, whereas anything we do in the backend runs after going out of SSA
<alyssa>
after going out of SSA all bets are off to handle those cases
<alyssa>
(I realize ACO doesn't have this problem :-p)
<dschuermann>
:P
<alyssa>
dschuermann: rude :p
<dschuermann>
you'll want to cover all instructions which NIR lowers eventually in the vectorize_cb as well
<alyssa>
cover?
<dschuermann>
take care that they are kept vectorized
<alyssa>
bifrost IR is 100% vectorized for 8/16-bit
<alyssa>
(scalar instructions in NIR are forcibly replicated)
<alyssa>
it's.. kind of annoying actually
<dschuermann>
yeah, you can remove that stuff after the series is merged
<alyssa>
Hmm?
<alyssa>
I'm so lost
<dschuermann>
ok slowly: you will be able to scalarize these instructions in NIR, so you don't have to do that in the backend
<alyssa>
other way around
<alyssa>
Bifrost /does not/ support scalar 16-bit operations
<dschuermann>
ohhh :D
Mooncairn has joined #dri-devel
<alyssa>
16-bit instructions are /always/ vec2
<alyssa>
and if NIR asks for a scalar we just throw away the upper half
<dschuermann>
even transcendentals?
soreau has quit [Remote host closed the connection]
<alyssa>
uhhh
<alyssa>
except transcendentals
* alyssa
whistles
<dschuermann>
yeah, so you can scalarize these ;)
<alyssa>
we already can
soreau has joined #dri-devel
<dschuermann>
not without breaking the vectorization chain
<alyssa>
lower_alu_to_scalar + vectorize filter does it for us
<alyssa>
Ah, well, sure
<alyssa>
so I can expect moar fps with the series then?
<alyssa>
but not any different backend code :)
<dschuermann>
oh, didn't think about that
<dschuermann>
so, you vectorize everything and lower some instructions afterwards?
<alyssa>
scalarize everything and vectorize some afterwards
<alyssa>
(in NIR)
<alyssa>
this is stupid but it's the best we can do until the smart lower_to_scalar is merged
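[Sketch: the "scalarize everything, re-vectorize some of it afterwards" flow described above, with hypothetical function names. nir_lower_alu_to_scalar()'s filter returns true for instructions that should be split; the nir_opt_vectorize callback was still being reworked in the MR discussed here, so it is only indicated as a comment.]
    #include "nir.h"

    static bool
    scalarize_everything(const nir_instr *instr, const void *data)
    {
       /* Split every vector ALU op; packed fp16 vec2 gets rebuilt below. */
       return instr->type == nir_instr_type_alu;
    }

    static void
    bi_optimize_vector_sizes(nir_shader *nir)
    {
       NIR_PASS_V(nir, nir_lower_alu_to_scalar, scalarize_everything, NULL);

       /* Then re-vectorize what the hardware wants packed, e.g.
        * NIR_PASS_V(nir, nir_opt_vectorize, <callback>, NULL) with a
        * callback asking for vec2 on 16-bit ALU (vec4 on 8-bit) while
        * keeping transcendentals scalar. */
    }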
<dschuermann>
yeah, so my experience was that it's better to keep the instructions vectorized
<alyssa>
yes
<alyssa>
hence why I nerdsniped you into the smart lower_to_scalar patches
<alyssa>
and then made you think they were your idea?
<alyssa>
:_p
<alyssa>
("I don't remember that part." "Hmmm...")
<dschuermann>
haha :P
<dschuermann>
you should see the AMD FSR shaders with that series
<dschuermann>
pure gold
Mooncairn has left #dri-devel [Leaving]
<alyssa>
fsr?
<dschuermann>
fidelityFX(tm) upscaling stuff
iive has quit []
<mareko>
does anybody want GL_OVR_multiview?
<HdkR>
According to my logs there has been sparse interest periodically. With the idea that most users have moved to Vulkan
mbrost has quit [Ping timeout: 480 seconds]
pcercuei has quit [Quit: dodo]
stuartsummers has quit []
karolherbst_ has quit [Quit: Konversation terminated!]
<alyssa>
why is fcsel_gt a thing
<alyssa>
Added for r600 I guess
<anholt>
nir_to_tgsi would love to have it, too. but it needs it without icsel_gt :/
<alyssa>
not sure why you wouldn't do that fuse in the backend, though?
<alyssa>
What's the benefit of doing it in NIR?
<alyssa>
I guess opt_algebraic is nice
<anholt>
probably the instruction selection ease. having written comparison-walking code like that several times, it's no fun.
<alyssa>
yeah, I get that
rasterman has joined #dri-devel
<alyssa>
my views on this have shifted over time ... it's inevitable that mature backends write that code themselves since every hw has slightly different instructions available
<alyssa>
(whether manually walking or something like noltis)
<idr>
anholt, alyssa: I have a branch somewhere that splits the float and integer csel cases and adds the missing comparisons.
<alyssa>
but for young backends (or backends for extremely simple hardware), it is quite tempting to just map 1:1 to NIR
<alyssa>
(if "only" NIR supported this one extra instruction.. and this one.. and..)
<anholt>
I'm hoping the ntt RA branch takes some pressure off of the "make nir ops to match all tgsi ops"
<alyssa>
to be fair with the ntt RA branch, ntt becomes a compiler in its own right (as opposed to a NIR translator)
<alyssa>
so if ntt grows an optimizer in the mean time... meh
<alyssa>
dschuermann convinced me that good-enough linear-time backend isel doesn't have to be hard or require augmenting the IR if you have SSA
<alyssa>
not saying it's the way to go for ntt, I'm just a lot less scared of not doing everything in NIR
<alyssa>
^than I was when bringing up my first NIR backend