camus1 has quit [Remote host closed the connection]
camus has joined #dri-devel
gouchi has joined #dri-devel
pcercuei has joined #dri-devel
flom84 has joined #dri-devel
i509vcb has quit [Quit: Connection closed for inactivity]
camus has quit []
flom84 has quit [Ping timeout: 480 seconds]
glennk has joined #dri-devel
simon-perretta-img has joined #dri-devel
cdslooef^ has joined #dri-devel
sgruszka has joined #dri-devel
sgruszka has quit [Read error: Connection reset by peer]
kts has joined #dri-devel
rasterman has joined #dri-devel
gouchi has quit [Remote host closed the connection]
Company has joined #dri-devel
kts has quit [Ping timeout: 480 seconds]
mceier has quit [Quit: leaving]
mceier has joined #dri-devel
<jenatali>
karolherbst: I think I just figured out a way to do BDA in dzn, the exact same way that CLOn12 emulates pointers (idx+offset pairs), but instead of index being a locally bound array index, just make it a global resource ID which we already have for descriptor_indexing
<jenatali>
I don't think I would've come up with that if you hadn't suggested rusticl on zink on dzn, so thanks for that
YuGiOhJCJ has quit [Quit: YuGiOhJCJ]
<karolherbst>
jenatali: cool. But would that also work with arbitrary pointers?
<karolherbst>
though _might_ be fine
<karolherbst>
or at least for most things
<karolherbst>
jenatali: I think the only problem you'd need to solve is to keep the idx valid across kernel invocations for things like global variables or funky stuff kernels might do
<jenatali>
What do you mean arbitrary pointers? When the app asks for a buffer address, I just give back an index and offset
<jenatali>
Yeah, that's what I mean by a global index
<karolherbst>
sure, but applications can do random C nonsense
<karolherbst>
right..
<karolherbst>
yeah.. then it should be fine
<jenatali>
Yeah, it won't be stable for capture/replay but that's a different feature so that's fine
<karolherbst>
so set_global_bindings would return an index and offset packed into 64 bits, I pass this into the kernel via ubo0 (kernel arguments) and then it should be good to go
<jenatali>
Right
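[Editor's sketch] The "index and offset packed into 64 bits" idea above can be illustrated roughly as follows. This is a minimal sketch, not the dzn or rusticl implementation: the 32/32 bit split, the placement of the index in the high half, and the helper names `pack_pointer`, `unpack_pointer` and `add_offset` are all assumptions made for illustration; the log only states that an index and an offset share 64 bits.

```python
# Hypothetical layout (assumption): high 32 bits = global resource index,
# low 32 bits = byte offset into that resource.
INDEX_BITS = 32
OFFSET_BITS = 32
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def pack_pointer(index: int, offset: int) -> int:
    """Pack a (resource index, byte offset) pair into one 64-bit value."""
    if not 0 <= index < (1 << INDEX_BITS):
        raise ValueError("index does not fit in 32 bits")
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset does not fit in 32 bits")
    return (index << OFFSET_BITS) | offset


def unpack_pointer(ptr: int) -> tuple[int, int]:
    """Split a packed 64-bit value back into (resource index, byte offset)."""
    return ptr >> OFFSET_BITS, ptr & OFFSET_MASK


def add_offset(ptr: int, delta: int) -> int:
    """Pointer arithmetic only touches the offset half; the index is kept."""
    index, offset = unpack_pointer(ptr)
    return pack_pointer(index, (offset + delta) & OFFSET_MASK)


if __name__ == "__main__":
    p = pack_pointer(7, 256)       # buffer with global ID 7, byte 256
    q = add_offset(p, 16)          # what a kernel doing `ptr + 16` would see
    print(hex(p), unpack_pointer(q))
```

With the index in the high half, a plain 64-bit add on the packed value gives the same result as `add_offset` as long as the offset does not overflow its 32 bits, which is one reason a layout like this is convenient.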
<karolherbst>
and gallium doesn't use load_global(_constant) and store_global for anything, so you can deal with the madness there
neniagh_ has quit []
neniagh has joined #dri-devel
<karolherbst>
I wonder if I want to support different pointer layouts directly, but....
<jenatali>
Well I don't have that bindless path in the gallium driver currently, only in dozen
kts has joined #dri-devel
yyds has quit [Remote host closed the connection]
<karolherbst>
the CL path is really special sadly
<karolherbst>
we have this `set_global_bindings` api which is a bit funky...
<karolherbst>
but that's everything you'd need
<jenatali>
Yeah makes sense
<karolherbst>
luckily there are no bindless images or anything
<karolherbst>
and `set_global_bindings` basically means: give me the GPU address for those pipe resources, and make them available on compute dispatches
<karolherbst>
*for
<karolherbst>
there is also some funky offset business going on, but iris/radeonsi/zink have it correctly implemented
<karolherbst>
jenatali: uhm.. there is another thing: `pipe_grid_info::variable_shared_mem`, no idea if you can support that
<karolherbst>
how are CL local memory kernel parameters currently implemented on your side?
<jenatali>
Only by recompiling shaders
<karolherbst>
mhhh
<jenatali>
Same with local group size because that's a compile-time param in D3D
<karolherbst>
I see, so you have to deal with pain like that already anyway
<jenatali>
Yeah
<karolherbst>
kinda sucks, but not much you or I could do about it...
<jenatali>
karolherbst: btw, I noticed you're computing a dynamic local size by using gcd() with the SIMD (wave) size and the global size. That's always going to return 2 for even global sizes and 1 for odd, since SIMD sizes are powers of 2
<jenatali>
I was looking because CLOn12's handling of odd global dimensions was... Bad
<karolherbst>
yeah...
<karolherbst>
I reworked that code tho, just never landed it as it was part of non uniform workgroup support
<jenatali>
Cool
<karolherbst>
it doesn't matter anyway as most applications aren't silly enough to run into this edge case
<karolherbst>
can you support non uniform work groups?
<karolherbst>
if so.. doesn't matter long term anyway
<jenatali>
Not natively
<karolherbst>
mhhh
<jenatali>
karolherbst: apparently Photoshop does
<karolherbst>
figures...
<jenatali>
At least that's what one of our teams is telling me
<karolherbst>
yeah.. it makes perfect sense if they use image sizes for stuff
<karolherbst>
but uhhh.. why do you think I'm using the simd size with gcd? I'm using the thread count and the grid size
<karolherbst>
subgroups only as a last resort if things align really terribly
<karolherbst>
*SIMD size
<karolherbst>
`optimize_local_size` is what I'm looking at
<karolherbst>
so if you have 512 threads and a grid of 500x1x1, you'd get 500x1x1 still
<karolherbst>
it just has some weirdo edge cases where it uses terrible local sizes
<karolherbst>
I don't like the third part of that function and it could be better, but it's not _as_ bad
simon-perretta-img has quit [Ping timeout: 480 seconds]
<jenatali>
Hmm ok, I thought I saw SIMD size in there
simon-perretta-img has joined #dri-devel
<jenatali>
The gcd is still always going to be 2 or 1 though, since that thread count will also be a power of 2
neniagh has quit [Ping timeout: 480 seconds]
simon-perretta-img has quit [Ping timeout: 480 seconds]
simon-perretta-img has joined #dri-devel
neniagh has joined #dri-devel
<karolherbst>
it can be any pot number
<karolherbst>
if your gpu supports 1024 threads, you have 2^10 on one side, and anything else on the other one
<jenatali>
... Yeah that's what I meant
<jenatali>
A power of 2 or 1
<karolherbst>
ahh yeah, fair
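[Editor's sketch] The point settled above, that the gcd of a power-of-two thread count and an arbitrary grid size is itself "a power of 2 or 1", can be checked with a few lines. This is only an illustration of the arithmetic, not the `optimize_local_size` code; the function names are hypothetical.

```python
from math import gcd


def largest_pow2_divisor(n: int) -> int:
    """Largest power of two that divides n (n > 0), via the lowest set bit."""
    return n & -n


def gcd_with_pow2(threads: int, grid: int) -> int:
    """gcd of a power-of-two thread count and a grid size.

    Because `threads` has no prime factor other than 2, the result is the
    largest power of two dividing `grid`, capped at `threads`.
    """
    assert threads > 0 and threads & (threads - 1) == 0, "threads must be a power of two"
    return min(threads, largest_pow2_divisor(grid))


if __name__ == "__main__":
    for grid in (500, 501, 768, 4096):
        # The closed form agrees with math.gcd for every case.
        assert gcd_with_pow2(1024, grid) == gcd(1024, grid)
        print(grid, gcd(1024, grid))
```

So an even grid size does not always yield 2: a grid of 500 gives 4 and a grid of 768 gives 256 against 1024 threads, while any odd grid size gives 1.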
<karolherbst>
the last block is supposed to fill it up if the middle one couldn't find a pot of a SIMD size or bigger
<karolherbst>
so if the loop manages to set local to the SIMD size, fine, nothing else to do. I just wanted to prevent suboptimal distribution of threads