ChanServ changed the topic of #dri-devel to: <ajax> nothing involved with X should ever be unable to find a bar
guru__ has joined #dri-devel
<HdkR>
ooo, rusticl on Zink. That's a powerful OpenCL driver right there
nchery is now known as Guest318
nchery has joined #dri-devel
<DavidHeidelberg[m]>
I wonder if some highend nvidia with Zink -> rusticl would beat radeonsi in LLM
guru_ has quit [Ping timeout: 480 seconds]
bxkxknxnxmxkbnmm has joined #dri-devel
bxkxknxnxmxkbnmm has quit []
Guest318 has quit [Ping timeout: 480 seconds]
nchery has quit [Quit: Leaving]
bxkxknxnxmxkbnmm has joined #dri-devel
copper has joined #dri-devel
copper is now known as Guest322
Guest322 has quit []
<alyssa>
HdkR: ruzticl! \ o /
<HdkR>
:D
<zmike>
no, no, no, you know the z has to replace a vowel
<airlied>
rustzcl
<idr>
rustizl fo shizzle
benjaminl has quit [Ping timeout: 480 seconds]
<zmike>
let's just shorten to rzl
youmukonpaku133 has joined #dri-devel
youmukonpaku133 has quit []
youmukonpaku133 has joined #dri-devel
<DemiMarie>
Is it just me or is it rather silly for Zink to convert NIR to SPIR-V only for the Vulkan frontend to convert SPIR-V back to NIR again? It seems like there should be some sort of Mesa-only extension that allows short-circuiting this when Mesa is also the Vulkan implementation.
<youmukonpaku133>
that certainly does sound very silly to do
bxkxknxnxmxkbnmm has quit []
<zmike>
it's been a topic of multiple blog posts
<alyssa>
rzstzcl
kzd has joined #dri-devel
<DemiMarie>
zmike: any links?
<youmukonpaku133>
DemiMarie: supergoodcode.com, you'll find it there
co1umbarius has joined #dri-devel
columbarius has quit [Ping timeout: 480 seconds]
vignesh has joined #dri-devel
<kisak>
DemiMarie: doesn't sound silly to me when game devs can bake zink into their game, and shortly after the NIR running inside the game isn't ABI compatible with the NIR in mesa.
<DemiMarie>
kisak: what about when Zink is built as part of Mesa to support PowerVR? There the ABI _is_ guaranteed to match, so the round trip really is useless.
crcvxc has quit [Read error: Connection reset by peer]
<chaserhkj>
also, can anyone confirm if we really don't have such bypasses in mesa?
bbrezillon has joined #dri-devel
bbrezill1 has quit [Ping timeout: 480 seconds]
ickle_ has joined #dri-devel
ickle has quit [Read error: Connection reset by peer]
yuq825 has joined #dri-devel
Guest279 is now known as nchery
youmukonpaku133 has quit [Read error: Connection reset by peer]
youmukonpaku133 has joined #dri-devel
Haaninjo has joined #dri-devel
kxkamil2 has joined #dri-devel
kxkamil has quit [Ping timeout: 480 seconds]
Haaninjo has quit [Quit: Ex-Chat]
aravind has joined #dri-devel
frankbinns1 has joined #dri-devel
JohnnyonFlame has joined #dri-devel
frankbinns has quit [Ping timeout: 480 seconds]
fab has joined #dri-devel
bgs has quit [Remote host closed the connection]
sgruszka has joined #dri-devel
kzd has quit [Ping timeout: 480 seconds]
sima has joined #dri-devel
bmodem has joined #dri-devel
camus has joined #dri-devel
camus1 has quit [Ping timeout: 480 seconds]
fab has quit [Quit: fab]
Duke`` has joined #dri-devel
mripard has joined #dri-devel
crabbedhaloablut has joined #dri-devel
Company has quit [Quit: Leaving]
pcercuei has joined #dri-devel
alanc has quit [Remote host closed the connection]
alanc has joined #dri-devel
Duke`` has quit [Ping timeout: 480 seconds]
Leopold_ has quit [Ping timeout: 480 seconds]
sghuge has quit [Remote host closed the connection]
sghuge has joined #dri-devel
Duke`` has joined #dri-devel
tristianc67 has joined #dri-devel
tristianc6 has quit [Ping timeout: 480 seconds]
frieder has joined #dri-devel
JohnnyonFlame has quit [Read error: Connection reset by peer]
<cmarcelo>
anyone know if there is a particular reason vkrunner doesn't vendor vulkan headers (like mesa does)? it does make updating make-*.py scripts particularly hard, since they need to support an unspecified range of vulkan header versions...
fab has joined #dri-devel
milek7 has quit [Remote host closed the connection]
donaldrobson has joined #dri-devel
<sima>
airlied, can you ping me when you've done the msm pr and backmerge?
<sima>
so I can rebase the drm-ci branch
<sima>
robclark, koike ^^ if you have any updates for result files please send an incremental patch
lynxeye has joined #dri-devel
swalker__ has joined #dri-devel
swalker_ has joined #dri-devel
hansg has joined #dri-devel
swalker_ is now known as Guest362
milek7 has joined #dri-devel
swalker__ has quit [Ping timeout: 480 seconds]
tristianc670 has joined #dri-devel
milek7 has quit [Remote host closed the connection]
tristianc67 has quit [Ping timeout: 480 seconds]
tristianc6701 has joined #dri-devel
bmodem has quit [Ping timeout: 480 seconds]
dviola has left #dri-devel [WeeChat 4.0.3]
jkrzyszt has joined #dri-devel
dviola has joined #dri-devel
tristianc6704 has joined #dri-devel
youmukonpaku133 has quit [Read error: Connection reset by peer]
youmukonpaku133 has joined #dri-devel
tristianc670 has quit [Ping timeout: 480 seconds]
tristianc6701 has quit [Ping timeout: 480 seconds]
bmodem has joined #dri-devel
youmukonpaku133 has quit [Ping timeout: 480 seconds]
youmukonpaku133 has joined #dri-devel
<airlied>
sima: oh yeah I was going to do that, put it on top for tmrw
cmichael has joined #dri-devel
<eric_engestrom>
alyssa: for dEQP-EGL.functional.create_context.no_config IIRC it's a broken test, it assumes that if any GLES version is supported then all of them must be
<eric_engestrom>
I think there's an issue opened in the khronos tracker already, let me have a look
<eric_engestrom>
as for dEQP-EGL.functional.create_context.rgb888_no_depth_no_stencil it doesn't ring a bell, I'll have a look when I have some time
<alyssa>
eric_engestrom: Alright. How should we proceed then?
milek7 has joined #dri-devel
mripard has quit [Quit: mripard]
Nefsen402 has quit [Remote host closed the connection]
bl4ckb0ne has quit [Remote host closed the connection]
emersion has quit [Remote host closed the connection]
Nefsen402 has joined #dri-devel
bl4ckb0ne has joined #dri-devel
emersion has joined #dri-devel
fab has joined #dri-devel
gawin has joined #dri-devel
YuGiOhJCJ has joined #dri-devel
mripard has joined #dri-devel
<karolherbst>
soooo... nir_op_vec8 and nir_op_vec16 lowering. `nir_lower_alu_width` doesn't lower them on purpose and I was wondering what would be the best idea to deal with those
<karolherbst>
I can also just lower them in rusticl though
<eric_engestrom>
alyssa: for dEQP-EGL.functional.create_context.no_config I don't think there's a way to query which GLES version is supported by a given driver, so I'm inclined to consider the whole test invalid and it should be deleted
<eric_engestrom>
but that's as far as I got when I looked into that a long time ago, maybe I missed something and it's possible to make this test valid
<alyssa>
karolherbst: you shouldn't be getting them at all, if you lower width of all ALU and memory and then run opt_dce
<karolherbst>
yeah.. that's what I thought, but I still have them.. but maybe it's just bad pass ordering... I'll deal with it once I've dealt with other vulkan validation errors :D
<alyssa>
karolherbst: Still have them in what context?
<alyssa>
What is reading them?
<alyssa>
Oh, one other thing, you need to copyprop too
<karolherbst>
yeah.. I think I do all of this, might just be in the wrong order
<karolherbst>
or something weird
<karolherbst>
or maybe something I've done fixed it now... I have to find a test again which is running into this
<alyssa>
yeah
<alyssa>
This should work
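(A hedged sketch of the pass ordering alyssa describes above, for a fully scalar backend: scalarize ALU width, then iterate copy-prop and DCE until the now-unused nir_op_vec8/vec16 instructions go away. The shader variable `nir` and the pass arguments are illustrative; vectorized load/store widths would need their own lowering, not shown here.)

```c
/* Sketch: scalar-backend ordering so no vec8/16 survive.
 * A NULL callback makes nir_lower_alu_width scalarize everything. */
NIR_PASS(_, nir, nir_lower_alu_width, NULL, NULL);

bool progress;
do {
   progress = false;
   NIR_PASS(progress, nir, nir_copy_prop);  /* forward the individual channels */
   NIR_PASS(progress, nir, nir_opt_dce);    /* drop the now-dead vec8/vec16 */
} while (progress);
```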
chaserhkj has left #dri-devel [#dri-devel]
Leopold has joined #dri-devel
yyds has quit [Remote host closed the connection]
Duke`` has quit [Ping timeout: 480 seconds]
<zamundaaa[m]>
robclark: btw what happened to the dma fence deadline stuff?
pekkari has joined #dri-devel
bmodem has quit [Ping timeout: 480 seconds]
bmodem has joined #dri-devel
bmodem has quit [Remote host closed the connection]
bmodem has joined #dri-devel
greenjustin has joined #dri-devel
psykose has joined #dri-devel
<Lynne>
DavidHeidelberg[m]: 250 tokens, 26 seconds for rusticl on zink on a 6000 ada, 19 seconds on 6900XTX (rusticl on radeonsi)
greenjustin has quit [Remote host closed the connection]
greenjustin has joined #dri-devel
<austriancoder>
alyssa: which nir pass lowers vec16, vec8 to vec4?
<alyssa>
none
<alyssa>
but if you lower everything that reads vec16 into smaller things, then there's nothing left to write vec16 either
youmukonpaku133 has joined #dri-devel
<austriancoder>
ah..
bmodem has quit [Ping timeout: 480 seconds]
<robclark>
zamundaaa[m]: all the non-uabi stuff landed.. the uabi stuff is just waiting for someone to implement some userspace support
<zamundaaa[m]>
I can take care of that
youmukonpaku133 has quit [Ping timeout: 480 seconds]
youmukonpaku133 has joined #dri-devel
<emersion>
gfxstrand: would VK_EXT_host_image_copy be useful for a wayland compositor?
crcvxc has quit [Read error: Connection reset by peer]
Duke`` has joined #dri-devel
alyssa has left #dri-devel [#dri-devel]
<Lynne>
I can't think of a use for it, besides less blocky image downloads (copy to optimally tiled host image, convert layout on the CPU?)
<emersion>
i think the big win would be to get rid of the staging buffer on setups where that makes sense
<emersion>
this ext looks a lot like OpenGL's "here's a pointer to my host memory, plz do whatever to upload to GPU"
<Lynne>
you can do that already by host mapping
<emersion>
you mean VK_EXT_external_memory_host?
<Lynne>
yes
<Lynne>
you don't even need page size alignment for the source pointer
<emersion>
i never managed to fully understand how this ext works
<Lynne>
alloc a buffer, and allocate memory for it, only instead of allocating, chain a struct to use a host memory pointer
<emersion>
i got some stride alignment issues iirc, and no way to discover the required alignment
<emersion>
or something
<Lynne>
that then allows you to let the GPU copy
<Lynne>
nnnope
<emersion>
would need to check again
<Lynne>
you can workaround all alignment issues
<emersion>
also VK_EXT_external_memory_host might required pinned memory when VK_EXT_host_image_copy doesn't?
<emersion>
not sure
<emersion>
require*
<Lynne>
pinned memory?
<Lynne>
host_image_copy is slower, not sure why you'd want to use it
youmukonpaku133 has quit [Ping timeout: 480 seconds]
<karolherbst>
mhh
<karolherbst>
so the vec8 thing I'm seeing looks like this:
<emersion>
the host memory region can't be evicted from RAM while the driver does the upload, so that makes the kernel unhappy
<emersion>
i'm not sure about the details
<Lynne>
it's a mapping, not a buffer
<Lynne>
it's not even a cached mapping
<emersion>
VK_EXT_external_memory_host is not a mapping, it's "here's my host pointer, please make a VkBuffer from it", no?
youmukonpaku133 has joined #dri-devel
<Lynne>
yes, it's up to how the driver implements it
<Lynne>
as for alignment, like I said, although the extension requires the source pointer is aligned, you can simply pick any arbitrary address nearby which is aligned, then during the copy buffer to image command, you can offset the start point
Company has joined #dri-devel
<Lynne>
both libplacebo and ffmpeg have been doing this for years with no bugs on any platform
<haasn>
yeah it's UB but it works on all platforms I care about
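(A minimal sketch of the host-pointer import Lynne describes: allocate the VkBuffer's memory with a VkImportMemoryHostPointerInfoEXT chained in instead of a plain allocation. `device`, `buf`, `host_ptr` and `size` are assumed to exist, the buffer is assumed to have been created with VkExternalMemoryBufferCreateInfo chained, and the pointer/size are assumed aligned to minImportedHostPointerAlignment; an unaligned client pointer would be aligned down and the difference applied as an offset in the copy, per the workaround above. `pick_memory_type()` is a hypothetical helper.)

```c
VkMemoryHostPointerPropertiesEXT props = {
   .sType = VK_STRUCTURE_TYPE_MEMORY_HOST_POINTER_PROPERTIES_EXT,
};
vkGetMemoryHostPointerPropertiesEXT(device,
   VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT, host_ptr, &props);

VkImportMemoryHostPointerInfoEXT import = {
   .sType = VK_STRUCTURE_TYPE_IMPORT_MEMORY_HOST_POINTER_INFO_EXT,
   .handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT,
   .pHostPointer = host_ptr,
};
VkMemoryAllocateInfo alloc = {
   .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
   .pNext = &import,
   .allocationSize = size,
   .memoryTypeIndex = pick_memory_type(props.memoryTypeBits), /* hypothetical */
};
VkDeviceMemory mem;
vkAllocateMemory(device, &alloc, NULL, &mem);
vkBindBufferMemory(device, buf, mem, 0);
/* buf is now a valid source for vkCmdCopyBufferToImage; the stride goes in
 * VkBufferImageCopy::bufferRowLength, as discussed below. */
```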
<austriancoder>
karolherbst: If you find a solution.. tell me
<karolherbst>
I mean.. what should copy_prop do in this case?
<emersion>
however
<karolherbst>
it all neatly works once I just scalarize alus
<emersion>
there is no way to communicate the stride when importing
<emersion>
VkImportMemoryHostPointerInfoEXT just has a host pointer and that's it
<austriancoder>
karolherbst: for my vec4 hw I would love to have 4x vec4 instead of the one vec16
<karolherbst>
austriancoder: anyway, I kinda prefer dealing with all of this in the frontend; unless you have other use cases for vec8/16 you can probably start dropping the code
<karolherbst>
yeah...
<karolherbst>
zink also doesn't want to scalarize
<Lynne>
emersion: you communicate the stride during the buffer copy command
<karolherbst>
I suspect we'll have to handle vec8/vec16
<Lynne>
same as usual
<austriancoder>
karolherbst: I would love if you can do it in the frontend (rusticl)
<karolherbst>
that should make almost everything work
<emersion>
last time i looked at this i was a vulkan beginner :S
<karolherbst>
there are just a handful of tests where it doesn't
<emersion>
okay, i'll give it another go
<emersion>
Lynne: also, for a wayland compositor which doesn't want to block, offloading host memory copies to a separate thread would make sense right?
<Lynne>
for downloads, which are required to be synchronous, yes, otherwise for uploads, no, it's the GPU doing them
<Lynne>
(not that they're required to be synchronous, it's just that most things you'd like to do with a buffer want it to be fully written already, if you carry around a fence it's not a problem)
<robclark>
zamundaaa[m]: ok, cool.. I expect the uabi patches should still apply cleanly but lmk if you need me to re-post
<zamundaaa[m]>
robclark I'm not 100% sure which patches I need to pick. Do you have a branch somewhere to make it easier to test?
<Lynne>
emersion: if you want three separate versions, but I don't think host_image_copy is popular enough yet, and pretty much everything implements host mapping
<emersion>
yeah
Duke`` has quit [Ping timeout: 480 seconds]
gawin has quit [Ping timeout: 480 seconds]
greenjustin_ has joined #dri-devel
<gfxstrand>
emersion: Which is optimal is going to depend on the hardware and the behavior of the app. If it's changing every frame and you can sample directly from it, VK_EXT_external_memory_host is probably faster. If it's going to sit there for a couple of frames, host copy is probably faster.
<emersion>
gfxstrand: by VK_EXT_external_memory_host, i don't mean sample directly from the host memory VkBuffer
<gfxstrand>
You mean CopyBufferToImage?
<emersion>
there would still be a copy to a GPU buffer
<gfxstrand>
In that case, I'd assume a host image copy is the fastest.
<emersion>
wrap my shm client's buffer with a VK_EXT_external_memory_host VkBuffer, then copy that buffer to the final image
<gfxstrand>
host buffers have all sorts of potential perf problems.
<emersion>
hm, can you elaborate?
<karolherbst>
or maybe we should just ask drivers to deal with nir_op_vec8/16 as otherwise we might risk devectorizing load/stores or something.. dunno.. maybe it's not a big deal at this vector size. Maybe I should just add a flag to nir_lower_alu_width indeed, or let the filter deal with this
<emersion>
is there a way i can use them to avoid the problems?
<karolherbst>
but it's messy
<gfxstrand>
karolherbst: Currently, even fully scalar drivers have to deal with vec8/16
<karolherbst>
yeah...
<karolherbst>
for zink that's just super annoying
<gfxstrand>
You can break it up, it's just tricky.
<karolherbst>
sure
<karolherbst>
but I think `nir_lower_alu_width` is probably the best place for it, no?
<karolherbst>
or a special pass?
<gfxstrand>
IDK
<gfxstrand>
IMO, a Zink-specific pass is probably the way to go
<karolherbst>
yeah.. probably
<gfxstrand>
Something that replaces any read of vec8/16 with creating a subvec.
<karolherbst>
austriancoder: how much pain would it be for you to _only_ deal with nir_op_vec8/16 opcodes and everything else is vec4 at most?
<gfxstrand>
Then copy-prop should delete the vec8/16
<gfxstrand>
karolherbst: Oh, right... there are vec4 GPUs that exist. *sigh*
<karolherbst>
yeah.. it's probably easier to deal with this from the alu op perspective
greenjustin has quit [Ping timeout: 480 seconds]
<karolherbst>
yeah....
<gfxstrand>
So, yeah, maybe make it general but we probably do want to make it special.
<gfxstrand>
For scalar drivers, it's easier to just handle it.
<karolherbst>
I was also thinking about adding a flag to nir_lower_alu_width and then it moves sources into vec4
<karolherbst>
or something
<karolherbst>
like
<karolherbst>
if an alu references a vec8 source, just extract the relevant channels into a vec4 and consume this
<karolherbst>
and then hope every vec8/16 gets removed
<karolherbst>
which it kinda should
<robclark>
zamundaaa[m]: ok, cool, that is just what I was waiting for to re-post..
<karolherbst>
anyway, for scalar it already works
<karolherbst>
or at least lowering vec8/16 to vec4 works, because the scalarize thing scalarizes stuff :D
<karolherbst>
anyway, will play around with nir_lower_alu_width then
<gfxstrand>
emersion: Couple of things. One is that it will always live in system RAM and have all the access problems that come with that. In theory a CopyBufferToImage *should* be fast but it also means round-tripping to the kernel to create a BO and kicking something to the GPU and tracking that host image buffer etc. The CopyMemoryToImage case should do a WC write from the CPU and there's no extra
crabbedhaloablut has quit [Remote host closed the connection]
<gfxstrand>
resources to track. Both should be able to saturate PCIe, I think. The resulting image should be about as fast, probably.
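(A minimal sketch of the CopyMemoryToImage path gfxstrand describes, under VK_EXT_host_image_copy: no staging buffer, no command buffer, no GPU submit. `device`, `image`, `pixels`, `width` and `height` are illustrative; the image is assumed created with VK_IMAGE_USAGE_HOST_TRANSFER_BIT_EXT and already in a host-copyable layout.)

```c
VkMemoryToImageCopyEXT region = {
   .sType = VK_STRUCTURE_TYPE_MEMORY_TO_IMAGE_COPY_EXT,
   .pHostPointer = pixels,       /* tightly packed when rowLength/imageHeight == 0 */
   .memoryRowLength = 0,
   .memoryImageHeight = 0,
   .imageSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1 },
   .imageExtent = { width, height, 1 },
};
VkCopyMemoryToImageInfoEXT info = {
   .sType = VK_STRUCTURE_TYPE_COPY_MEMORY_TO_IMAGE_INFO_EXT,
   .dstImage = image,
   .dstImageLayout = VK_IMAGE_LAYOUT_GENERAL,
   .regionCount = 1,
   .pRegions = &region,
};
vkCopyMemoryToImageEXT(device, &info);   /* CPU-side write, driver handles tiling */
```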
<austriancoder>
karolherbst: not much .. I think .. but I would spend an hour or so on some nir magic
<karolherbst>
mhh.. yeah...
<karolherbst>
for scalar backends all this stuff is trivial
<gfxstrand>
emersion: So I guess IDK which is going to be faster in an absolute sense. Host copies are certainly easier and involve less tracking and less stuff you have to kick to the GPU.
<gfxstrand>
Probably depends on where your system is loaded.
<karolherbst>
but vectorized things are annoying as you can have random swizzles and that nonsense
phasta has quit [Ping timeout: 480 seconds]
crabbedhaloablut has joined #dri-devel
<austriancoder>
yeah..
<karolherbst>
though a pass to materialize num_component > 4 sources into their own vec4 shouldn't be tooo hard
<karolherbst>
probably pretty straight forward
<karolherbst>
I'll write something up
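(A hypothetical sketch of the pass karolherbst is describing: when an ALU source points at a nir_op_vec8/vec16, pull just the channels that source reads into a narrow vec and rewrite the source, so copy-prop/DCE can delete the wide vec afterwards. Non-ALU users such as intrinsics and phis are ignored; it would be run via nir_shader_instructions_pass and then followed by nir_copy_prop and nir_opt_dce.)

```c
static bool
lower_wide_vec_srcs(nir_builder *b, nir_instr *instr, void *data)
{
   if (instr->type != nir_instr_type_alu)
      return false;

   nir_alu_instr *alu = nir_instr_as_alu(instr);
   bool progress = false;

   for (unsigned i = 0; i < nir_op_infos[alu->op].num_inputs; i++) {
      nir_instr *parent = alu->src[i].src.ssa->parent_instr;
      if (parent->type != nir_instr_type_alu)
         continue;

      nir_alu_instr *vec = nir_instr_as_alu(parent);
      if (vec->op != nir_op_vec8 && vec->op != nir_op_vec16)
         continue;

      b->cursor = nir_before_instr(instr);

      /* Rebuild only the channels this source actually reads. */
      unsigned num = nir_ssa_alu_instr_src_components(alu, i);
      nir_ssa_def *comps[NIR_MAX_VEC_COMPONENTS];
      for (unsigned c = 0; c < num; c++)
         comps[c] = nir_channel(b, &vec->dest.dest.ssa, alu->src[i].swizzle[c]);

      nir_instr_rewrite_src(instr, &alu->src[i].src,
                            nir_src_for_ssa(nir_vec(b, comps, num)));
      for (unsigned c = 0; c < num; c++)
         alu->src[i].swizzle[c] = c;
      progress = true;
   }
   return progress;
}
```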
<gfxstrand>
karolherbst: So the thing I think scalar back-ends want to avoid is having extra vec instructions that get turned into extra movs that it has to copy-prop out later.
<karolherbst>
yeah..
<karolherbst>
this should absolutely be optional
<gfxstrand>
Anything which actually takes a vector, like a texture op or intrinsic will get a vector that's already the right width.
<gfxstrand>
They really exist just for weird ALU swizzle cases.
<karolherbst>
yep
<gfxstrand>
Which I'm now questioning....
<emersion>
gfxstrand: hm if i keep the host pointer VkBuffer around, does that help the kernel not re-create the BO over and over again?
<gfxstrand>
Like, if all your ALU ops are scalar, why can't copy-prop get rid of the vector...
<gfxstrand>
:thinking:
<karolherbst>
it works for scalar
<gfxstrand>
emersion: Yeah. There is weird locking stuff in the kernel, though, and you'll end up paying that cost on every submit, even submits that aren't your copy (thanks, bindless)
<gfxstrand>
emersion: But we're seriously micro-optimizing at that point.
<emersion>
i do get that host memory is more work for the driver/kernel, since it's work i'm not doing anymore :P
<emersion>
ok
<emersion>
i guess i'll just try and make sure we don't regress
<emersion>
and see how it goes on my hw collection
<gfxstrand>
sounds good
<gfxstrand>
But, generally, the host copy path is probably the path you're going to get in a GL driver when you TexSubImage with properly matching formats.
<gfxstrand>
(Where by host copy I mean CopyMemoryToImage)
alyssa has joined #dri-devel
<emersion>
right
<alyssa>
DavidHeidelberg[m]: FWIW, I disabled the T720 job
<alyssa>
It seems that either the T720 devices are down, or they're absolutely overloaded right now, or a combination of the two
<emersion>
that's exactly what my mental model was :P "VK_whatever_gl_does"
<emersion>
gfxstrand: so this host copy path potentially allocates memory?
<alyssa>
marge pipeline had t720 waiting for a device for 45+ minutes without making progress, so I pushed the disable.
<emersion>
or would a driver that needs to allocate memory just not implement the ext?
<alyssa>
that marge pipeline was lost (reassigned, we'll try this again) but no other collateral damage and hopefully the rest of the collabora farm is ok
<gfxstrand>
emersion: No, it shouldn't allocate memory.
<gfxstrand>
Well, it might need to malloc() a bit depending on $DETAILS
<emersion>
ok, good
alyssa has left #dri-devel [#dri-devel]
<gfxstrand>
But it shouldn't need to allocate GPU memory
<emersion>
sure, i meant memory for buffer
<gfxstrand>
And it won't submit anything to the GPU
<emersion>
ah, interesting
<emersion>
right
<emersion>
no VkCommandBuffer
<gfxstrand>
On Intel, it'll be an isl_tiled_memcpy()
<gfxstrand>
Feel free to go read that eldritch horror if you wish. I cannot guarantee your sanity will be intact on the other end, though. :joy:
<gfxstrand>
It's even able to use SSE to do BGRA swizzles!
guru_ has joined #dri-devel
<emersion>
lol
<gfxstrand>
I'm hopeful that the NVK implementation will be substantially more sane but that's still TBD.
<gfxstrand>
Some of it is just because tiling is complicated
<gfxstrand>
in general
oneforall2 has quit [Ping timeout: 480 seconds]
gawin has joined #dri-devel
gawin has quit []
guru__ has joined #dri-devel
cmichael has quit [Quit: Leaving]
guru_ has quit [Ping timeout: 480 seconds]
youmukonpaku133 has quit [Ping timeout: 480 seconds]
youmukonpaku133 has joined #dri-devel
junaid has joined #dri-devel
alyssa has joined #dri-devel
<alyssa>
gfxstrand: I'm not sure what to make of host copy for agx or mali
<alyssa>
We have a CPU path for tiled images, but the GL driver almost never uses it
<alyssa>
since most of the time we use compressed images, which are accessed via a GPU blit to/from a linear image
<alyssa>
the tiled CPU path is used for ASTC/BCn and that's usually it
<alyssa>
so I can't think of a situation where host image copy would ever be the right thing to do
<alyssa>
for this class of hw
<alyssa>
CopyBufferToImage (or vice-versa) or maybe BlitImage with a Linear image will typically be the optimal path
<alyssa>
de/compressing on the CPU is a nonstarter
Leopold has quit [Remote host closed the connection]
alyssa has left #dri-devel [#dri-devel]
guru__ has quit [Read error: No route to host]
oneforall2 has joined #dri-devel
Leopold has joined #dri-devel
vliaskov has quit [Read error: No route to host]
<daniels>
DavidHeidelberg[m]: well alyssa is gone, but https://gitlab.freedesktop.org/krz/mesa/-/jobs/47878513 went through ~immediately, and those boards are extremely under-loaded, so I think it's just some weird temporal issue; perhaps a recurrence of the old issue where gitlab just wasn't handing out jobs when it should
<karolherbst>
anyway.. I think it's fair for rusticl to reduce the vecs to vec4 and then we can drop a bunch of vec8/16 code from backends again... or maybe we keep it and drivers can say what they support
<karolherbst>
or
<karolherbst>
drivers can lower it themselves
<karolherbst>
dunno
<karolherbst>
for now I just want something which works
junaid has quit [Remote host closed the connection]
<karolherbst>
and only if the source comes from a vec? mhh...
guru__ has quit [Ping timeout: 480 seconds]
sgruszka has quit [Remote host closed the connection]
guru__ has joined #dri-devel
guru_ has quit [Read error: No route to host]
guru_ has joined #dri-devel
benjamin1 has joined #dri-devel
<karolherbst>
Pass 2350 Fails 68 Crashes 36 Timeouts 0: 100% and 60 of those fails are ignorable image tests with NEAREST filtering... I think that's probably as good as it needs to be to focus on cleaning up that MR :D
<karolherbst>
(that's on ANV btw)
<gfxstrand>
\o/
<gfxstrand>
karolherbst: Ugh... copy lowering is going to be a mess.
<karolherbst>
radv is a bit worse, but ACO doesn't support any of the CL nonsense, so that's not surprising
<gfxstrand>
I think we need a per-stage array of nir_variable_mode which supports indirects.
<gfxstrand>
We have so many things which are currently keying off of misc things
<gfxstrand>
If nir_lower_var_copies were happening entirely in drivers, it'd be okay, but it's not.
<karolherbst>
last radv run was "Pass 2222 Fails 15 Crashes 217 Timeouts 0: 100%", wondering how much that final vec8/16 lowering will help
<karolherbst>
lvp is a bit better
<karolherbst>
anyway
<karolherbst>
good enough on all drivers
<gfxstrand>
NIR lowering has started becomming an eldritch horror
<karolherbst>
yeah....
<karolherbst>
it's a mess
<gfxstrand>
s/started becoming/become/
<gfxstrand>
It happened a long time ago
youmukonpaku133 has quit []
<karolherbst>
I also procrastinated reworking the nir pipeline, because.......
youmukonpaku133 has joined #dri-devel
<karolherbst>
it works and I don't want to break it
<gfxstrand>
At some point, I should do an audit and make some opinionated choices.
<karolherbst>
yeah
<gfxstrand>
But ugh.....
<karolherbst>
that would help
<karolherbst>
:D
<gfxstrand>
There's no way I don't break the universe in that process.
<gfxstrand>
And I'd much rather write NVK code
<karolherbst>
fair enough
guru__ has quit [Ping timeout: 480 seconds]
<karolherbst>
you can just comment on bad ordering and I can figure out the details
<gfxstrand>
Like WTH is st_nir_builtins.c lowering var copies?!?
<karolherbst>
heh
<karolherbst>
:D
<gfxstrand>
It's worse than that
<gfxstrand>
I can't just drop some comments in places.
<karolherbst>
there are probably good reasons for it (tm)
<gfxstrand>
Oh, I'm sure there is a reason
<gfxstrand>
There's a reason for every line of code
<karolherbst>
ahh you meant the entire st/mesa stuff?
<gfxstrand>
Whether or not it's a good reason is TBD.
<gfxstrand>
Oh, I mean EVERYTHING
<karolherbst>
uhhh
<gfxstrand>
Vulkan, OpenCL, st/mesa, various back-ends. The lot.
<karolherbst>
maybe... work on NAK instead and hope we can just drop gl drivers in the future?
<zmike>
<gfxstrand> There's a reason for every line of code
<karolherbst>
or rather.. some stages of nir where certain things are allowed/forbidden
<karolherbst>
but uhhh
<karolherbst>
no idea really
<gfxstrand>
I think the #1 thing we need is someone with the appropriate credentials to be opinionated.
<gfxstrand>
Which basically boils down to that.
<gfxstrand>
We may also want some helpers to assert said opinions.
<gfxstrand>
I've mostly looked the other way while the st/mesa stuff was built because, well, Ew, GL.
<karolherbst>
fair
<gfxstrand>
That was maybe a mistake.
<gfxstrand>
In any case, I think it's well past time for a cleanup. Sadly, there are very few people who really understand all of NIR and the various APIs in play and back-end needs well enough to do that clean-up.
<gfxstrand>
And by "very few" I mean "me, maybe". :joy:
<karolherbst>
oh wow.. my vec8/16 patch really helped a lot with lvp... from ~200 crashes down to ~50... but a bunch of fails, but nothing serious
<karolherbst>
so I guess what I have kinda works now...
<karolherbst>
uhhh
<karolherbst>
pain
<karolherbst>
I wish it wouldn't be so hacky
<daniels>
DavidHeidelberg[m]: huh, is CelShading new on T860? it hangs reliably on your MR …
<Lynne>
airlied: could you push !24572?
lynxeye has quit [Quit: Leaving.]
junaid has joined #dri-devel
youmukonpaku133 has quit [Ping timeout: 480 seconds]
<karolherbst>
I should probably check if it runs on nvidia's driver by now...
<karolherbst>
uhhh.. there are things like ieq %44.abcd, %45 (0x0).xxxx which my code doesn't account for.. pain
<karolherbst>
that was easy to fix
<anarsoul>
what does GL spec say on empty fragment shader, i.e. void main(void) {}, without actually writing any outputs?
<gfxstrand>
I think you get undefined garbage
<Lynne>
speaking of rusticl on nvidia on zink, it runs but it generates junk
<karolherbst>
Lynne: try my most recent rusticl/tmp branch and run with RUSTICL_DEBUG=sync
<karolherbst>
I won't promise it will work, but it should work better
heat has joined #dri-devel
<karolherbst>
something is up with the fencing with zink and rusticl and I stll have to figure that out
<karolherbst>
but that _might_ be enough to run luxball or something
<karolherbst>
Lynne: but is it faster than nvidia's CL stack
<anarsoul>
shaders/tesseract/270.shader_test has a FS with an empty body
<Lynne>
what was the envvar to select an opencl device again?
<karolherbst>
uhhh...
<zmike>
anarsoul: not sure exactly what you're asking
<zmike>
it just does nothing
<karolherbst>
you can normally do that in apps
<anarsoul>
zmike: I'm asking what is expected result of it. gfxstrand already replied that it's undefined
<zmike>
you mean the result in the framebuffer?
<anarsoul>
yeah
<karolherbst>
do we have a helper to extract a const channel used by an alu instruction?
<Lynne>
llama.cpp just picks the first device afaik
<karolherbst>
ehh.. maybe I just use nir_build_imm
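(A small sketch of that idea: for a constant ALU source like the (0x0).xxxx above, nir_build_imm can materialize just the channel the swizzle selects as a fresh scalar immediate. `alu`, `i` and the builder `b` are illustrative.)

```c
nir_const_value *cv = nir_src_as_const_value(alu->src[i].src);
if (cv) {
   unsigned bit_size = nir_src_bit_size(alu->src[i].src);
   nir_const_value chan = cv[alu->src[i].swizzle[0]];  /* channel the swizzle picks */
   b->cursor = nir_before_instr(&alu->instr);
   nir_ssa_def *imm = nir_build_imm(b, 1, bit_size, &chan);
   nir_instr_rewrite_src(&alu->instr, &alu->src[i].src, nir_src_for_ssa(imm));
   alu->src[i].swizzle[0] = 0;
}
```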
<zmike>
anarsoul: "Any colors, or color components, associated with a fragment that are not written by the fragment shader are undefined."
<zmike>
from 4.6 compat spec
<zmike>
but this is in the compat language section
<anarsoul>
zmike: thanks!
<Lynne>
found a workaround, yes, nvidia's opencl stack is faster
<karolherbst>
not the answer I wanted to hear :P
<Lynne>
around 13 times faster
<karolherbst>
uhh
<karolherbst>
annoying
<karolherbst>
that's quite a bit
<karolherbst>
which benchmark did you use? luxball?
<Lynne>
llama.cpp
<karolherbst>
ahh mayb that's expected
<Lynne>
I'll try ffmpeg's nlmeans filter
<karolherbst>
nah, most of those benchmarks actually test something else
<karolherbst>
the only I know is realibly testing kernel execution speed is luxmark
<karolherbst>
the API overhead is..... significant if you do debug builds and it's not very optimized anyway
<karolherbst>
mostly interested in raw kernel execution speed, because that's probably the hardest part to get really performant
<karolherbst>
and I also don't know if other benchmarks actually validate the result or not
<DavidHeidelberg[m]>
<daniels> "David Heidelberg: huh, is..." <- Old, very old. Something is wrong.
<Lynne>
okay, ffmpeg's nlmeans_opencl filter is a little bit faster on zink on nvidia!
<karolherbst>
yeah, but is the result correct?
<karolherbst>
zmike: ^^ you know what to do
<zmike>
MICHAEL
<zmike>
MICHAEL WHERE IS THE POST
<karolherbst>
but being faster with luxball would be impressive
<karolherbst>
because that's actually 100% kernel throughput
<karolherbst>
and no funny business
<karolherbst>
it's basically benchmarking the compiler
<Lynne>
looks correct
<karolherbst>
the issue with llama.cpp is that it probably uses fp16 and other funky extensions we don't support, and probably matrix stuff and.. uhh.. cuda on nvidia
<karolherbst>
so dunno if that's even remotely comparable
<karolherbst>
also.. it's heavily optimized for nvidia
<karolherbst>
Lynne: nice :)
<karolherbst>
okay.. I'm getting close ...
heat_ has joined #dri-devel
<karolherbst>
maybe official zink conformance this year would be possible :D
<Lynne>
nah, llama.cpp's cl code is pretty meh, pretty much it all is except cuda
<karolherbst>
fair
heat has quit [Remote host closed the connection]
<karolherbst>
but did you compare zink vs cuda or cl?
<Lynne>
I should try fp16 though, is it supported on zink?
<karolherbst>
no
gouchi has joined #dri-devel
<karolherbst>
it's not supported in rusticl at all atm
<karolherbst>
for.... annoying reasons
<Lynne>
I meant through the features flag?
<karolherbst>
"Pass 2374 Fails 69 Crashes 11 Timeouts 0: 100%" getting close (and again, like 60 fails won't matter)
<Lynne>
correctness is another chapter
<zmike>
fp16 is supported in zink
<karolherbst>
RUSTICL_FEATURES=fp16
Duke`` has joined #dri-devel
<karolherbst>
I'll probably test it on my GPU later this week :D but for now I want it to work on mesa drivers
<Lynne>
fp16 output is correct, but it's not any faster
<karolherbst>
for conformance over zink, would 3 mesa drivers count or do I have to use different vendors?
<karolherbst>
annoying
<karolherbst>
maybe it's not using it? maybe it's broken? maybe all of it? I have no idea :)
<zmike>
you'd need to see what the conformance requirements are for CL
<karolherbst>
it's mostly blocked on libclc not supporting it
<zmike>
in GL it's 2
<Lynne>
ggml_opencl: device FP16 support: true
<Lynne>
it is using it
<karolherbst>
ahh
<karolherbst>
I suspect the slowness comes from something else
<karolherbst>
like the API
<karolherbst>
rusticl still is doing a few silly things
<zmike>
better drop everything and debug this karol
<karolherbst>
but that lama.cpp actually works is kinda neat
<zmike>
your p-score is plummeting
<karolherbst>
yeah I know
<karolherbst>
that's why I want a higher score on luxball than nvidia
<zmike>
absolute freefall
<Lynne>
out for blood, eh? good luck
<karolherbst>
zmike: right.. but it can't be like 2 mesa drivers, can it?
<zmike>
if it's a hw implementation then the vendor has to have submitted a conformant driver
<zmike>
so like zink+anv would need a conformant intel GL driver
<zmike>
from the company itself
<zmike>
(iris or proprietary in this case since both are owned by intel)
<karolherbst>
sure
<zmike>
so yes, it could be mesa
<karolherbst>
but would zink+anv and zink+radv be enough or would you have to use a different vk driver
<zmike>
if it meets the CL requirements it'd be fine
<karolherbst>
okay
<zmike>
I only know the GL ones
<zmike>
just mail neil and ask, not sure there's been many conformant layered impls
heat_ has quit [Read error: Connection reset by peer]
<zmike>
or whoever the CL chair is
heat has joined #dri-devel
<karolherbst>
mhhh
<karolherbst>
for CL they don't specify much
<karolherbst>
I can ask on the next meeting though
<karolherbst>
but yeah, neil still is the chair
<karolherbst>
ohh wait, they say two
<karolherbst>
"For layered implementations, for each OS, there must be Successful Submissions using at least two (if
<karolherbst>
available) independent implementations of the underlying API from different vendors (if available), at
<karolherbst>
least one of which must be hardware accelerated."
<zmike>
sounds like same as GL
<karolherbst>
yeah, but different vendors :)
<karolherbst>
so two mesa drivers won't cut it
<karolherbst>
I guess?
<zmike>
pretty sure they will
<karolherbst>
dunno
<karolherbst>
I can ask
<zmike>
but again, ask neil
<karolherbst>
anyway... I should upstream some of the patches, we can leave the vec8/16 lowering out a bit, because that's going to be pain to get right...
<karolherbst>
but seems like my current attempt works pretty well
An0num0us has joined #dri-devel
<airlied>
Lynne: was waiting for Benjamin to update the tags, but I suppose I can do that
<gfxstrand>
Modern AMD/NV/Intel hardware really is built for userspace submit
<gfxstrand>
Even modern Mali tries
<DemiMarie>
And from my (Qubes) perspective userspace submit is absolutely awful
<gfxstrand>
Eh, yes and no.
<gfxstrand>
userspace submit doesn't really increase the surface that much
<DemiMarie>
Can you elaborate?
<DemiMarie>
Because this is something that has worried me for quite a while.
<gfxstrand>
The moment you turn on the GPU, you have a massive pile of hardware that's potentially an attack surface.
<gfxstrand>
The thing that saves you is page tables. The kernel controls those and every bit of GPU memory access goes through them.
<DemiMarie>
So basically the same as a CPU?
<Lynne>
airlied: thanks, the ffmpeg side of the patch fixes nvidia, but I'd rather have at least one fully working implementation at any one time
<gfxstrand>
DemiMarie: Pretty much.
<DemiMarie>
gfxstrand: from my perspective, the big difference is how much of the GPU firmware is exposed to untrusted input
<gfxstrand>
Yeah, so that's the thing. How much untrusted input are we talking about? On something like NVIDIA, it's just a bunch of 64b pushbuf packets with maybe another type of packet for signaling a semaphore.
<gfxstrand>
Those themselves are probably pushbufs (I'm not actually sure)
<DemiMarie>
Interesting
<gfxstrand>
On Intel, it's the same command streamer that is used for processing graphics commands. It just has a 3 level call/return stack.
<gfxstrand>
You can put 3D rendering commands in level0 if you want.
<gfxstrand>
IDK how AMD's works.
<gfxstrand>
The only additional thing you're really exposing is the doorbells.
<DemiMarie>
Does that mean that on Intel there is no difference at all attack surface wise?
<DemiMarie>
Doorbells are actually a problem for virtualization.
<gfxstrand>
Well, it depends a bit on the doorbell situation.
<gfxstrand>
If all you expose is the ring itself and you leave the kernel in charge of doorbells, then I don't think the attack surface is different.
<DemiMarie>
You can’t expose doorbells (or any MMIO) to untrusted guests due to hardware bugs on Intel
<gfxstrand>
Sure
<gfxstrand>
But by "doorbells" I think I mean a regular page of RAM.
<DemiMarie>
Specifically to fully untrusted guests — a guest with access to MMIO can hang the host
<gfxstrand>
There's details here I'm missing.
<gfxstrand>
If you expose doorbells (I don't remember most of the details), then you're opening up a bit more surface but probably not much.
<DemiMarie>
How could it be a regular page of RAM? Does the firmware just poll for input on its various queues without relying on interrupts?
<gfxstrand>
I think so
<gfxstrand>
But, again, I don't remember all the details.
<DemiMarie>
That’s the other part that is potentially problematic, where by “problematic” I mean “scary”
<DemiMarie>
The GPU drivers are open source software and at least some of them (Intel IIRC) are being fuzzed continuously by Google for ChromeOS.
<DemiMarie>
So while I can’t guarantee that it is perfect, it likely at least has no blatantly obvious flaws.
<gfxstrand>
Where Intel gets scary is the FW interop queues. That's where you can send actual commands to the firmware to do various things. IDK if those are ever expected to be exposable to userspace or if they're always privileged. (Details of GuC interaction are not my strong suit.) If they are, then that's an actual attack surface. If all that is marshalled through the kernel or hypervisor, then I think
<gfxstrand>
you're fine.
<DemiMarie>
Are they performance critical?
<DemiMarie>
Hopefully they are always privileged.
<gfxstrand>
No, I don't think they're perf critical.
<DemiMarie>
Then keeping them privileged should be fine.
<sima>
the guc queues are for less privileged guest kernel drivers, but not for userspace
<DemiMarie>
Is GPU firmware any scarier than CPU microcode?
<gfxstrand>
That's what you have to use to suspend a queue if you need to swap out memory or something.
<sima>
so still attack surface
<gfxstrand>
And when you run out of doorbells, you also have to use them.
<DemiMarie>
sima: depends on whether you are using SR-IOV or virtio-GPU native contexts
<sima>
DemiMarie, well for sriov
<DemiMarie>
sima: yeah
<DemiMarie>
native contexts have the advantage that they could work on AMD and Nvidia hardware too
guru__ has joined #dri-devel
<gfxstrand>
If you're doing a virtio-GPU type thing then you would just expose the ring and maybe a doorbell page to userspace and pass that straight through to the userspace in the VM.
benjaminl has quit [Ping timeout: 480 seconds]
<gfxstrand>
Suddenly the only thing any kernel has to get involved with is page migration.
<DemiMarie>
gfxstrand: interesting
<DemiMarie>
The native contexts I was thinking of are basically ioctl-level passthrough
<DemiMarie>
<gfxstrand> If you're doing a virtio-GPU type thing then you would just expose the ring and maybe a doorbell page to userspace and pass that straight through to the userspace in the VM.
<DemiMarie>
That’s an interesting idea.
<DemiMarie>
sima: do the GPU vendors do any sort of internal security reviews of their firmware?
<DemiMarie>
gfxstrand: thanks, this makes me feel better
guru_ has quit [Ping timeout: 480 seconds]
<DemiMarie>
Honestly, GPU hardware is actually quite a bit nicer than CPU hardware in that it doesn’t have to have obscene amounts of speculative execution to get good performance.
<DemiMarie>
Nor does it need to remain compatible with 20-year-old binaries.
<gfxstrand>
heh
<bcheng>
airlied: Lynne: ah give me one moment I can update the tags
* DemiMarie
tries to figure out if this is the first part of a longer series of comments
<HdkR>
DemiMarie: Don't worry, ARM hardware is fine with cutting out 20 years of cruft
<DemiMarie>
HdkR: and that is a good thing, but they still have the speculation issues
<airlied>
bcheng: oh cool I haven't moved much yet
<DemiMarie>
overall, how does GPU driver attack surface compare to CPU hypervisor attack surface?
<sima>
DemiMarie, not sure I should talk about that :-)
<DemiMarie>
sima: why?
<sima>
I work for such a vendor ...
<sima>
pretty sure I'm breaking all the rules if I disclose anything about the hw security process here
<bcheng>
airlied: I added some new stuff since your last review, including moving some stuff out of gallium into util in case you missed it.
<sima>
like pretty much everything is cool, except hw security issues are absolutely no fun or the CEO will know your name if you screw up
<DemiMarie>
sima: okay, fair
<DemiMarie>
so I will phrase things this way: to what degree would giving a guest access to GPU acceleration make life easier for an attacker?
<bcheng>
airlied: it's pretty trivial stuff, but not sure if someone else needs to ack it or something
<airlied>
my assigning it to marge is implicitly ack enough :)
<DemiMarie>
and can I trust that the company you work for would take a security problem with their GPUs (be it hardware or firmware) just as seriously as they would take a problem with their CPUs?
<airlied>
but I'll reread it now
<DemiMarie>
because those two things are ultimately what my users care about sima.
<airlied>
bcheng: just missed a license header on the new util c file
<sima>
intel's fixed a ton of device hw security bugs too over the past years (starting with the cpu smeltdown mess more or less), anything more is way beyond my pay grade
<sima>
(i.e. don't ask more questions, I wont answer them anyway)
<DemiMarie>
fair
<bcheng>
airlied: ah ok, and I assume because its copy-pasted I should just copy the license from there?
fab has quit [Quit: fab]
<sima>
(the device hw security fixes have been in a pile of press announcements, otherwise I couldn't tell you about those either)
<DemiMarie>
sima: “the guc queues are for less privileged guest kernel drivers, but not for userspace” tells me enough
<DemiMarie>
(in a good way)
<DemiMarie>
“(i.e. don't ask more questions, I wont answer them anyway)” — thanks for making this explicit
<sima>
yeah like I said, the rules around hw security bugs are extremely strict, can't even talk about them freely internally
<DemiMarie>
yikes
<sima>
if you're not on the list of people who know for a specific hw bug, you don't get to talk about it
crabbedhaloablut has quit []
Danct12 has quit [Quit: A-lined: User has been AVIVA lined]
<DemiMarie>
what matters most for me (and likely for my users) is that the GPU is considered a defended security boundary, which it obviously is
<DemiMarie>
(and to be clear: I do not expect a reply sima, so do not feel bad for being unable to give one)
<DemiMarie>
Also, if I made you uncomfortable, I apologise.
junaid has quit [Remote host closed the connection]
<DemiMarie>
sima: thank you for what you did say, it really does give me confidence in Intel’s handling of hardware vulns.
gouchi has quit [Quit: Quitte]
Danct12 has joined #dri-devel
benjaminl has joined #dri-devel
Surkow|laptop has quit [Quit: 418 I'm a teapot - NOP NOP NOP]
Surkow|laptop has joined #dri-devel
<Lynne>
bcheng: just copy paste the license and keep the headers the same
rosasco has joined #dri-devel
Duke`` has quit [Ping timeout: 480 seconds]
sima has quit [Ping timeout: 480 seconds]
rosasco has quit [Remote host closed the connection]
Haaninjo has quit [Quit: Ex-Chat]
guru__ has quit [Ping timeout: 480 seconds]
An0num0us has quit [Remote host closed the connection]