ChanServ changed the topic of #dri-devel to: <ajax> nothing involved with X should ever be unable to find a bar
guru__ has joined #dri-devel
<HdkR> ooo, rusticl on Zink. That's a powerful OpenCL driver right there
nchery is now known as Guest318
nchery has joined #dri-devel
<DavidHeidelberg[m]> I wonder if some highend nvidia with Zink -> rusticl would beat radeonsi in LLM
guru_ has quit [Ping timeout: 480 seconds]
bxkxknxnxmxkbnmm has joined #dri-devel
bxkxknxnxmxkbnmm has quit []
Guest318 has quit [Ping timeout: 480 seconds]
nchery has quit [Quit: Leaving]
bxkxknxnxmxkbnmm has joined #dri-devel
copper has joined #dri-devel
copper is now known as Guest322
Guest322 has quit []
<alyssa> HdkR: ruzticl! \ o /
<HdkR> :D
<zmike> no, no, no, you know the z has to replace a vowel
<airlied> rustzcl
<idr> rustizl fo shizzle
benjaminl has quit [Ping timeout: 480 seconds]
<zmike> let's just shorten to rzl
youmukonpaku133 has joined #dri-devel
youmukonpaku133 has quit []
youmukonpaku133 has joined #dri-devel
<DemiMarie> Is it just me or is it rather silly for Zink to convert NIR to SPIR-V only for the Vulkan frontend to convert SPIR-V back to NIR again? It seems like there should be some sort of Mesa-only extension that allows short-circuiting this when Mesa is also the Vulkan implementation.
<youmukonpaku133> that certainly does sound very silly to do
bxkxknxnxmxkbnmm has quit []
<zmike> it's been a topic of multiple blog posts
<alyssa> rzstzcl
kzd has joined #dri-devel
<DemiMarie> zmike: any links?
<youmukonpaku133> DemiMarie: supergoodcode.com youll find it there
co1umbarius has joined #dri-devel
columbarius has quit [Ping timeout: 480 seconds]
vignesh has joined #dri-devel
<kisak> DemiMarie: doesn't sound silly to me when game devs can bake zink into their game, and shortly after the NIR running inside the game isn't ABI compatible with the NIR in mesa.
<DemiMarie> kisak: what about when Zink is built as part of Mesa to support PowerVR? There the ABI _is_ guaranteed to match, so the round trip really is useless.
benjaminl has joined #dri-devel
yyds has joined #dri-devel
benjaminl has quit [Ping timeout: 480 seconds]
<HdkR> 14
chaserhkj has joined #dri-devel
<chaserhkj> umm, I might be moving too fast in posting this to the issue tracker: https://gitlab.freedesktop.org/mesa/mesa/-/issues/9685 is this appropriate?
crcvxc has quit [Read error: Connection reset by peer]
<chaserhkj> also, can anyone confirm if we really don't have such bypasses in mesa?
bbrezillon has joined #dri-devel
bbrezill1 has quit [Ping timeout: 480 seconds]
ickle_ has joined #dri-devel
ickle has quit [Read error: Connection reset by peer]
yuq825 has joined #dri-devel
Guest279 is now known as nchery
youmukonpaku133 has quit [Read error: Connection reset by peer]
youmukonpaku133 has joined #dri-devel
Haaninjo has joined #dri-devel
kxkamil2 has joined #dri-devel
kxkamil has quit [Ping timeout: 480 seconds]
Haaninjo has quit [Quit: Ex-Chat]
aravind has joined #dri-devel
frankbinns1 has joined #dri-devel
JohnnyonFlame has joined #dri-devel
frankbinns has quit [Ping timeout: 480 seconds]
fab has joined #dri-devel
bgs has quit [Remote host closed the connection]
sgruszka has joined #dri-devel
kzd has quit [Ping timeout: 480 seconds]
sima has joined #dri-devel
bmodem has joined #dri-devel
camus has joined #dri-devel
camus1 has quit [Ping timeout: 480 seconds]
fab has quit [Quit: fab]
Duke`` has joined #dri-devel
mripard has joined #dri-devel
crabbedhaloablut has joined #dri-devel
Company has quit [Quit: Leaving]
pcercuei has joined #dri-devel
alanc has quit [Remote host closed the connection]
alanc has joined #dri-devel
Duke`` has quit [Ping timeout: 480 seconds]
Leopold_ has quit [Ping timeout: 480 seconds]
sghuge has quit [Remote host closed the connection]
sghuge has joined #dri-devel
Duke`` has joined #dri-devel
tristianc67 has joined #dri-devel
tristianc6 has quit [Ping timeout: 480 seconds]
frieder has joined #dri-devel
JohnnyonFlame has quit [Read error: Connection reset by peer]
<cmarcelo> anyone know if there is a particular reason vkrunner doesn't vendor vulkan headers (like mesa does)? it does make updating make-*.py scripts particularly hard, since they need to support an unspecified range of vulkan header versions...
fab has joined #dri-devel
milek7 has quit [Remote host closed the connection]
donaldrobson has joined #dri-devel
<sima> airlied, can you ping me when you've done the msm pr and backmerge?
<sima> so I can rebase the drm-ci branch
<sima> robclark, koike ^^ if you have any updates for result files please send an incremental patch
lynxeye has joined #dri-devel
swalker__ has joined #dri-devel
swalker_ has joined #dri-devel
hansg has joined #dri-devel
swalker_ is now known as Guest362
milek7 has joined #dri-devel
swalker__ has quit [Ping timeout: 480 seconds]
tristianc670 has joined #dri-devel
milek7 has quit [Remote host closed the connection]
tristianc67 has quit [Ping timeout: 480 seconds]
tristianc6701 has joined #dri-devel
bmodem has quit [Ping timeout: 480 seconds]
dviola has left #dri-devel [WeeChat 4.0.3]
jkrzyszt has joined #dri-devel
dviola has joined #dri-devel
tristianc6704 has joined #dri-devel
youmukonpaku133 has quit [Read error: Connection reset by peer]
youmukonpaku133 has joined #dri-devel
tristianc670 has quit [Ping timeout: 480 seconds]
tristianc6701 has quit [Ping timeout: 480 seconds]
bmodem has joined #dri-devel
youmukonpaku133 has quit [Ping timeout: 480 seconds]
youmukonpaku133 has joined #dri-devel
<airlied> sima: oh yeah I was going to do that, put it on top for tmrw
cmichael has joined #dri-devel
<eric_engestrom> alyssa: for dEQP-EGL.functional.create_context.no_config IIRC it's a broken test, it assumes that if any GLES version is supported then all of them must be
<eric_engestrom> I think there's an issue opened in the khronos tracker already, let me have a look
<eric_engestrom> as for dEQP-EGL.functional.create_context.rgb888_no_depth_no_stencil it doesn't ring a bell, I'll have a look when I have some time
vliaskov has joined #dri-devel
<eric_engestrom> alyssa: https://gitlab.khronos.org/Tracker/vk-gl-cts/-/issues/3816 (reported by anholt)
crcvxc has joined #dri-devel
milek7 has joined #dri-devel
phasta has joined #dri-devel
milek7 has quit [Remote host closed the connection]
oneforall2 has joined #dri-devel
Duke`` has quit [Ping timeout: 480 seconds]
guru__ has quit [Ping timeout: 480 seconds]
youmukonpaku133 has quit [Ping timeout: 480 seconds]
aravind has quit [Ping timeout: 480 seconds]
qyliss has quit [Quit: bye]
qyliss has joined #dri-devel
qyliss has quit [Quit: bye]
qyliss has joined #dri-devel
fab has quit [Quit: fab]
Duke`` has joined #dri-devel
Guest211 has quit []
Danct12 has joined #dri-devel
TMM has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
TMM has joined #dri-devel
<alyssa> eric_engestrom: Alright. How should we proceed then?
milek7 has joined #dri-devel
mripard has quit [Quit: mripard]
Nefsen402 has quit [Remote host closed the connection]
bl4ckb0ne has quit [Remote host closed the connection]
emersion has quit [Remote host closed the connection]
Nefsen402 has joined #dri-devel
bl4ckb0ne has joined #dri-devel
emersion has joined #dri-devel
fab has joined #dri-devel
gawin has joined #dri-devel
YuGiOhJCJ has joined #dri-devel
mripard has joined #dri-devel
<karolherbst> soooo... nir_op_vec8 and nir_op_vec16 lowering. `nir_lower_alu_width` doesn't lower them on purpose and I was wondering what would be the best idea to deal with those
<karolherbst> I can also just lower them in rusticl though
<eric_engestrom> alyssa: for dEQP-EGL.functional.create_context.no_config I don't think there's a way to query which GLES version is supported by a given driver, so I'm inclined to consider the whole test invalid and it should be deleted
<eric_engestrom> but that's as far as I got when I looked into that a long time ago, maybe I missed something and it's possible to make this test valid
<alyssa> karolherbst: you shouldn't be getting them at all, if you lower width of all ALU and memory and then run opt_dce
<karolherbst> yeah.. that's what I thought, but I still have them.. but maybe it's just bad pass ordering... I'll deal with it once I've dealt with other vulkan validation errors :D
<alyssa> karolherbst: Still have them in what context?
<alyssa> What is reading them?
<alyssa> Oh, one other thing, you need to copyprop too
<karolherbst> yeah.. I think I do all of this, might just be in the wrong order
<karolherbst> or something weird
<karolherbst> or maybe something I've done fixed it now... I have to find a test again which is running into this
<alyssa> yeah
<alyssa> This should work
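A minimal sketch of the ordering being described here (not from the log; vectorize_cb and cb_data stand in for whatever width callback the driver passes): nir_lower_alu_width deliberately leaves nir_op_vec8/vec16 alone, but once every reader has been narrowed, copy-prop plus DCE should delete the wide vecs themselves.

    /* hypothetical driver finalization snippet (Mesa NIR API) */
    bool progress = false;
    NIR_PASS(progress, nir, nir_lower_alu_width, vectorize_cb, cb_data);
    NIR_PASS(progress, nir, nir_copy_prop);
    NIR_PASS(progress, nir, nir_opt_dce);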
chaserhkj has left #dri-devel [#dri-devel]
Leopold has joined #dri-devel
yyds has quit [Remote host closed the connection]
Duke`` has quit [Ping timeout: 480 seconds]
<zamundaaa[m]> robclark: btw what happened to the dma fence deadline stuff?
pekkari has joined #dri-devel
bmodem has quit [Ping timeout: 480 seconds]
bmodem has joined #dri-devel
bmodem has quit [Remote host closed the connection]
bmodem has joined #dri-devel
greenjustin has joined #dri-devel
psykose has joined #dri-devel
<Lynne> DavidHeidelberg[m]: 250 tokens, 26 seconds for rusticl on zink on a 6000 ada, 19 seconds on 6900XTX (rusticl on radeonsi)
greenjustin has quit [Remote host closed the connection]
greenjustin has joined #dri-devel
<austriancoder> alyssa: which nir pass lowers vec16, vec8 to vec4?
<alyssa> none
<alyssa> but if you lower everything that reads vec16 into smaller things, then there's nothing left to write vec16 either
youmukonpaku133 has joined #dri-devel
<austriancoder> ah..
bmodem has quit [Ping timeout: 480 seconds]
<robclark> zamundaaa[m]: all the non-uabi stuff landed.. the uabi stuff is just waiting for someone to implement some userspace support
<zamundaaa[m]> I can take care of that
youmukonpaku133 has quit [Ping timeout: 480 seconds]
youmukonpaku133 has joined #dri-devel
<emersion> gfxstrand: would VK_EXT_host_image_copy be useful for a wayland compositor?
crcvxc has quit [Read error: Connection reset by peer]
Duke`` has joined #dri-devel
alyssa has left #dri-devel [#dri-devel]
<Lynne> I can't think of a use for it, besides less blocky image downloads (copy to optimally tiled host image, convert layout on the CPU?)
<emersion> i think the big win would be to get rid of the staging buffer on setups where that makes sense
<emersion> this ext looks a lot like OpenGL's "here's a pointer to my host memory, plz do whatever to upload to GPU"
<Lynne> you can do that already by host mapping
<emersion> you mean VK_EXT_external_memory_host?
<Lynne> yes
<Lynne> you don't even need page size alignment for the source pointer
<emersion> i never managed to fully understand how this ext works
<Lynne> alloc a buffer, and allocate memory for it, only instead of allocating, chain a struct to use a host memory pointer
<emersion> i got some stride alignment issues iirc, and no way to discover the required alignment
<emersion> or something
<Lynne> that then allows you to let the GPU copy
<Lynne> nnnope
<emersion> would need to check again
<Lynne> you can workaround all alignment issues
<emersion> also VK_EXT_external_memory_host might required pinned memory when VK_EXT_host_image_copy doesn't?
<emersion> not sure
<emersion> require*
<Lynne> pinned memory?
<Lynne> host_image_copy is slower, not sure why you'd want to use it
youmukonpaku133 has quit [Ping timeout: 480 seconds]
<karolherbst> mhh
<karolherbst> so the vec8 thing I'm seeing looks like this:
<karolherbst> 8x8 %66 = vec8 %51, %53, %55, %57, %59, %61, %63, %65
<karolherbst> 8x4 %67 = iadd %49.abcd, %66.abcd
<karolherbst> 8x4 %68 = iadd %49.efgh, %66.efgh
<emersion> the host memory region can't be evicted from RAM while the driver does the upload, so that makes the kernel unhappy
<emersion> i'm not sure about the details
<Lynne> it's a mapping, not a buffer
<Lynne> it's not even a cached mapping
<emersion> VK_EXT_external_memory_host is not a mapping, it's "here's my host pointer, please make a VkBuffer from it", no?
youmukonpaku133 has joined #dri-devel
<Lynne> yes, it's up to how the driver implements it
<Lynne> as for alignment, like I said, although the extension requires the source pointer to be aligned, you can simply pick any arbitrary address nearby which is aligned, then during the copy buffer to image command, you can offset the start point
Company has joined #dri-devel
<Lynne> both libplacebo and ffmpeg have been doing this for years with no bugs on any platform
<haasn> yeah it's UB but it works on all platforms I care about
<austriancoder> karolherbst: If you find a solution.. tell me
<austriancoder> 32x16 %76 = vec16 %19, %22, %25, %28, %31, %35, %39, %43, %47, %51, %55, %59, %63, %67, %71, %75
<austriancoder> 32x4 %82 = fmax %76.abcd, %79.xxxx
<austriancoder> 32x4 %83 = fmax %76.efgh, %79.xxxx
<austriancoder> 32x4 %84 = fmax %76.ijkl, %79.xxxx
<austriancoder> 32x4 %85 = fmax %76.mnop, %79.xxxx
<karolherbst> I don't, I think it's just not supported doing this on a vectorized shader
<emersion> Lynne: but how do you find out what is the required alignment?
<emersion> oh no i remember now
<emersion> VkPhysicalDeviceExternalMemoryHostPropertiesEXT.minImportedHostPointerAlignment exists
<karolherbst> I mean.. what should copy_prop do in this case?
<emersion> however
<karolherbst> it all neatly works once I just scalarize alus
<emersion> there is no way to communicate the stride when importing
<emersion> VkImportMemoryHostPointerInfoEXT just has a host pointer and that's it
<austriancoder> karolherbst: for my vec4 hw I would love to have 4x vec4 instead of the one vec16
<karolherbst> austriancoder: anyway, I kinda prefer dealing with all of this in the frontend; unless you have other use cases for vec8/16 you can probably start dropping the code
<karolherbst> yeah...
<karolherbst> zink also doesn't want to scalarize
<Lynne> emersion: you communicate the stride during the buffer copy command
<karolherbst> I suspect we'll have to handle vec8/vec16
<Lynne> same as usual
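A rough sketch of the import path Lynne describes, assuming VK_EXT_external_memory_host; all handles and the shm_ptr/shm_size/align/stride variables are illustrative, error handling and the bufferOffset alignment rules of vkCmdCopyBufferToImage are glossed over:

    #include <vulkan/vulkan.h>

    /* Round the shm pointer down to minImportedHostPointerAlignment, import
     * the aligned range, and add the slack back as bufferOffset in the copy;
     * the row stride goes in bufferRowLength as usual. */
    uintptr_t    addr  = (uintptr_t)shm_ptr;
    uintptr_t    base  = addr & ~((uintptr_t)align - 1);
    VkDeviceSize slack = addr - base;
    VkDeviceSize import_size = (slack + shm_size + align - 1) & ~((VkDeviceSize)align - 1);

    VkImportMemoryHostPointerInfoEXT import = {
        .sType = VK_STRUCTURE_TYPE_IMPORT_MEMORY_HOST_POINTER_INFO_EXT,
        .handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT,
        .pHostPointer = (void *)base,
    };
    VkMemoryAllocateInfo alloc = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .pNext = &import,
        .allocationSize = import_size,
        .memoryTypeIndex = host_mem_type,   /* picked from vkGetMemoryHostPointerPropertiesEXT */
    };
    vkAllocateMemory(device, &alloc, NULL, &mem);
    vkBindBufferMemory(device, buf, mem, 0);   /* buf created with size import_size */

    VkBufferImageCopy region = {
        .bufferOffset = slack,                 /* "offset the start point" */
        .bufferRowLength = stride_pixels,      /* stride is communicated here */
        .imageSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1 },
        .imageExtent = { width, height, 1 },
    };
    vkCmdCopyBufferToImage(cmd, buf, image, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 1, &region);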
<austriancoder> karolherbst: I would love it if you could do it in the frontend (rusticl)
<emersion> eh
<emersion> that's right
<karolherbst> I already have a patch
<karolherbst> that should make almost everything work
<emersion> last time i looked at this i was a vulkan beginner :S
<karolherbst> there are just a handful of tests where it doesn't
<emersion> okay, i'll give it another go
<emersion> Lynne: also, for a wayland compositor which doesn't want to block, offloading host memory copies to a separate thread would make sense right?
<Lynne> for downloads, which are required to be synchronous, yes, otherwise for uploads, no, it's the GPU doing them
<Lynne> (not that they're required to be synchronous, it's just that most things you'd like to do with a buffer want it to be fully written already, if you carry around a fence it's not a problem)
<Lynne> you can look at the ffmpeg code as a reference on importing: https://github.com/cyanreg/FFmpeg/blob/vulkan/libavutil/hwcontext_vulkan.c#L3424-L3477
<emersion> oh very nice
<emersion> Lynne: why are downloads required to be synchronous?
<emersion> can't you queue the copy to the host VkBuffer?
<emersion> and then carry on?
<Lynne> they're not, but if the thing on the other end doesn't understand fences, you have to wait
<Lynne> for wayland, it's not a problem
<emersion> so, the thing on the other side is just my own code, so that's probably fine
<Lynne> sure
<emersion> i basically want a VkSemaphore which i can export as a sync_file and integrate in my event loop
<emersion> to wake a callback when the copy is done
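A hedged sketch of that, assuming the semaphore was created exportable as a sync_file (VkExportSemaphoreCreateInfo with the SYNC_FD handle type) and already has a signal operation submitted from the copy; copy_done_sem and device are illustrative names:

    /* Export the binary semaphore as a sync_file fd for the event loop. */
    VkSemaphoreGetFdInfoKHR get_fd = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
        .semaphore = copy_done_sem,
        .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
    };
    int fd = -1;
    if (vkGetSemaphoreFdKHR(device, &get_fd, &fd) == VK_SUCCESS) {
        /* add fd to the event loop; it polls readable once the copy is done,
         * which is where the "copy finished" callback would fire */
    }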
youmukonpaku133 has quit [Ping timeout: 480 seconds]
<Lynne> by the way, if you try to import exported memory, vkGetMemoryHostPointerPropertiesEXT will succeed, but buffer creation will fail
<Lynne> so you should deal with failure there
<emersion> import exported memory?
<austriancoder> karolherbst: thx
<Lynne> yes, if earlier on some code maps GPU memory to a pointer, and you're passed that pointer and try to map it, it will fail
<emersion> what does that mean exactly? the ext only does import, not export
youmukonpaku133 has joined #dri-devel
<karolherbst> but anyway.. I don't see a solution besides handling vec8/16 explicitly. Maybe we need a flag for nir_lower_alu_width to toggle it
<Lynne> I've been asking for someone here to fix vkGetMemoryHostPointerPropertiesEXT finally, but no one seems interested :(
<karolherbst> and then if you scalarize anyway you just disable vec8/16 handling
<emersion> okay, basically, if the pointer i pass to VK_EXT_external_memory_host is "bad", then the VkBuffer creation fails
<emersion> and i can fallback to what i'm doping right now
<emersion> doing*
<emersion> good to know
pekkari has quit [Read error: Connection reset by peer]
<karolherbst> or we handle it differently.. dunno
<karolherbst> I think the issue is, that touching vec8/16 there (or any vecs) is that it messes around with vectorize load/stores
<emersion> eh i just realized i don't really need to plumb callbacks in my API even
<emersion> i can just wait on the fence GPU-side when the texture is first sampled from
<emersion> i do need to internally have a callback to release the shm buffer
pekkari has joined #dri-devel
pekkari has quit []
<Lynne> you just have to make sure the buffer exists from the time of the call to the point you wait on the fence
<emersion> yup
youmukonpaku133 has quit [Read error: Connection reset by peer]
youmukonpaku133 has joined #dri-devel
benjaminl has joined #dri-devel
youmukonpaku133 has quit [Read error: Connection reset by peer]
youmukonpaku133 has joined #dri-devel
yuq825 has left #dri-devel [#dri-devel]
Haaninjo has joined #dri-devel
<gfxstrand> emersion: For SHM clients, sure.
<gfxstrand> It'd actually work pretty swell for SHM
kts has joined #dri-devel
yyds has joined #dri-devel
<emersion> gfxstrand: VK_EXT_external_memory_host should probably be preferred when available?
<emersion> VK_EXT_external_memory_host, then VK_EXT_host_image_copy, then regular two-stage copy with staging buffer?
kzd has joined #dri-devel
benjaminl has quit [Ping timeout: 480 seconds]
YuGiOhJCJ has quit [Quit: YuGiOhJCJ]
<karolherbst> gfxstrand: btw, any ideas on how to deal with vec8/16 ops in vectorized nirs? I was playing around with lowering all vec16/vec8 to vec4 before passing it into drivers (or lower to scalar if they request this), but there are some leftovers like this: https://gist.githubusercontent.com/karolherbst/3fda5dd969d2b254b7865458b7d2092b/raw/7897b482743d65433435910d56c5d6b2bb8a149e/gistfile1.txt
<robclark> zamundaaa[m]: ok, cool.. I expect the uabi patches should still apply cleanly but lmk if you need me to re-post
<zamundaaa[m]> robclark I'm not 100% sure which patches I need to pick. Do you have a branch somewhere to make it easier to test?
<Lynne> emersion: if you want three separate versions, but I don't think host_image_copy is popular enough yet, and pretty much everything implements host mapping
<emersion> yeah
Duke`` has quit [Ping timeout: 480 seconds]
gawin has quit [Ping timeout: 480 seconds]
greenjustin_ has joined #dri-devel
<gfxstrand> emersion: Which is optimal is going to depend on the hardware and the behavior of the app. If it's changing every frame and you can sample directly from it, VK_EXT_external_memory_host is probably faster. If it's going to sit there for a couple of frames, host copy is probably faster.
<emersion> gfxstrand: by VK_EXT_external_memory_host, i don't mean sample directly from the host memory VkBuffer
<gfxstrand> You mean CopyBufferToImage?
<emersion> there would still be a copy to a GPU buffer
<gfxstrand> In that case, I'd assume a host image copy is the fastest.
<emersion> wrap my shm client's buffer with a VK_EXT_external_memory_host VkBuffer, then copy that buffer to the final image
<gfxstrand> host buffers have all sorts of potential perf problems.
<emersion> hm, can you elaborate?
<karolherbst> or maybe we should just ask drivers to deal with nir_op_vec8/16 as otherwise we might risk devectorizing load/stores or something.. dunno.. maybe it's not a big deal at this vector size. Maybe I should just add a flag to nir_lower_alu_width indeed, or let the filter deal with this
<emersion> is there a way i can use them to avoid the problems?
<karolherbst> but it's messy
<gfxstrand> karolherbst: Currently, even fully scalar drivers have to deal with vec8/16
<karolherbst> yeah...
<karolherbst> for zink that's just super annoying
<gfxstrand> You can break it up, it's just tricky.
<karolherbst> sure
<karolherbst> but I think `nir_lower_alu_width` is probably the best place for it, no?
<karolherbst> or a special pass?
<gfxstrand> IDK
<gfxstrand> IMO, a Zink-specific pass is probably the way to go
<karolherbst> yeah.. probably
<gfxstrand> Something that replaces any read of vec8/16 with creating a subvec.
<karolherbst> austriancoder: how much pain would it be for you to _only_ deal with nir_op_vec8/16 opcodes and everything else is vec4 at most?
<gfxstrand> Then copy-prop should delete the vec8/16
<gfxstrand> karolherbst: Oh, right... there are vec4 GPUs that exist. *sigh*
<karolherbst> yeah.. it's probably easier to deal with this from the alu op perspective
greenjustin has quit [Ping timeout: 480 seconds]
<karolherbst> yeah....
<gfxstrand> So, yeah, maybe make it general but we probably do want to make it special.
<gfxstrand> For scalar drivers, it's easier to just handle it.
<zamundaaa[m]> robclark here's a simple implementation, which sets the deadline for compositing and direct scanout: https://invent.kde.org/plasma/kwin/-/merge_requests/4358
<karolherbst> I was also thinking about adding a flag to nir_lower_alu_width and then it moves sources into vec4
<karolherbst> or something
<karolherbst> like
<karolherbst> if an alu references a vec8 source, just extract the relevant channels into a vec4 and consume this
<karolherbst> and then hope every vec8/16 gets removed
<karolherbst> which it kinda should
<robclark> zamundaaa[m]: ok, cool, that is just what I was waiting for to re-post..
<karolherbst> anyway, for scalar it already works
<karolherbst> or at least lowering vec8/16 to vec4 works, because the scalarize thing scalarizes stuff :D
<karolherbst> anyway, will play around with nir_lower_alu_width then
<gfxstrand> emersion: Couple of things. One is that it will always live in system RAM and have all the access problems that come with that. In theory a CopyBufferToImage *should* be fast but it also means round-tripping to the kernel to create a BO and kicking something to the GPU and tracking that host image buffer etc. The CopyMemoryToImage case should do a WC write from the CPU and there's no extra
crabbedhaloablut has quit [Remote host closed the connection]
<gfxstrand> resources to track. Both should be able to saturate PCIe, I think. The resulting image should be about as fast, probably.
<austriancoder> karolherbst: not much .. I think .. but I would spend an hour or so on some nir magic
<karolherbst> mhh.. yeah...
<karolherbst> for scalar backends all this stuff is trivial
<gfxstrand> emersion: So I guess IDK which is going to be faster in an absolute sense. Host copies are certainly easier and involve less tracking and less stuff you have to kick to the GPU.
<gfxstrand> Probably depends on where your system is loaded.
<karolherbst> but vectorized things are annoying as you can have random swizzles and that nonsense
phasta has quit [Ping timeout: 480 seconds]
crabbedhaloablut has joined #dri-devel
<austriancoder> yeah..
<karolherbst> though a pass to materialize num_component > 4 sources into their own vec4 shouldn't be tooo hard
<karolherbst> probably pretty straight forward
<karolherbst> I'll write something up
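Something along these lines, perhaps (a hedged sketch of the idea, not the actual patch; it leans on the NIR helpers mentioned in this discussion, and exact names/signatures can differ between Mesa versions):

    /* For every ALU source that reads a >4-component def, build a small vec
     * of just the channels actually referenced and rewrite the source to use
     * it, so copy-prop/DCE can delete the original nir_op_vec8/vec16.
     * Intended to be run via nir_shader_instructions_pass(), followed by
     * nir_copy_prop() and nir_opt_dce(). */
    static bool
    lower_wide_alu_src(nir_builder *b, nir_instr *instr, void *data)
    {
       if (instr->type != nir_instr_type_alu)
          return false;

       nir_alu_instr *alu = nir_instr_as_alu(instr);
       bool progress = false;

       for (unsigned i = 0; i < nir_op_infos[alu->op].num_inputs; i++) {
          if (alu->src[i].src.ssa->num_components <= 4)
             continue;

          unsigned n = nir_ssa_alu_instr_src_components(alu, i);
          b->cursor = nir_before_instr(instr);

          nir_ssa_def *chans[NIR_MAX_VEC_COMPONENTS];
          for (unsigned c = 0; c < n; c++)
             chans[c] = nir_channel(b, alu->src[i].src.ssa, alu->src[i].swizzle[c]);

          nir_src_rewrite(&alu->src[i].src, nir_vec(b, chans, n));
          for (unsigned c = 0; c < n; c++)
             alu->src[i].swizzle[c] = c;   /* the new source is pre-swizzled */
          progress = true;
       }
       return progress;
    }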
<gfxstrand> karolherbst: So the thing I think scalar back-ends want to avoid is having extra vec instructions that get turned into extra movs that it has to copy-prop out later.
<karolherbst> yeah..
<karolherbst> this should absolutely be optional
<gfxstrand> Anything which actually takes a vector, like a texture op or intrinsic will get a vector that's already the right width.
<gfxstrand> They really exist just for weird ALU swizzle cases.
<karolherbst> yep
<gfxstrand> Which I'm now questioning....
<emersion> gfxstrand: hm if i keep the host pointer VkBuffer around, does that help the kernel not re-create the BO over and over again?
<gfxstrand> Like, if all your ALU ops are scalar, why can't copy-prop get rid of the vector...
<gfxstrand> :thinking:
<karolherbst> it works for scalar
<gfxstrand> emersion: Yeah. There is weird locking stuff in the kernel, though, and you'll end up paying that cost on every submit, even submits that aren't your copy (thanks, bindless)
<gfxstrand> emersion: But we're seriously micro-optimizing at that point.
<emersion> i do get that host memory is more work for the driver/kernel, since it's work i'm not doing anymore :P
<emersion> ok
<emersion> i guess i'll just try and make sure we don't regress
<emersion> and see how it goes on my hw collection
<gfxstrand> sounds good
<gfxstrand> But, generally, the host copy path is probably the path you're going to get in a GL driver when you TexSubImage with properly matching formats.
<gfxstrand> (Where by host copy I mean CopyMemoryToImage)
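For reference, the CopyMemoryToImage path being referred to is the VK_EXT_host_image_copy entry point; a hedged sketch (the image must be created with VK_IMAGE_USAGE_HOST_TRANSFER_BIT_EXT and be in a layout the extension allows; the pixel pointer, stride, and extent names are illustrative):

    VkMemoryToImageCopyEXT region = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_TO_IMAGE_COPY_EXT,
        .pHostPointer = pixels,               /* straight from the shm buffer */
        .memoryRowLength = stride_pixels,
        .imageSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1 },
        .imageExtent = { width, height, 1 },
    };
    VkCopyMemoryToImageInfoEXT info = {
        .sType = VK_STRUCTURE_TYPE_COPY_MEMORY_TO_IMAGE_INFO_EXT,
        .dstImage = image,
        .dstImageLayout = VK_IMAGE_LAYOUT_GENERAL,
        .regionCount = 1,
        .pRegions = &region,
    };
    vkCopyMemoryToImageEXT(device, &info);    /* CPU-side copy, no command buffer */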
alyssa has joined #dri-devel
<emersion> right
<alyssa> DavidHeidelberg[m]: FWIW, I disabled the T720 job
<alyssa> It seems that either the T720 devices are down, or they're absolutely overloaded right now, or a combination of the two
<emersion> that's exactly what my mental model was :P "VK_whatever_gl_does"
<emersion> gfxstrand: so this host copy path potentially allocates memory?
<alyssa> marge pipeline had t720 waiting for a device for 45+ minutes without making progress, so I pushed the disable.
<emersion> or would a driver that needs to allocate memory just not implement the ext?
<alyssa> that marge pipeline was lost (reassigned, we'll try this again) but no other collateral damage and hopefully the rest of the collabora farm is ok
<gfxstrand> emersion: No, it shouldn't allocate memory.
<gfxstrand> Well, it might need to malloc() a bit depending on $DETAILS
<emersion> ok, good
alyssa has left #dri-devel [#dri-devel]
<gfxstrand> But it shouldn't need to allocate GPU memory
<emersion> sure, i meant memory for buffer
<gfxstrand> And it won't submit anything to the GPU
<emersion> ah, interesting
<emersion> right
<emersion> no VkCommandBuffer
<gfxstrand> On Intel, it'll be an isl_tiled_memcpy()
<gfxstrand> Feel free to go read that eldritch horror if you wish. I cannot guarantee your sanity will be intact on the other end, though. :joy
<gfxstrand> It's even able to use SSE to do BGRA swizzles!
guru_ has joined #dri-devel
<emersion> lol
<gfxstrand> I'm hopeful that the NVK implementation will be substantially more sane but that's still TBD.
<gfxstrand> Some of it is just because tiling is complicated
<gfxstrand> in general
oneforall2 has quit [Ping timeout: 480 seconds]
gawin has joined #dri-devel
gawin has quit []
guru__ has joined #dri-devel
cmichael has quit [Quit: Leaving]
guru_ has quit [Ping timeout: 480 seconds]
youmukonpaku133 has quit [Ping timeout: 480 seconds]
youmukonpaku133 has joined #dri-devel
junaid has joined #dri-devel
alyssa has joined #dri-devel
<alyssa> gfxstrand: I'm not sure what to make of host copy for agx or mali
<alyssa> We have a CPU path for tiled images, but the GL driver almost never uses it
<alyssa> since most of the time we use compressed images, which are accessed via a GPU blit to/from a linear image
<alyssa> the tiled CPU path is used for ASTC/BCn and that's usually it
<alyssa> so I can't think of a situation where host image copy would ever be the right thing to do
<alyssa> for this class of hw
<alyssa> CopyBufferToImage (or vice-versa) or maybe BlitImage with a Linear image will typically be the optimal path
<alyssa> de/compressing on the CPU is a nonstarter
<zmike> VkHostImageCopyDevicePerformanceQueryEXT ?
<alyssa> Sure
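That query chains into the image format properties and reports whether host copies would cost the image its optimal device access (e.g. losing framebuffer compression), which is how this class of hardware could steer apps away from the path; a hedged sketch, with phys_dev and the format/usage as illustrative assumptions:

    VkHostImageCopyDevicePerformanceQueryEXT perf = {
        .sType = VK_STRUCTURE_TYPE_HOST_IMAGE_COPY_DEVICE_PERFORMANCE_QUERY_EXT,
    };
    VkImageFormatProperties2 props = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_FORMAT_PROPERTIES_2,
        .pNext = &perf,
    };
    VkPhysicalDeviceImageFormatInfo2 info = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_IMAGE_FORMAT_INFO_2,
        .format = VK_FORMAT_R8G8B8A8_UNORM,
        .type = VK_IMAGE_TYPE_2D,
        .tiling = VK_IMAGE_TILING_OPTIMAL,
        .usage = VK_IMAGE_USAGE_SAMPLED_BIT | VK_IMAGE_USAGE_HOST_TRANSFER_BIT_EXT,
    };
    vkGetPhysicalDeviceImageFormatProperties2(phys_dev, &info, &props);
    /* perf.optimalDeviceAccess == VK_FALSE means host-copied images give up
     * optimal access (e.g. compression), so prefer the GPU copy path */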
<alyssa> Mostly I'm surprised there's hardware that *doesn't* benefit from using its framebuffer compression for textures
<DavidHeidelberg[m]> alyssa: if not broken completely, always switch it to the manual rules; the `.` is usually forgotten
<alyssa> I don't know how to do that
<DavidHeidelberg[m]> s/.panfrost-midgard-rules/.panfrost-midgard-manual-rules:/ :)
<alyssa> midgard is fine
<alyssa> it's just t720 that needs to be whacked
<alyssa> oh in the extends
<DavidHeidelberg[m]> yup :)
<alyssa> ok that doesn't seem terrible i guess
<alyssa> still a lot more likely to screw it up
Leopold has quit [Remote host closed the connection]
alyssa has left #dri-devel [#dri-devel]
guru__ has quit [Read error: No route to host]
oneforall2 has joined #dri-devel
Leopold has joined #dri-devel
vliaskov has quit [Read error: No route to host]
<daniels> DavidHeidelberg[m]: well alyssa is gone, but https://gitlab.freedesktop.org/krz/mesa/-/jobs/47878513 went through ~immediately, and those boards are extremely under-loaded, so I think it's just some weird temporal issue; perhaps a recurrence of the old issue where gitlab just wasn't handing out jobs when it should
kts_ has joined #dri-devel
<DavidHeidelberg[m]> daniels: I'll revert and run it few times so we're sure it's fine
kts has quit [Ping timeout: 480 seconds]
mripard has quit [Ping timeout: 480 seconds]
benjaminl has joined #dri-devel
hansg has quit [Quit: Leaving]
guru_ has joined #dri-devel
<Lynne> emersion: the overhead of creating a buffer from host memory wasn't significant enough in my tests
<Lynne> on every frame
aravind has joined #dri-devel
<emersion> cool!
oneforall2 has quit [Ping timeout: 480 seconds]
<zmike> DavidHeidelberg[m]: I got another trace for you
<zmike> https://gitlab.freedesktop.org/zmike/dumps/-/blob/master/blender.3-trim.trace hits some unique texture upload mechanics that (apparently) nothing else in ci hits
mripard has joined #dri-devel
<DavidHeidelberg[m]> Nice! I'll check in 2-3 hours ;)
<zmike> it's single frame and pretty small
guru__ has joined #dri-devel
kts_ has quit []
mripard has quit [Quit: mripard]
guru_ has quit [Ping timeout: 480 seconds]
<karolherbst> do we have a proper way to replace alu srcs or would I have to just recreate the alu instruction?
<gfxstrand> karolherbst: What do you mean?
<gfxstrand> nir_src_rewrite() should work. You just have to fixup the swizzle yourself.
frieder has quit [Remote host closed the connection]
<karolherbst> yeah, already found it
<gfxstrand> karolherbst: Okay, I'm now very confused how vec8 ends up in back-ends
<gfxstrand> Maybe it's via phis?
kxkamil2 has quit []
<karolherbst> nah
<karolherbst> it's just silly stuff
<karolherbst> did you check out the shader?
<gfxstrand> Yeah, that one has actual vec4 ALU stuff
<karolherbst> yep
<gfxstrand> But if you don't have vec4 ALU stuff, how does it happen?!?
<karolherbst> well.. the shader had originally vec8 alu ops
<karolherbst> so, nir_lower_alu_width split that up, but we are left with that silly vec8
Guest362 has quit [Remote host closed the connection]
donaldrobson has quit [Ping timeout: 480 seconds]
<gfxstrand> But copy-prop should clean up once every ALU source is scalar
<karolherbst> but they never become scalar
<karolherbst> not for zink at least
<gfxstrand> Well, yeah, Zink is vec4 hardware, effectively.
<gfxstrand> I'm wondering why Intel has support
<karolherbst> yeah
<gfxstrand> I know there was a case
<karolherbst> uhh.. I've added vec8/16 support to a bunch of drivers (or somebody else did) so a lot of drivers are fine
<karolherbst> _but_
<karolherbst> if the driver asks for lower_to_scalar then they get scalar alus
<karolherbst> and copy prop cleans it all up
<gfxstrand> I mean, for RT, I added some spicy wide vec loads for Intel but that's a different issue.
<karolherbst> anyway.. got it working :D
<karolherbst> I'm sure it's super sketchy
aravind has quit [Ping timeout: 480 seconds]
<karolherbst> anyway.. I think it's fair for rusticl to reduce the vecs to vec4 and then we can drop a bunch of vec8/16 code from backends again... or maybe we keep it and drivers can say what they support
<karolherbst> or
<karolherbst> drivers can lower it themselves
<karolherbst> dunno
<karolherbst> for now I just want something which works
junaid has quit [Remote host closed the connection]
<karolherbst> ehh.. I've committed my prints, oops
guru_ has joined #dri-devel
<karolherbst> and it's also broken
<karolherbst> sad
kxkamil has joined #dri-devel
<karolherbst> yooooo
<karolherbst> copy_prop reverses my think
<karolherbst> *thing
<karolherbst> :<
<karolherbst> yeah so this isn't great
yyds has quit [Remote host closed the connection]
<karolherbst> maybe I have to do that later...
jkrzyszt has quit [Ping timeout: 480 seconds]
<karolherbst> and only if the source comes from a vec? mhh...
guru__ has quit [Ping timeout: 480 seconds]
sgruszka has quit [Remote host closed the connection]
guru__ has joined #dri-devel
guru_ has quit [Read error: No route to host]
guru_ has joined #dri-devel
benjamin1 has joined #dri-devel
<karolherbst> Pass 2350 Fails 68 Crashes 36 Timeouts 0: 100% and 60 of those fails are ignorable image tests with NEAREST filtering... I think that's probably as good as it needs to be to focus on cleaning up that MR :D
<karolherbst> (that's on ANV btw)
<gfxstrand> \o/
<gfxstrand> karolherbst: Ugh... copy lowering is going to be a mess.
<karolherbst> radv is a bit worse, but ACO doesn't support any of the CL nonsense, so that's not surprising
<gfxstrand> I think we need a per-stage array of nir_variable_mode which supports indirects.
<gfxstrand> We have so many things which are currently keying off of misc things
<gfxstrand> If nir_lower_var_copies were happening entirely in drivers, it'd be okay, but it's not.
<karolherbst> last radv run was "Pass 2222 Fails 15 Crashes 217 Timeouts 0: 100%", wondering how much that final vec8/16 lowering will help
<karolherbst> lvp is a bit better
<karolherbst> anyway
<karolherbst> good enough on all drivers
<gfxstrand> NIR lowering has started becoming an eldritch horror
<karolherbst> yeah....
<karolherbst> it's a mess
<gfxstrand> s/started becoming/become/
<gfxstrand> It happened a long time ago
youmukonpaku133 has quit []
<karolherbst> I also procrastinated reworking the nir pipeline, because.......
youmukonpaku133 has joined #dri-devel
<karolherbst> it works and I don't want to break it
<gfxstrand> At some point, I should do an audit and make some opinionated choices.
<karolherbst> yeah
<gfxstrand> But ugh.....
<karolherbst> that would help
<karolherbst> :D
<gfxstrand> There's no way I don't break the universe in that process.
<gfxstrand> And I'd much rather write NVK code
<karolherbst> fair enough
guru__ has quit [Ping timeout: 480 seconds]
<karolherbst> you can just comment on bad ordering and I can figure out the details
<gfxstrand> Like WTH is st_nir_builtins.c lowering var copies?!?
<karolherbst> heh
<karolherbst> :D
<gfxstrand> It's worse than that
<gfxstrand> I can't just drop some comments in places.
<karolherbst> there are probably good reasons for it (tm)
<gfxstrand> Oh, I'm sure there is a reason
<gfxstrand> There's a reason for every line of code
<karolherbst> ahh you meant the entire st/mesa stuff?
<gfxstrand> Whether or not it's a good reason is TBD.
<gfxstrand> Oh, I mean EVERYTHING
<karolherbst> uhhh
<gfxstrand> Vulkan, OpenCL, st/mesa, various back-ends. The lot.
<karolherbst> maybe... work on NAK instead and hope we can just drop gl drivers in the future?
<zmike> <gfxstrand> There's a reason for every line of code
<karolherbst> :D
benjaminl has quit [Ping timeout: 480 seconds]
<karolherbst> ahh yeah...
<karolherbst> _maybe_ we just need a better model
<karolherbst> or rather.. some stages of nir where certain things are allowed/forbidden
<karolherbst> but uhhh
<karolherbst> no idea really
<gfxstrand> I think the #1 thing we need is someone with the appropriate credentials to be opinionated.
<gfxstrand> Which basically boils down to that.
<gfxstrand> We may also want some helpers to assert said opinions.
<gfxstrand> I've mostly looked the other way while the st/mesa stuff was built because, well, Ew, GL.
<karolherbst> fair
<gfxstrand> That was maybe a mistake.
<gfxstrand> In any case, I think it's well past time for a cleanup. Sadly, there are very few people who really understand all of NIR and the various APIs in play and back-end needs well enough to do that clean-up.
<gfxstrand> And by "very few" I mean "me, maybe". :joy:
<karolherbst> oh wow.. my vec8/16 patch really helped a lot with lvp... from ~200 crashes down to ~50... but a bunch of fails, but nothing serious
<karolherbst> so I guess what I have kinda works now...
<karolherbst> uhhh
<karolherbst> pain
<karolherbst> I wish it wouldn't be so hacky
<daniels> DavidHeidelberg[m]: huh, is CelShading new on T860? it hangs reliably on your MR …
<Lynne> airlied: could you push !24572?
lynxeye has quit [Quit: Leaving.]
junaid has joined #dri-devel
youmukonpaku133 has quit [Ping timeout: 480 seconds]
<karolherbst> I should probably check if it runs on nvidias driver by now...
<karolherbst> uhhh.. there are things like ieq %44.abcd, %45 (0x0).xxxx which my code doesn't account for.. pain
<karolherbst> that was easy to fix
<anarsoul> what does the GL spec say about an empty fragment shader, i.e. void main(void) {}, without actually writing any outputs?
<gfxstrand> I think you get undefined garbage
<Lynne> speaking of rusticl on nvidia on zink, it runs but it generates junk
<karolherbst> Lynne: try my most recent rusticl/tmp branch and run with RUSTICL_DEBUG=sync
<karolherbst> I won't promise it will work, but it should work better
heat has joined #dri-devel
<karolherbst> something is up with the fencing with zink and rusticl and I still have to figure that out
<karolherbst> but that _might_ be enough to run luxball or something
<karolherbst> ehhh maybe I've broken it.. try rusticl/zink instead :D
<karolherbst> uhhhhhh
<karolherbst> load_const
<karolherbst> uhhhhh
<karolherbst> pain
<karolherbst> now copy_prop reverts my vec4 on load_const..
<karolherbst> what a pain
<karolherbst> I think I need a better model on how to do that lowering
<karolherbst> or I just split load_const
<zmike> anarsoul: in what sense
tristianc6704 has quit [Read error: Connection reset by peer]
<Lynne> karolherbst: yup, everything is correct with RUSTICL_DEBUG=sync on rusticl/zink
<karolherbst> Lynne: but is it faster than nvidia's CL stack
<anarsoul> shaders/tesseract/270.shader_test has a FS with an empty body
<Lynne> what was the envvar to select an opencl device again?
<karolherbst> uhhh...
<zmike> anarsoul: not sure exactly what you're asking
<zmike> it just does nothing
<karolherbst> you can normally do that in apps
<anarsoul> zmike: I'm asking what the expected result of it is. gfxstrand already replied that it's undefined
<zmike> you mean the result in the framebuffer?
<anarsoul> yeah
<karolherbst> do we have a helper to extract a const channel used by an alu instruction?
<Lynne> llama.cpp just picks the first device afaik
<karolherbst> ehh.. maybe I just use nir_build_imm
<zmike> anarsoul: "Any colors, or color components, associated with a fragment that are not written by the fragment shader are undefined."
<zmike> from 4.6 compat spec
<zmike> but this is in the compat language section
<anarsoul> zmike: thanks!
<Lynne> found a workaround, yes, nvidia's opencl stack is faster
<karolherbst> not the answer I wanted to hear :P
<Lynne> around 13 times faster
<karolherbst> uhh
<karolherbst> annoying
<karolherbst> that's quite a bit
<karolherbst> which benchmark did you use? luxball?
<Lynne> llama.cpp
<karolherbst> ahh mayb that's expected
<Lynne> I'll try ffmpeg's nlmeans filter
<karolherbst> nah, most of those benchmarks actually test something else
<karolherbst> the only I know is realibly testing kernel execution speed is luxmark
<karolherbst> the API overhead is..... significant if you do debug builds and it's not very optimized anyway
<karolherbst> mostly interested in raw kernel execution speed, because that's probably the hardest part to get really performant
<karolherbst> and I also don't know if other benchmarks actually validate the result or not
<DavidHeidelberg[m]> <daniels> "David Heidelberg: huh, is..." <- Old, very old. Something is wrong.
<Lynne> okay, ffmpeg's nlmeans_opencl filter is a little bit faster on zink on nvidia!
<karolherbst> yeah, but is the result correct?
<karolherbst> zmike: ^^ you know what to do
<zmike> MICHAEL
<zmike> MICHAEL WHERE IS THE POST
<karolherbst> but being faster with luxball would be impressive
<karolherbst> because that's actually 100% kernel throughput
<karolherbst> and no funny business
<karolherbst> it's basically benchmarking the compiler
<Lynne> looks correct
<karolherbst> the issue with llama.cpp is that it probably uses fp16 and other funky extensions we don't support, and probably matrix stuff and.. uhh.. cuda on nvidia
<karolherbst> so dunno if that's even remotely comparable
<karolherbst> also.. it's heavily optimized for nvidia
<karolherbst> Lynne: nice :)
<karolherbst> okay.. I'm getting close ...
heat_ has joined #dri-devel
<karolherbst> maybe official zink conformance this year would be possible :D
<Lynne> nah, llama.cpp's cl code is pretty meh, pretty much all of it is except cuda
<karolherbst> fair
heat has quit [Remote host closed the connection]
<karolherbst> but did you compare zink vs cuda or cl?
<Lynne> I should try fp16 though, is it supported on zink?
<karolherbst> no
gouchi has joined #dri-devel
<karolherbst> it's not supported in rusticl at all atm
<karolherbst> for.... annoying reasons
<Lynne> I meant through the features flag?
<karolherbst> "Pass 2374 Fails 69 Crashes 11 Timeouts 0: 100%" getting close (and again, like 60 fails won't matter)
<Lynne> correctness is another chapter
<zmike> fp16 is supported in zink
<karolherbst> RUSTICL_FEATURES=fp16
Duke`` has joined #dri-devel
<karolherbst> I'll probably test it on my GPU later this week :D but for now I want it to work on mesa drivers
<Lynne> fp16 output is correct, but it's not any faster
<karolherbst> for conformance over zink, would 3 mesa drivers count or do I have to use different vendors?
<karolherbst> annoying
<karolherbst> maybe it's not using it? maybe it's broken? maybe all of it? I have no idea :)
<zmike> you'd need to see what the conformance requirements are for CL
<karolherbst> it's mostly blocked on libclc not supporting it
<zmike> in GL it's 2
<Lynne> ggml_opencl: device FP16 support: true
<Lynne> it is using it
<karolherbst> ahh
<karolherbst> I suspect the slowness comes from something else
<karolherbst> like the API
<karolherbst> rusticl still is doing a few silly things
<zmike> better drop everything and debug this karol
<karolherbst> but that llama.cpp actually works is kinda neat
<zmike> your p-score is plummeting
<karolherbst> yeah I know
<karolherbst> that's why I want a higher score on luxball than nvidia
<zmike> absolute freefall
<Lynne> out for blood, eh? good luck
<karolherbst> zmike: right.. but it can't be like 2 mesa drivers, can it?
<zmike> if it's a hw implementation then the vendor has to have submitted a conformant driver
<zmike> so like zink+anv would need a conformant intel GL driver
<zmike> from the company itself
<zmike> (iris or proprietary in this case since both are owned by intel)
<karolherbst> sure
<zmike> so yes, it could be mesa
<karolherbst> but would zink+anv and zink+radv be enough or would you have to use a different vk driver
<zmike> if it meets the CL requirements it'd be fine
<karolherbst> okay
<zmike> I only know the GL ones
<zmike> just mail neil and ask, not sure there's been many conformant layered impls
heat_ has quit [Read error: Connection reset by peer]
<zmike> or whoever the CL chair is
heat has joined #dri-devel
<karolherbst> mhhh
<karolherbst> for CL they don't specify much
<karolherbst> I can ask on the next meeting though
<karolherbst> but yeah, neil still is the chair
<karolherbst> ohh wait, they say two
<karolherbst> "For layered implementations, for each OS, there must be Successful Submissions using at least two (if
<karolherbst> available) independent implementations of the underlying API from different vendors (if available), at
<karolherbst> least one of which must be hardware accelerated."
<zmike> sounds like same as GL
<karolherbst> yeah, but different vendors :)
<karolherbst> so two mesa drivers won't cut it
<karolherbst> I guess?
<zmike> pretty sure they will
<karolherbst> dunno
<karolherbst> I can ask
<zmike> but again, ask neil
<karolherbst> anyway... I should upstream some of the patches, we can leave the vec8/16 lowering out a bit, because that's going to be a pain to get right...
<karolherbst> but seems like my current attempt works pretty well
An0num0us has joined #dri-devel
<airlied> Lynne: was waiting for Benjamin to update the tags, but I suppose I can do that
benjaminl has joined #dri-devel
heat has quit [Remote host closed the connection]
benjamin1 has quit [Ping timeout: 480 seconds]
<DemiMarie> gfxstrand: I saw https://lore.kernel.org/dri-devel/CAOFGe94JC8V2GS5L2iCaD9=X-sbbcvrvijck8ivieko=LGBSbg@mail.gmail.com/ and it makes so much more sense why you are pushing for userspace command submission, despite the security concerns.
<gfxstrand> Modern AMD/NV/Intel hardware really is built for userspace submit
<gfxstrand> Even modern Mali tries
<DemiMarie> And from my (Qubes) perspective userpace submit is absolutely awful
<gfxstrand> Eh, yes and no.
<gfxstrand> userspace submit doesn't really increase the surface that much
<DemiMarie> Can you elaborate?
<DemiMarie> Because this is something that has worried me for quite a while.
<gfxstrand> The moment you turn on the GPU, you have a massive pile of hardware that's potentially an attack surface.
<gfxstrand> The thing that saves you is page tables. The kernel controls those and every bit of GPU memory access goes through them.
<DemiMarie> So basically the same as a CPU?
<Lynne> airlied: thanks, the ffmpeg side of the patch fixes nvidia, but I'd rather have at least one fully working implementation at any one time
<gfxstrand> DemiMarie: Pretty much.
<DemiMarie> gfxstrand: from my perspective, the big difference is how much of the GPU firmware is exposed to untrusted input
<gfxstrand> Yeah, so that's the thing. How much untrusted input are we talking about? On something like NVIDIA, it's just a bunch of 64b pushbuf packets with maybe another type of packet for signaling a semaphore.
<gfxstrand> Those themselves are probably pushbufs (I'm not actually sure)
<DemiMarie> Interesting
<gfxstrand> On Intel, it's the same command streamer that is used for processing graphics commands. It just has a 3 level call/return stack.
<gfxstrand> You can put 3D rendering commands in level0 if you want.
<gfxstrand> IDK how AMD's works.
<gfxstrand> The only additional thing you're really exposing is the doorbells.
<DemiMarie> Does that mean that on Intel there is no difference at all attack surface wise?
<DemiMarie> Doorbells are actually a problem for virtualization.
<gfxstrand> Well, it depends a bit on the doorbell situation.
<gfxstrand> If all you expose is the ring itself and you leave the kernel in charge of doorbells, then I don't think the attack surface is different.
<DemiMarie> You can’t expose doorbells (or any MMIO) to untrusted guests due to hardware bugs on Intel
<gfxstrand> Sure
<gfxstrand> But by "doorbells" I think I mean a regular page of RAM.
<DemiMarie> Specifically to fully untrusted guests — a guest with access to MMIO can hang the host
<gfxstrand> There's details here I'm missing.
<gfxstrand> If you expose doorbells (I don't remember most of the details), then you're opening up a bit more surface but probably not much.
<DemiMarie> How could it be a regular page of RAM? Does the firmware just poll for input on its various queues without relying on interrupts?
<gfxstrand> I think so
<gfxstrand> But, again, I don't remember all the details.
<DemiMarie> That’s the other part that is potentially problematic, where by “problematic” I mean “scary”
<DemiMarie> The GPU drivers are open source software and at least some of them (Intel IIRC) are being fuzzed continuously by Google for ChromeOS.
<DemiMarie> So while I can’t guarantee that it is perfect, it likely at least has no blatantly obvious flaws.
<gfxstrand> Where Intel gets scary is the FW interop queues. That's where you can send actual commands to the firmware to do various things. IDK if those are ever expected to be exposable to userspace or if they're always privileged. (Details of GuC interaction are not my strong suit.) If they are, then that's an actual attack surface. If all that is marshalled through the kernel or hypervisor, then I think
<gfxstrand> you're fine.
<DemiMarie> Are they performance critical?
<DemiMarie> Hopefully they are always privileged.
<gfxstrand> No, I don't think they're perf critical.
<DemiMarie> Then keeping them privileged should be fine.
<sima> the guc queues are for less privileged guest kernel drivers, but not for userspace
<DemiMarie> Is GPU firmware any scarier than CPU microcode?
<gfxstrand> That's what you have to use to suspend a queue if you need to swap out memory or something.
<sima> so still attack surface
<gfxstrand> And when you run out of doorbells, you also have to use them.
<DemiMarie> sima: depends on whether you are using SR-IOV or virtio-GPU native contexts
<sima> DemiMarie, well for sriov
<DemiMarie> sima: yeah
<DemiMarie> native contexts have the advantage that they could work on AMD and Nvidia hardware too
guru__ has joined #dri-devel
<gfxstrand> If you're doing a virtio-GPU type thing then you would just expose the ring and maybe a doorbell page to userspace and pass that straight through to the userspace in the VM.
benjaminl has quit [Ping timeout: 480 seconds]
<gfxstrand> Suddenly the only thing any kernel has to get involved with is page migration.
<DemiMarie> gfxstrand: interesting
<DemiMarie> The native contexts I was thinking of are basically ioctl-level passthrough
<DemiMarie> <gfxstrand> If you're doing a virtio-GPU type thing then you would just expose the ring and maybe a doorbell page to userspace and pass that straight through to the userspace in the VM.
<DemiMarie> That’s an interesting idea.
<DemiMarie> sima: do the GPU vendors do any sort of internal security reviews of their firmware?
<DemiMarie> gfxstrand: thanks, this makes me feel better
guru_ has quit [Ping timeout: 480 seconds]
<DemiMarie> Honestly, GPU hardware is actually quite a bit nicer than CPU hardware in that it doesn’t have to have obscene amounts of speculative execution to get good performance.
<DemiMarie> Nor does it need to remain compatible with 20-year-old binaries.
<gfxstrand> heh
<bcheng> airlied: Lynne: ah give me one moment I can update the tags
* DemiMarie tries to figure out if this is the first part of a longer series of comments
<HdkR> DemiMarie: Don't worry, ARM hardware is fine with cutting out 20 years of cruft
<DemiMarie> HdkR: and that is a good thing, but they still have the speculation issues
<airlied> bcheng: oh cool I haven't moved much yet
<DemiMarie> overall, how does GPU driver attack surface compare to CPU hypervisor attack surface?
<sima> DemiMarie, not sure I should talk about that :-)
<DemiMarie> sima: why?
<sima> I work for such a vendor ...
<sima> pretty sure I'm breaking all the rules if I disclose anything about the hw security process here
<bcheng> airlied: I added some new stuff since your last review, including moving some stuff of gallium into util in case you missed it.
<sima> like pretty much everything is cool, except hw security issues are absolutely no fun, or the CEO will know your name if you screw up
<DemiMarie> sima: okay, fair
<DemiMarie> so I will phrase things this way: to what degree would giving a guest access to GPU acceleration make life easier for an attacker?
<bcheng> airlied: it's pretty trivial stuff, but not sure if someone else needs to ack it or something
<airlied> my assigning it to marge is implicitly ack enough :)
<DemiMarie> and can I trust that the company you work for would take a security problem with their GPUs (be it hardware or firmware) just as seriously as they would take a problem with their CPUs?
<airlied> but I'll reread it now
<DemiMarie> because those two things are ultimately what my users care about sima.
<airlied> bcheng: just missed a license header on the new util c file
<sima> intel's fixed a ton of device hw security bugs too over the past years (starting with the cpu smeltdown mess more or less), anything more is way beyond my pay grade
<sima> (i.e. don't ask more questions, I wont answer them anyway)
<DemiMarie> fair
<bcheng> airlied: ah ok, and I assume because it's copy-pasted I should just copy the license from there?
fab has quit [Quit: fab]
<sima> (the device hw security fixes have been in a pile of press announcements, otherwise I couldn't tell you about those either)
<DemiMarie> sima: “the guc queues are for less privileged guest kernel drivers, but not for userspace” tells me enough
<DemiMarie> (in a good way)
<DemiMarie> “(i.e. don't ask more questions, I wont answer them anyway)” — thanks for making this explicit
<sima> yeah like I said, the rules around hw security bugs are extremely strict, can't even talk about them freely internally
<DemiMarie> yikes
<sima> if you're not on the list of people who know for a specific hw bug, you don't get to talk about it
crabbedhaloablut has quit []
Danct12 has quit [Quit: A-lined: User has been AVIVA lined]
<DemiMarie> what matters most for me (and likely for my users) is that the GPU is considered a defended security boundary, which it obviously is
<DemiMarie> (and to be clear: I do not expect a reply sima, so do not feel bad for being unable to give one)
<DemiMarie> Also, if I made you uncomfortable, I apologise.
junaid has quit [Remote host closed the connection]
<DemiMarie> sima: thank you for what you did say, it really does give me confidence in Intel’s handling of hardware vulns.
gouchi has quit [Quit: Quitte]
Danct12 has joined #dri-devel
benjaminl has joined #dri-devel
Surkow|laptop has quit [Quit: 418 I'm a teapot - NOP NOP NOP]
Surkow|laptop has joined #dri-devel
<Lynne> bcheng: just copy paste the license and keep the headers the same
rosasco has joined #dri-devel
Duke`` has quit [Ping timeout: 480 seconds]
sima has quit [Ping timeout: 480 seconds]
rosasco has quit [Remote host closed the connection]
Haaninjo has quit [Quit: Ex-Chat]
guru__ has quit [Ping timeout: 480 seconds]
An0num0us has quit [Remote host closed the connection]
greenjustin_ has quit [Ping timeout: 480 seconds]
crcvxc has joined #dri-devel
nchery is now known as Guest431
nchery has joined #dri-devel
TMM has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
TMM has joined #dri-devel
oneforall2 has joined #dri-devel
vjaquez has quit [Quit: ¡hasta luego!]
vjaquez has joined #dri-devel
oneforall2 has quit [Remote host closed the connection]
pcercuei has quit [Quit: dodo]
tristianc6704 has joined #dri-devel
nchery is now known as Guest434
nchery has joined #dri-devel
nchery has quit []
shashanks_ has joined #dri-devel
youmukonpaku133 has joined #dri-devel
youmukonpaku133 has quit [Remote host closed the connection]
youmukonpaku133 has joined #dri-devel
youmukonpaku133 has quit []
youmukonpaku133 has joined #dri-devel
Guest434 has quit [Ping timeout: 480 seconds]
shashanks__ has quit [Ping timeout: 480 seconds]