ChanServ changed the topic of #dri-devel to: <ajax> nothing involved with X should ever be unable to find a bar
<karolherbst> dcbaker: seems to be flaky..
<karolherbst> ehh wait
<karolherbst> there are also fails in the test I've written.. uhh
<karolherbst> huh.. something is weird
<karolherbst> heh.. that generated wrapper doesn't include the original header file... maybe something changed with an updated bindgen
<karolherbst> I'll look into that tomorrow
<karolherbst> and ubuntu "rolling" doesn't have a new enough bindgen...
<karolherbst> *sigh*
<dcbaker> karolherbst: soon we’ll be able to build bindgen in the build process, lol
<HdkR> Oh wow, how long would that take?
<HdkR> That sounds like my idea of forcing people to rebuild a clang fork :D
<soreau> HdkR: upstream the patches, then you won't have to ;)
<HdkR> :)
<kurufu> Since anv exports YUV formats as a single plane, is that essentially set in stone now? I'm asking because Vulkan video makes those formats more popular, and the single-plane export seems to make lots of things harder (if only because the world expects 3 planes).
<Sachiel> what do you mean? The number of planes depends on the specific format, no?
<kurufu> Despite the format being x planes, drivers don't necessarily need to export a representation with multiple planes.
<Sachiel> An image’s memory planecount (as returned by drmFormatModifierPlaneCount) is distinct from its format planecount (in the sense of multi-planar Y′CBCR formats). In VkImageAspectFlags, each VK_IMAGE_ASPECT_MEMORY_PLANE_i_BIT_EXT represents a memory plane and each VK_IMAGE_ASPECT_PLANE_i_BIT a format plane.
<Company> fun fact: Disabling optimizations when generating spirv with glslang/glslc can result in significantly faster shader code
<Company> almost as fast as zink now
<daniels> soreau: could you please file an MR?
<soreau> daniels: Now that I thought about it, it needs zmike's MR to be merged first since it's against the egl branch of !24700
<soreau> daniels: also I'd like some feedback on fixing the crash with resizing weston-simple-egl regardless of vblank_mode/swap_interval
<soreau> in that trace, wsi_wl_surface is NULL because it's using the old swapchain after it has been replaced with a new swapchain but before it was pruned
<soreau> attempts to simply prune before it was used were unsuccessful
<karolherbst> HdkR: building bindgen takes like a minute
<karolherbst> dcbaker: okay.. so the arch container simply had bindgen 0.64 and there the static inline stuff is mostly broken..
<daniels> soreau: the patch looks reasonable to me, but it would need zmike to comment on it since the flush_queue is zink-specific
<soreau> daniels: unfortunately, zmike said he can't reproduce the resize crash
<soreau> but no telling what sort of dragons he has installed ;)
<soreau> or maybe it's only on radv? idk which chip he tried
<daniels> right, but he's the one who knows whether flushing the flush queue at that point is theoretically correct or not
<daniels> I didn't even know zink had its own internal queue of stuff
<Wallbraker> Does do anything? Is it a newish variable?
<pixelcluster> Wallbraker: it does something on 23.2 and newer
<pixelcluster> the different trace modes have had different environment variables to enable before that, but docs are generated from latest main
<Wallbraker> Ah yes, I'm on 23.1.7, that clears that up. Really those things should have version tags so you know which version they apply to.
<zmike> soreau: I slept on it and I think is what you need
<soreau> zmike: Thanks, I'll try it. Were you ever able to reproduce the resize crash? (on radv?)
<zmike> eventually
<soreau> super good
<DavidHeidelberg[m]> running GPT-2 on iris (TGL): ~ 112ms per query; llvmpipe (i7-1185G7 @ 3.00GHz): ~150ms (but CPU heats up quickly, so power consumption would be very different)
<DavidHeidelberg[m]> outcome: OpenCL may not be the best, but it's still better than the CPU on an integrated GPU with shared memory for LLVM
<DavidHeidelberg[m]> *LLM damn
<soreau> zmike: yes it works
<soreau> zmike: did you get a chance to glance at the swap interval patch?
idr has joined #dri-devel
<karolherbst> DavidHeidelberg[m]: wondering how well it runs on Intel's stack
<karolherbst> but yeah..
<karolherbst> on Intel's iGPUs it's mostly about power efficiency
<karolherbst> and.. there might be some overhead randomly :D
<karolherbst> always hard to tell with tests running quickly
kts has joined #dri-devel
<zmike> soreau: no, but I'm also probably not the ultimate reviewer for such a thing
<DavidHeidelberg[m]> karolherbst: RUSTICL_ENABLE=iris GPU=1 PYTHONPATH="." JIT=1 python examples/ --model_size=gpt2 --prompt "Hello." --count 46 --temperature 0 --timing
<karolherbst> heh...
<karolherbst> DavidHeidelberg[m]: what repo do I need to clone?
<karolherbst> ahh, it's tinygrad
<soreau> zmike: ok
<karolherbst> DavidHeidelberg[m]: looks like intel is a bit faster :D but I'm also using a debug build
<karolherbst> What is Rusticl?: Rusticl is a library for building and manipulating Rusticl objects. It is a library for building and manipulating Rusticl objects. It is a library for building and manipulating Rusticl objects. It is a
<karolherbst> oh well..
<karolherbst> but that question breaks with rusticl
<karolherbst> "ValueError: probabilities do not sum to 1" :D
<karolherbst> guess I might want to fix that
<karolherbst> works with llvmpipe...
<DavidHeidelberg[m]> Interesting :D
<DavidHeidelberg[m]> Usually so far rusticl w/ iris worked best for me
<karolherbst> I break it way too easily
<karolherbst> but yeah...
<karolherbst> I'm sure it's a 0.9999 vs 1.0 thing
<karolherbst> and some precision is slightly off
<karolherbst> let's see...
<karolherbst> yeah.. works with count 45 :D
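The "probabilities do not sum to 1" error is the kind of check a sampler runs before drawing a token. If karolherbst's guess is right and the softmax output is only slightly off 1.0, one common workaround is to renormalize in double precision before sampling. A minimal sketch, assuming a hypothetical helper `renormalize` and an arbitrary tolerance; this is not tinygrad or rusticl code:

```python
import math

def renormalize(probs, tol=1e-3):
    """Rescale probabilities so they sum to 1.0 (hypothetical helper)."""
    total = math.fsum(probs)
    # Refuse to paper over results that are genuinely wrong (NaN, far from 1.0).
    if not math.isfinite(total) or abs(total - 1.0) > tol:
        raise ValueError(f"probabilities sum to {total}, outside tolerance")
    return [p / total for p in probs]

# Values whose total is slightly short of 1.0, as a low-precision softmax might give.
raw = [0.25, 0.25, 0.25, 0.2499]
fixed = renormalize(raw)
print(math.fsum(raw), math.fsum(fixed))
```

The tolerance matters: a small drift is rescaled, while a sum that is far off still raises, so a real math bug in the driver would not be hidden.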
<DavidHeidelberg[m]> Iris doing trolololo
<DavidHeidelberg[m]> Which gen?
<karolherbst> but yeah.. Intel's stack is significantly faster.. I guess there is something I should optimize then
<karolherbst> 9.5
<karolherbst> "Rusticl is an EULA agreement-based and open source project which aims to provide a framework for building web apps based on Rust. This means that the project is open source."
<daniels> well, those are certainly all words
<karolherbst> it's fascinating how sure it is and how much garbage it is saying
<karolherbst> *sure of itself
<karolherbst> probably something debug build
<DavidHeidelberg[m]> it's GPT-2, with llama (but it takes 13G of VRAM) it's "sometimes" reasonable
<karolherbst> clinfo is also 3x quicker with release builds
<DavidHeidelberg[m]> yeah, I'm using Debian nightly builds (so release)
<karolherbst> ahh
<tnt> I should be on a release build too
<karolherbst> anyway.. I'm sure there are optional CL extensions or something
<karolherbst> or the runtime overhead is just too high
<karolherbst> but glad to know that it also produces same nonsense with Intel
<karolherbst> so the math seems to be alright
<karolherbst> nice.. seems to work with zink as well... kinda
<karolherbst> I kinda have to find a solution for the gpu getting reset too quickly
<karolherbst> and I get tons of "MESA: error: zink: couldn't allocate memory: heap=4 size=20102"
<karolherbst> oh well...
<karolherbst> okay.. yeah it's mostly GPU side things ...
<karolherbst> and some memcpies, but whatever
<karolherbst> 83% is just the python runtime
<karolherbst> GPU is 90% busy, so there is that
<karolherbst> and it's always creating a context with profiling enabled.. oh boi.. anyway, I suspect we can be more optimized on the kernel side somewhere
<karolherbst> mhhh.. maybe I have some synchronization bug somewhere... zink is also hitting something all the time
<karolherbst> oof... ralloc_asprintf accounts for 33% of my launch kernel overhead...
<karolherbst> "SIMD16 skipped because workgroup size 1 already fits in SIMD8" getting tons of those
<karolherbst> or "SIMD32 skipped because workgroup size 16 already fits in SIMD16"
<karolherbst> Kayden: ^^ seems like this is a significant CPU overhead for launching compute jobs
<karolherbst> like.. the biggest part of it
<karolherbst> might make sense to skip those ralloc_asprintf calls
<karolherbst> inside brw_simd_should_compile that is
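The overhead described here comes from formatting a diagnostic string on every launch even when nobody reads it. A minimal sketch of the usual fix, building the message only on request, is below. It is a toy model in Python that only reproduces the two messages quoted above; `simd_should_compile`, `select_simd`, and the `want_reason` flag are hypothetical names, and the width rule is an assumption, not Mesa's actual logic:

```python
SIMD_WIDTHS = (8, 16, 32)

def simd_should_compile(width, workgroup_size, want_reason=False):
    """Return (ok, reason). The reason string is only built when requested."""
    idx = SIMD_WIDTHS.index(width)
    if idx > 0 and workgroup_size <= SIMD_WIDTHS[idx - 1]:
        reason = None
        if want_reason:
            # String formatting happens here only, off the hot path.
            prev = SIMD_WIDTHS[idx - 1]
            reason = (f"SIMD{width} skipped because workgroup size "
                      f"{workgroup_size} already fits in SIMD{prev}")
        return False, reason
    return True, None

def select_simd(workgroup_size):
    """Hot path: pick the widest acceptable width without formatting anything."""
    chosen = None
    for width in SIMD_WIDTHS:
        ok, _ = simd_should_compile(width, workgroup_size)
        if ok:
            chosen = width
    return chosen

print(select_simd(1), select_simd(16))
print(simd_should_compile(16, 1, want_reason=True)[1])
```

The same idea carries over to C by passing a nullable error pointer and skipping the allocation when it is NULL.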
<jrpan> Hi, I'm new to graphics rendering and drivers and please forgive me for asking dumb questions. I've been playing with the vulkan intel driver. I see that when anv_cmd_buffer_bind_descriptor_set is called, the descriptor is saved to the cmd_buffer at cmd_buffer->state.gfx.base->descriptors[set_index].
<jrpan> If there are multiple draws in a command buffer, and for each draw, new descriptors are bound, how are the descriptors being distinguished between drawcalls that are within the same command buffer (when the buffer is submitted to GPU)?
<jrpan> Because it's the same command buffer and the later bound descriptor would just overwrite the previously bound descriptor. Or is my understanding just wrong?
<jrpan> I just want to reference all descriptors that are being used in a command buffer at queuesubmit. But so far I can only get the last descriptor used.
<pendingchaos> most likely vkCmdDraw/etc read from state.gfx.base->descriptors to create commands that actually bind the descriptors
<jrpan> So at vkCmdDraw, the state.gfx.base->descriptors is being "saved" to the vkcmdDraw?
<Sachiel> anv_CmdBindPipeline in anv_cmd_buffer.c takes care of that
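The general pattern pendingchaos describes can be shown with a toy model: binding only updates CPU-side state, and each recorded draw captures whatever is bound at that moment, so later binds do not affect earlier draws. This is a sketch in Python with hypothetical names (`ToyCommandBuffer`, `bind_descriptor_set`, `draw`); it does not show where anv actually emits the binding commands:

```python
class ToyCommandBuffer:
    def __init__(self):
        self.bound = {}     # set index -> descriptor set, overwritten on each bind
        self.commands = []  # recorded command stream, consumed at submit time

    def bind_descriptor_set(self, set_index, descriptor_set):
        # Only CPU-side state changes; nothing is recorded yet.
        self.bound[set_index] = descriptor_set

    def draw(self, vertex_count):
        # Snapshot the current bindings into the recorded command.
        self.commands.append(("draw", vertex_count, dict(self.bound)))

cmd = ToyCommandBuffer()
cmd.bind_descriptor_set(0, "set_A")
cmd.draw(3)
cmd.bind_descriptor_set(0, "set_B")
cmd.draw(6)

used = [command[2][0] for command in cmd.commands]
print(used)  # ['set_A', 'set_B']
```

In this model, collecting every descriptor set used by a command buffer means recording them at draw time rather than reading the final state at submit.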
<mareko> tarceri_: I've noticed that gl_program::sh::UniformBlocks is populated before UBO linking for GLSL, but not SPIR-V. If I add UBOs before UBO linking, do I also need to update gl_program::sh::UniformBlocks?
<Kayden> karolherbst: that isn't handled by shader cache?
<karolherbst> Kayden: nope, that's all at the SIMD selection level
<karolherbst> Kayden: iris_launch_grid -> iris_upload_compute_state -> iris_upload_compute_walker -> brw_cs_get_dispatch_info -> brw_simd_select_for_workgroup_size -> brw_simd_should_compile -> ralloc_asprintf
<karolherbst> guess that's only hit if the workgroup size is variable...
donaldrobson has quit [Ping timeout: 480 seconds]
ngcortes has joined #dri-devel
<karolherbst> it kinda feels all a bit suboptimal
<karolherbst> but I can't really put my finger on it yet
<karolherbst> ehh
<karolherbst> nvm, maybe I should check what kernels are actually used often
<karolherbst> but I'm wondering if there are any magic intel instructions we might want to wire up to speed up matrix multiplications
<karolherbst> I kinda need GPU perf profiling tools...
pannage has joined #dri-devel
vliaskov has quit [Remote host closed the connection]
<anholt> hakzsam: I've got some wip ci stuff that I'd like to have the 1.3.6 cts for. are you still working on polishing that today?
<anholt> hakzsam: I guess we're waiting for !25284 for cts uprev?
<karolherbst> okay
<karolherbst> I have a fun idea for an optimization of compute kernels
<karolherbst> so all threads execute the lower loop, which is doing the exact same thing in all threads
<karolherbst> and stores the same result at the same location in all threads
<karolherbst> I wonder if we could just vectorize or optimize that code in a way that it's not doing something silly like this
<karolherbst> mhhh
<karolherbst> this entire loop could be an atomic... actually.. the entire shared memory array could be one atomic
<karolherbst> I wonder how terrible it would be to match this pattern...
<airlied> uggh yeah I wonder if tinygrad itself could do better in its generator code there
<karolherbst> the pain point is that it is shared code with the CPU stuff...
<karolherbst> somehow
<karolherbst> but I do wonder if we actually just want to match that pattern, because it's actually not _that_ complicated
<karolherbst> the loop is uniform, which we should be able to prove
<karolherbst> and if every thread stores it at the same location, we should do something smarter
<karolherbst> yeah
<karolherbst> `con 32 %54 = fmul! %62, %53 (0.000781)`
<karolherbst> so yeah.. nir detects it as uniform
<karolherbst> the biggest question is simply what we should optimize this to
illwieckz has joined #dri-devel
<karolherbst> maybe subgroup ops and just cut that loop by the subgroup size?
<karolherbst> but yeah.. if LLVM is able to optimize this to something not stupid, no wonder Intel is faster
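The pattern under discussion, every thread running the same uniform loop and storing the same result to the same location, can be simulated on the CPU to show what cutting the loop by the subgroup size would buy. The sketch below assumes the loop is a plain sum reduction, which the log does not confirm; `redundant_per_thread` and `split_across_subgroup` are hypothetical names, and the final `math.fsum` only stands in for a real subgroup reduction:

```python
import math

def redundant_per_thread(data, num_threads):
    """Every simulated thread runs the full loop and stores the same value."""
    shared = [0.0]
    for _tid in range(num_threads):
        acc = 0.0
        for x in data:
            acc += x
        shared[0] = acc  # same location, same value, written num_threads times
    return shared[0]

def split_across_subgroup(data, subgroup_size):
    """Each lane sums a strided slice; the partial sums are then combined once."""
    partial = [math.fsum(data[lane::subgroup_size])
               for lane in range(subgroup_size)]
    return math.fsum(partial)

data = [i * 0.5 for i in range(64)]
a = redundant_per_thread(data, 8)
b = split_across_subgroup(data, 8)
print(a, b)
```

With 8 lanes, each lane walks 8 elements instead of 64. For floating-point data the two orderings may differ in the last bits, which is one reason a compiler would need care before applying this automatically.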
