OftenTimeConsuming has quit [Remote host closed the connection]
OftenTimeConsuming has joined #dri-devel
luc has quit [Quit: Page closed]
luc has joined #dri-devel
Leopold_ has quit [Remote host closed the connection]
Leopold has joined #dri-devel
co1umbarius has joined #dri-devel
columbarius has quit [Ping timeout: 480 seconds]
oneforall2 has quit [Remote host closed the connection]
oneforall2 has joined #dri-devel
yyds has joined #dri-devel
DragoonAethis has quit [Quit: hej-hej!]
DragoonAethis has joined #dri-devel
yuq825 has joined #dri-devel
mripard_ has joined #dri-devel
Namarrgon has joined #dri-devel
mripard has quit [Ping timeout: 480 seconds]
angerctl has quit [Ping timeout: 480 seconds]
yyds has quit []
yyds has joined #dri-devel
<luc>
cwabbott: I noticed that glibc started using SIMD/FP registers as of https://sourceware.org/git/?p=glibc.git;a=commit;h=e6f3fe362f1aab78b1448d69ecdbd9e3872636d3, but my test used an older glibc version (2.31), which does NOT contain instructions like `ldr q0, [src]`. So I guess __memcpy_aarch64_simd() might not be faster than normal LD/ST if the destination of the memcpy is VRAM
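To measure that on a given board, a plain 8-byte load/store loop is a reasonable baseline to time against memcpy() with the destination pointing at a mapped VRAM buffer. This is only a sketch: `copy_u64` is a hypothetical helper, and it assumes both pointers are 8-byte aligned (true for malloc() results and for the start of a page-aligned mapping).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar baseline: 8 bytes per load/store, then a byte tail.
 * Assumes dst and src are both 8-byte aligned. The volatile destination
 * keeps the compiler from turning the loop back into a memcpy() call. */
static void copy_u64(void *dst, const void *src, size_t n)
{
	volatile uint64_t *d = dst;
	const uint64_t *s = src;
	size_t words = n / sizeof(uint64_t);

	for (size_t i = 0; i < words; i++)
		d[i] = s[i];

	/* Copy the remaining 0-7 bytes one at a time. */
	volatile unsigned char *db = (volatile unsigned char *)(d + words);
	const unsigned char *sb = (const unsigned char *)(s + words);
	for (size_t i = 0; i < n % sizeof(uint64_t); i++)
		db[i] = sb[i];
}
```

Whether this beats the SIMD path on write-combined memory is exactly the open question here, so timing both on the target is more reliable than assuming either way.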
luc has quit [Quit: Page closed]
<HdkR>
ldr q0 is ASIMD
<HdkR>
Oh, does not contain those
luben has joined #dri-devel
neniagh has quit [Ping timeout: 480 seconds]
heat has quit [Ping timeout: 480 seconds]
neniagh has joined #dri-devel
YuGiOhJCJ has quit [Ping timeout: 480 seconds]
lanodan has quit [Ping timeout: 480 seconds]
bbrezillon has quit [Ping timeout: 480 seconds]
aissen has quit [Ping timeout: 480 seconds]
aissen has joined #dri-devel
bbrezillon has joined #dri-devel
bmodem has joined #dri-devel
kts has joined #dri-devel
luben has quit [Ping timeout: 480 seconds]
Haaninjo has joined #dri-devel
Haaninjo has quit []
kts has quit [Quit: Leaving]
kts has joined #dri-devel
psykose has quit [Remote host closed the connection]
ptrc has quit [Remote host closed the connection]
ptrc has joined #dri-devel
psykose has joined #dri-devel
itoral has joined #dri-devel
sadlerap3 has quit [Remote host closed the connection]
sadlerap3 has joined #dri-devel
kts has quit [Ping timeout: 480 seconds]
urja has quit [Ping timeout: 480 seconds]
mripard_ has quit []
mripard has joined #dri-devel
mszyprow has joined #dri-devel
kzd has quit [Ping timeout: 480 seconds]
jsa has joined #dri-devel
sima has joined #dri-devel
sghuge has quit [Remote host closed the connection]
sghuge has joined #dri-devel
fab has joined #dri-devel
mvlad has joined #dri-devel
macslayer has quit [Ping timeout: 480 seconds]
tursulin has joined #dri-devel
<MrCooper>
karolherbst: BTW, any reason rusticl on radeonsi couldn't be tested in CI?
oneforall2 has quit [Ping timeout: 480 seconds]
oneforall2 has joined #dri-devel
Leopold has quit [Remote host closed the connection]
Leopold has joined #dri-devel
hansg has joined #dri-devel
Leopold has quit [Remote host closed the connection]
Leopold has joined #dri-devel
rgallaispou has joined #dri-devel
lynxeye has joined #dri-devel
thaytan has quit [Ping timeout: 480 seconds]
<airlied>
MrCooper: getting CL cts into a useful form, though we could just pick a couple of main tests
<MrCooper>
I was thinking of piglit
<MrCooper>
that should have caught this regression and a few before at least
vliaskov has joined #dri-devel
simon-perretta-img has quit [Ping timeout: 480 seconds]
simon-perretta-img has joined #dri-devel
crabbedhaloablut has joined #dri-devel
i509vcb has quit [Quit: Connection closed for inactivity]
<pq>
jani, what if you forward-declare an enum, use it in struct definition, and then include the definition of the enum which results in a different size? Or maybe you just copy the struct without ever having the enum definition?
pcercuei has joined #dri-devel
<emersion>
pq: it appears the compiler remembers the enum has unknown size, and will require a size definition before you can do these things
<jani>
pq: it's an incomplete type similar to a struct/union forward declaration, and can't use it before you know the size
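A minimal illustration of the forward-declared enum being discussed. It relies on the GNU C extension (ISO C before C23 has no enum forward declarations, and C23 only allows them with a fixed underlying type), so it assumes GCC or Clang in their default GNU modes; all names are made up.

```c
/* GNU C extension: declare the enum tag without its enumerators. */
enum color_mode;

/* While the enum is incomplete, only pointers to it are allowed. A by-value
 * member such as `enum color_mode mode;` would not compile here, because
 * the compiler does not know the enum's size yet. */
struct plane_state {
	const enum color_mode *mode;
};

/* The full definition completes the type; by-value use works from here on. */
enum color_mode {
	COLOR_MODE_RGB,
	COLOR_MODE_YUV,
};

static int plane_is_yuv(const struct plane_state *s)
{
	return *s->mode == COLOR_MODE_YUV;
}
```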
simon-perretta-img has quit [Ping timeout: 480 seconds]
Leopold has quit [Remote host closed the connection]
<jani>
I know it's a bit hacky, but if you can use it to avoid pulling in some headers everywhere, I'll use it
<emersion>
it's non-standard, so i won't use it
simon-perretta-img has joined #dri-devel
<jani>
fair, though the kernel is explicitly non-standard
<jani>
I'd also avoid it outside of the kernel
<emersion>
yeah
heat has joined #dri-devel
Nefsen402 has quit [Remote host closed the connection]
bl4ckb0ne has quit [Remote host closed the connection]
emersion has quit [Remote host closed the connection]
bl4ckb0ne has joined #dri-devel
Nefsen402 has joined #dri-devel
emersion has joined #dri-devel
sukrutb has quit [Ping timeout: 480 seconds]
pjakobsson has quit [Remote host closed the connection]
cmichael has joined #dri-devel
drobson has joined #dri-devel
biju has joined #dri-devel
vliaskov has quit [Read error: Connection reset by peer]
janek has joined #dri-devel
pa- has quit []
pa has joined #dri-devel
drobson has quit [Ping timeout: 480 seconds]
Leopold has joined #dri-devel
DodoGTA has quit [Quit: DodoGTA]
DodoGTA has joined #dri-devel
kts has joined #dri-devel
kts has quit []
DodoGTA has quit [Quit: DodoGTA]
DodoGTA has joined #dri-devel
sgm has quit [Remote host closed the connection]
sgm has joined #dri-devel
kts has joined #dri-devel
kts has quit []
rasterman has joined #dri-devel
columbarius has joined #dri-devel
co1umbarius has quit [Ping timeout: 480 seconds]
kts has joined #dri-devel
<tomba>
Looking at the atomic helpers, the current sequence when enabling the video pipeline is: crtc enable, bridge pre_enable, encoder enable, bridge enable. The crtc's enable happening before the bridge's pre_enable strikes me as a bit odd, especially as the bridge's pre_enable documentation says "The display pipe (i.e. clocks and timing signals) feeding this bridge will not yet be running when this callback is called". Does anyone have insight into why the sequence has evolved to be this way? Does the DRM framework expect that there's always an encoder which will somehow gate the signal from the CRTC until the encoder enable is called?
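The ordering tomba describes can be written down as a toy model (hypothetical names, not real DRM types or callbacks) that makes the tension with the pre_enable documentation explicit: by the time the modeled bridge pre_enable runs, the modeled CRTC timing is already on. The real sequencing lives in the atomic helpers, e.g. drm_atomic_helper_commit_modeset_enables().

```c
#include <stdbool.h>
#include <string.h>

/* Toy model only; these are not DRM structures. */
struct pipe_model {
	bool timing_running;        /* CRTC is producing clocks/timings */
	bool pre_enable_saw_timing; /* was timing already on during pre_enable? */
	char log[128];
};

/* Append ", name" (or just "name" for the first step) to the log. */
static void log_step(struct pipe_model *p, const char *name)
{
	size_t used = strlen(p->log);
	if (used)
		strncat(p->log, ", ", sizeof(p->log) - used - 1);
	used = strlen(p->log);
	strncat(p->log, name, sizeof(p->log) - used - 1);
}

/* Enable order as described in the question. */
static void model_enable(struct pipe_model *p)
{
	log_step(p, "crtc enable");
	p->timing_running = true;

	log_step(p, "bridge pre_enable");
	p->pre_enable_saw_timing = p->timing_running;

	log_step(p, "encoder enable");
	log_step(p, "bridge enable");
}
```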
kts has quit [Ping timeout: 480 seconds]
<karolherbst>
MrCooper: not really
<karolherbst>
piglit is better than nothing I guess, but I also kinda want the CL CTS to be tested, it's just a pain to do
<mupuf>
karolherbst: what's different about cl vs gl/gles/vk?
<karolherbst>
and it's unknown what comes out of it
<mupuf>
Ah, UXL, not CXL
<mupuf>
that was confusing ;)
<karolherbst>
ahh yeah.. my bad
<mupuf>
np
<mupuf>
too many three letter acronyms
<mupuf>
Thanks Karol, keep up the good work!
<karolherbst>
:) thanks
<karolherbst>
but yeah...
<karolherbst>
I kinda want to test the CL CTS in CI
<karolherbst>
actually I kinda like my deqp adapter idea....
<mupuf>
that would be an easy path forward, yeah
<karolherbst>
just also slow if it simply would `execv` the binaries...
<mupuf>
deqp is extremely slow at test enumeration
<mupuf>
so don't worry
<karolherbst>
I wonder if one could `dlopen` them and just call into their `main` function instead...
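For the `execv` variant of the deqp adapter idea, the core of a runner could look like this: fork, exec one CTS binary, and map its exit status onto deqp-runner style outcomes. The outcome names and the treatment of exit code 127 as an exec error are assumptions of this sketch, not how any existing runner defines them.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Illustrative outcome names only. */
enum test_result { TEST_PASS, TEST_FAIL, TEST_CRASH, TEST_ERROR };

/* Run one CTS binary (argv[0] must be a path) and classify its exit. */
static enum test_result run_test(char *const argv[])
{
	pid_t pid = fork();
	if (pid < 0)
		return TEST_ERROR;
	if (pid == 0) {
		execv(argv[0], argv);
		_exit(127); /* only reached if exec failed */
	}

	int status;
	if (waitpid(pid, &status, 0) < 0)
		return TEST_ERROR;
	if (WIFSIGNALED(status))
		return TEST_CRASH;
	if (WIFEXITED(status)) {
		if (WEXITSTATUS(status) == 0)
			return TEST_PASS;
		if (WEXITSTATUS(status) == 127)
			return TEST_ERROR;
		return TEST_FAIL;
	}
	return TEST_ERROR;
}
```

The `dlopen` alternative would avoid the process startup cost, but a crashing test would then take the runner down with it, which is one argument for keeping the fork.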
<mupuf>
how many tests are there, and how long is a typical runtime?
<karolherbst>
all tests in non wimpy mode are like between 3 and 70 hours apparently
<karolherbst>
70 as that's what jenatali needed for a full CL CTS run
<karolherbst>
wimpy reduces amount of iteration in arithmetic tests
<karolherbst>
if I pass `wimpy` and `quick` into my runner it's like 10 minutes parallelized
<karolherbst>
but it skips a bunch of corner case tests
<karolherbst>
subtests without taking image formats/order into account is roughly 2500
<mupuf>
I see
<karolherbst>
but you could split those up if you wanted to
<karolherbst>
probably around 5000 then
<mupuf>
sure, but let's just say that it is a little silly for it to take that long
<mupuf>
just like igt used to take a month to run
<karolherbst>
it's testing a lot of stuff
<karolherbst>
like
<karolherbst>
arithmetic precision
<karolherbst>
and it iterates over a bunch of random values just to make sure the runtime is okay there
<mupuf>
right, but knowing what to test is an important thing too
<karolherbst>
yeah
<mupuf>
I'm sure the wimpy mode is a good start anyway
<karolherbst>
yeah
<karolherbst>
it's good enough
<karolherbst>
it doesn't catch all subnormal/nan related corner case, but whatever
<mupuf>
10 minutes means it could run on one runner
<karolherbst>
that's 10 minutes on a 20 core machine
<mupuf>
is it cpu-limited?
<karolherbst>
yes
<karolherbst>
compiling a lot of C code
<karolherbst>
well some tests
<karolherbst>
some tests are GPU limited
<mupuf>
we have 16 cores runners
<karolherbst>
I could do a 4 core run and see how that changes things
<mupuf>
(5950X, for navi21/31)
<mupuf>
4 cores run? It would only make sense for lavapipe
camus has quit [Remote host closed the connection]
<mupuf>
For bare metal runners, our slowest runners are the steam decks
<karolherbst>
yeah... I just want to see how slow things become if you limit on the CPU side aggressively
<karolherbst>
but it's kinda wild.. some test utilize the GPU at 100%, others at like 0.1%
<karolherbst>
and most isn't even runtime, it's just validation on the CTS side
<karolherbst>
as you know... it calculates the same thing also on the CPU for checking the result
<karolherbst>
at least on intel CPU util idles around 8% while running the CTS :')
<karolherbst>
(with 4 cores)
<karolherbst>
now it's 0% :')
kts has joined #dri-devel
yyds has quit [Remote host closed the connection]
<mupuf>
ha ha, you are latency bound :D Everything just keeps waiting on each other
<karolherbst>
mhhh
<karolherbst>
doubt
<karolherbst>
my cores are still at 400% in userspace
<karolherbst>
ehh
<karolherbst>
each one at 100%
<karolherbst>
I mean.. the CL CTS for 90% of its time runs the same code on the CPU and checks if the result is correct, so if it wouldn't be CPU bound it would be kinda sad
<pq>
I suspect a simple misread idle vs. usage % :-)
<karolherbst>
but it kinda depends on the tests, some are more GPU bound, for non optimal code reasons
<karolherbst>
pq: no seriously.. some of the tests are just doing 100% math
<karolherbst>
if you validate tens of thousands of results from e.g. `sin`, that's what you get
<karolherbst>
(and all the other builtins)
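The CPU-side validation described here boils down to comparing each device result against a higher-precision reference and counting the error in ulps. This sketch uses float division because a correctly rounded reference is trivial to get from double precision (double rounding is harmless for a single division); `device_div` is a hypothetical stand-in for a value read back from the GPU, and a real `sin` check would need a more careful reference and a nonzero tolerance.

```c
#include <stdint.h>
#include <string.h>

/* Map a float onto an integer line where neighbouring representable values
 * differ by exactly 1 (+0.0 and -0.0 both map to 0). */
static int64_t float_to_ordered(float f)
{
	uint32_t u;
	memcpy(&u, &f, sizeof u);
	int64_t mag = (int64_t)(u & 0x7fffffffu);
	return (u & 0x80000000u) ? -mag : mag;
}

static int64_t ulp_distance(float a, float b)
{
	int64_t d = float_to_ordered(a) - float_to_ordered(b);
	return d < 0 ? -d : d;
}

/* Hypothetical stand-in for a result computed on the device. */
static float device_div(float x, float y) { return x / y; }

/* Small deterministic LCG so runs are reproducible. */
static uint32_t lcg_next(uint32_t *state)
{
	*state = *state * 1664525u + 1013904223u;
	return *state;
}

/* Check n pseudo-random x/y pairs in [1, 257) against a double-precision
 * reference and return the largest error seen, in ulps. */
static int64_t max_div_error_ulps(uint32_t seed, int n)
{
	int64_t worst = 0;
	for (int i = 0; i < n; i++) {
		float x = (float)(lcg_next(&seed) >> 8) / 65536.0f + 1.0f;
		float y = (float)(lcg_next(&seed) >> 8) / 65536.0f + 1.0f;
		float ref = (float)((double)x / (double)y);
		int64_t e = ulp_distance(device_div(x, y), ref);
		if (e > worst)
			worst = e;
	}
	return worst;
}
```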
<pq>
karolherbst, I think mupuf read "usage" when you said "idle".
<karolherbst>
ahh I see
<mupuf>
pq: indeed
<karolherbst>
ehh wait
<karolherbst>
I wrote CPU
<karolherbst>
I meant GPU
<karolherbst>
duh
<pq>
lol
<mupuf>
makes more sense :p
<karolherbst>
my fault 🙃
<karolherbst>
mhh but yeah.. limiting to 4 cores kinda doubles the CL CTS runtime on my intel GPU
<karolherbst>
could be worse
bmodem has quit [Ping timeout: 480 seconds]
<karolherbst>
but we also have a wimpy factor option in some tests, which just adjusts how many iterations are done
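One plausible way a wimpy factor can cut iterations is to stride through the input space instead of visiting every candidate. The CTS's actual logic may differ; `for_each_input`, `wimpy_iterations`, `count_input` and the stride scheme are assumptions for illustration.

```c
#include <stdint.h>

typedef void (*input_fn)(uint64_t input, void *ctx);

/* Hypothetical: visit every wimpy_factor-th input out of `total`
 * candidates. A factor of 1 is a full (non-wimpy) run. */
static uint64_t for_each_input(uint64_t total, uint32_t wimpy_factor,
			       input_fn fn, void *ctx)
{
	if (wimpy_factor == 0)
		wimpy_factor = 1;
	uint64_t visited = 0;
	for (uint64_t i = 0; i < total; i += wimpy_factor) {
		fn(i, ctx);
		visited++;
	}
	return visited;
}

/* Number of inputs the loop above visits, computed without running it. */
static uint64_t wimpy_iterations(uint64_t total, uint32_t wimpy_factor)
{
	if (wimpy_factor == 0)
		wimpy_factor = 1;
	return (total + wimpy_factor - 1) / wimpy_factor;
}

/* Example callback: just counts visited inputs. */
static void count_input(uint64_t input, void *ctx)
{
	(void)input;
	++*(uint64_t *)ctx;
}
```

With a factor of 64, walking all 2^32 float bit patterns visits 2^26 inputs instead; the skipped inputs are where subnormal/NaN corner cases can slip through.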
<karolherbst>
anyway.. maybe I'll play around with the deqp idea and see how bad it would be
<mupuf>
karolherbst: sounds like a good idea. deqp support is found in a lot of tools, so if you can keep compatibility to it, it would be easiest
<karolherbst>
yeah..
<mupuf>
maybe it could even land in clcts, but that's not a requirement
<karolherbst>
I wonder how hard it would be to convert all those tests to deqp actually, maybe I should bring it up at the CL WG next year as well.. but that kinda requires everybody else wanting it also :D
<karolherbst>
but migrating the entire code base is probably quite a bit of work
bmodem has joined #dri-devel
<tomeu>
haven't been following, but if somebody is going to make any big changes to the CTSes, it would be great if caching of golden results was taken into account
<tomeu>
once I implemented that in my test suite, I started finding concurrency bugs in the kernel driver...
itoral has quit [Remote host closed the connection]
<mupuf>
tomeu: I guess for images, it can make sense... but for precision/arithmetics tests, I doubt caching would improve performance :s
<tomeu>
well, anything that computes something expensive in the CPU that is used to compare it with the driver's output
<tomeu>
guess that is the case if it's CPU bound
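A minimal in-memory version of the golden-result caching tomeu suggests: the expensive CPU reference is computed once per key and reused afterwards. `expensive_reference` is a placeholder workload and the fixed-size open-addressing table is an assumption; a real test suite would persist results on disk, keyed by inputs and CTS version.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 1024

/* Hypothetical memo table for expensive CPU reference results. */
struct golden_cache {
	uint64_t key[CACHE_SLOTS];
	double value[CACHE_SLOTS];
	unsigned char used[CACHE_SLOTS];
	unsigned computed; /* how many times the reference actually ran */
};

/* Placeholder for a costly CPU-side reference computation. */
static double expensive_reference(uint64_t key)
{
	double acc = 0.0;
	for (uint64_t i = 1; i <= 1000; i++)
		acc += (double)(key % i) / (double)i;
	return acc;
}

static double golden_lookup(struct golden_cache *c, uint64_t key)
{
	size_t slot = (size_t)(key % CACHE_SLOTS);
	/* Linear probing: stop at the key or at the first free slot. */
	for (size_t n = 0; n < CACHE_SLOTS; n++) {
		size_t i = (slot + n) % CACHE_SLOTS;
		if (!c->used[i]) {
			c->used[i] = 1;
			c->key[i] = key;
			c->value[i] = expensive_reference(key);
			c->computed++;
			return c->value[i];
		}
		if (c->key[i] == key)
			return c->value[i];
	}
	/* Table full: compute without caching. */
	c->computed++;
	return expensive_reference(key);
}
```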
<mupuf>
it doesn't have to be expensive, it can be that there are just too many tests. Imagine a test that checks that the gpu can increment a variable... for every format and for every acceptable value in the format
<mupuf>
that seems to be what clcts is doing in some cases... not much to cache there
<mupuf>
but... maybe the issue is that the tests are stupid :D
<mupuf>
and instead a lot of operations could be tested at the same time, and only decomposed if the final result doesn't match expectations
<mupuf>
but that requires serious work
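mupuf's idea of checking many operations at once and only decomposing on a mismatch is essentially group testing. The sketch below models each range comparison as one batched check and bisects only failing ranges; the names are hypothetical and `range_matches` stands in for whatever combined comparison the device could do.

```c
#include <stddef.h>

struct bisect_ctx {
	const int *got;      /* results read back from the device */
	const int *expected; /* CPU reference results */
	size_t checks;       /* number of batched range checks performed */
};

/* One batched comparison over [begin, end). */
static int range_matches(struct bisect_ctx *c, size_t begin, size_t end)
{
	c->checks++;
	for (size_t i = begin; i < end; i++)
		if (c->got[i] != c->expected[i])
			return 0;
	return 1;
}

/* Record failing indices into fails[] (capacity end - begin) and return the
 * new failure count. Ranges that match are never split. */
static size_t bisect_failures(struct bisect_ctx *c, size_t begin, size_t end,
			      size_t *fails, size_t nfails)
{
	if (begin >= end || range_matches(c, begin, end))
		return nfails;
	if (end - begin == 1) {
		fails[nfails++] = begin;
		return nfails;
	}
	size_t mid = begin + (end - begin) / 2;
	nfails = bisect_failures(c, begin, mid, fails, nfails);
	return bisect_failures(c, mid, end, fails, nfails);
}
```

With 1024 results and a single bad one, this needs 21 range checks instead of 1024 individual ones; with no failures it needs one. The payoff shrinks as failures become dense.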
<mupuf>
glad you are taking cpu time seriously in the design for the test suite!
<jenatali>
Honestly the CL CTS does at least test things the right way, multithreaded and batching work together. It just does an insane amount of work
yyds has joined #dri-devel
<mupuf>
jenatali: good to hear too!
bmodem has quit [Remote host closed the connection]
bmodem has joined #dri-devel
kts has quit [Ping timeout: 480 seconds]
kts has joined #dri-devel
<Venemo>
Lynne: ping, I am trying to reproduce the multiplanar issue, can you give me a hand, please? it seems your command produces a video file with 1 frame, and I am unsure how best to view it
bmodem has quit [Ping timeout: 480 seconds]
<Venemo>
Lynne: gnome's video app only shows a black screen, and vlc only shows the frame for a very brief time
<Venemo>
Lynne: however, the output seems to be broken with or without enabling the transfer queue, so I am convinced the issue is not in the transfer queue implementation
fab has quit [Quit: fab]
yuq825 has left #dri-devel [#dri-devel]
urja has joined #dri-devel
hansg has quit [Ping timeout: 480 seconds]
<Lynne>
Venemo: sure
janek has quit [Remote host closed the connection]
<karolherbst>
and it has the cooling to run at max clock all the time :)
<karolherbst>
I _could_ check how long it takes on my steamdeck, but then I have to figure out how to do that first
<mupuf>
yeah, not important
<mupuf>
I think expecting about 45-60 would be somewhat accurate
<jenatali>
FL4SHK: you're not authed with NickServ so IRC folks aren't seeing your messages. This is the right place though
FL4SHK has joined #dri-devel
<pq>
emersion, I'd like to wash my hands off that heap naming discussion now, there is nothing useful I could say.
<emersion>
ahah
<emersion>
well thanks for your replies, i think they've been useful
<pq>
glad you think so, I just feel like I'm butting in somewhere that's not my business :-p
<FL4SHK[m]>
hello
<pq>
FL4SHK[m], hello, we see you now.
<FL4SHK[m]>
cool
<FL4SHK[m]>
I was wondering, how difficult would it be to develop either a GCC version of LLVMPipe for a custom GPU (given that I know how to write a GCC backend, which I do) or to develop a new Mesa driver in general?
<FL4SHK[m]>
I am a hardware/software developer and I plan on creating a new FPGA-based workstation.
<FL4SHK[m]>
I would be happy with something like 500 MHz for the system (doable with my Xilinx ZCU104)
urja has joined #dri-devel
<FL4SHK[m]>
This would be a passion project but I'd be happy to open source everything
mszyprow has quit [Ping timeout: 480 seconds]
<FL4SHK[m]>
I typically do that for anything public I make anyway
<FL4SHK[m]>
at least for stuff made outside of work :)
<karolherbst>
FL4SHK[m]: gcc doesn't support that kind of use case
<karolherbst>
or uhm..
<karolherbst>
mhh
<karolherbst>
maybe with libgccjit.so it actually does...
<karolherbst>
though not sure what input it epxects
<FL4SHK[m]>
I see
<karolherbst>
but it looks very very C centric
<karolherbst>
I think it would be a cool research project in terms of "how good is the gcc jit"
<karolherbst>
but not sure we'd have any plans of merging it unless it has strong benefits over llvmpipe
<FL4SHK[m]>
I see
<FL4SHK[m]>
perhaps I could write an LLVM backend for the GPU then
<karolherbst>
at least having a more stable API/ABI would be a benefit
<FL4SHK[m]>
that would be nice yes
urja has quit [Read error: Connection reset by peer]
<karolherbst>
but in general we kinda prefer having the GPU's backend compiler all inside mesa
<karolherbst>
as doing GPU backends in gcc and llvm kinda... well.. have their disadvantages and we move away from that
urja has joined #dri-devel
<FL4SHK[m]>
okay, right
Haaninjo has joined #dri-devel
<FL4SHK[m]>
well, I was actually hoping to make my GPU have really long vectors instead of making a conventional GPU
<karolherbst>
please don't
<FL4SHK[m]>
otherwise it'd be like the CPU core
<karolherbst>
or rather
<karolherbst>
not inside the ISA
<FL4SHK[m]>
hm?
<karolherbst>
vector ISAs have too many drawbacks and everybody moves away from them
<FL4SHK[m]>
why?
<karolherbst>
(if they haven't already)
<karolherbst>
GPU ISAs are mostly scalar
<FL4SHK[m]>
I see
<karolherbst>
makes it easier to optimize code
<karolherbst>
so every SIMD lane is just a scalar program from the ISA point of view
<FL4SHK[m]>
I could make a regular GPU then
<karolherbst>
and the hardware just runs e.g. 32 threads in a SIMD group
<karolherbst>
with each executing the same instruction
<karolherbst>
makes it easier to parallelize data as you won't run into the issue of "what if you can only use 3 of your 32 SIMD lanes, because of program code"
<karolherbst>
vectorization within a thread is destined to fail
<karolherbst>
so you get the most perf if you don't rely on it
<karolherbst>
however.. some GPUs have e.g. vec2 operations for fp16 or 128 bit wide memory load/stores
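A toy C model of the SIMT execution described above: each of 32 lanes runs the same scalar program, and a divergent branch is executed by issuing both sides under lane masks. The function name and the branch are invented for illustration; the return value counts idle lane-slots, i.e. the cost of the "only 3 of 32 lanes" situation.

```c
#include <stdint.h>

#define LANES 32

/* Per lane: if (x < 0) x = -x; else x = x + 1;
 * Both sides are issued to the whole group, each under the mask of lanes
 * that took it; a side nobody took is skipped. Returns idle lane-slots. */
static int simt_abs_or_inc(int32_t x[LANES])
{
	uint32_t neg = 0;
	for (int lane = 0; lane < LANES; lane++)
		if (x[lane] < 0)
			neg |= UINT32_C(1) << lane;

	int idle = 0;
	const uint32_t masks[2] = { neg, ~neg };
	for (int side = 0; side < 2; side++) {
		uint32_t m = masks[side];
		if (!m)
			continue; /* no lane took this side */
		for (int lane = 0; lane < LANES; lane++) {
			if (!(m & (UINT32_C(1) << lane))) {
				idle++; /* lane is masked off for this side */
				continue;
			}
			x[lane] = side == 0 ? -x[lane] : x[lane] + 1;
		}
	}
	return idle;
}
```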
<FL4SHK[m]>
I could do that
<FL4SHK[m]>
the ISA I was going to go with is designed to reduce memory traffic
<karolherbst>
but in the end feel free to experiment :D
mszyprow has joined #dri-devel
<karolherbst>
yeah.. so some ISAs have "scalar" or "uniform" registers which are special cases where one instruction inside the SIMD group has the same result across all lanes
<karolherbst>
so optimizations like that exist
<karolherbst>
but that's alu <-> register traffic stuff
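The "uniform" register optimization depends on knowing that a value is identical across the active lanes of a SIMD group. Real compilers prove that statically with a divergence analysis; this hypothetical runtime check only illustrates the property being proven.

```c
#include <stdbool.h>
#include <stdint.h>

#define LANES 32

/* True if every lane set in `active` holds the same value, in which case a
 * scalar/uniform register could hold one copy instead of 32. */
static bool is_uniform(const uint32_t v[LANES], uint32_t active)
{
	int first = -1;
	for (int lane = 0; lane < LANES; lane++) {
		if (!(active & (UINT32_C(1) << lane)))
			continue; /* inactive lanes don't matter */
		if (first < 0)
			first = lane;
		else if (v[lane] != v[first])
			return false;
	}
	return true;
}
```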
shashanks has joined #dri-devel
rgallaispou1 has joined #dri-devel
konstantin_ is now known as konstantin
yyds has quit [Remote host closed the connection]
rgallaispou2 has joined #dri-devel
rgallaispou1 has quit [Read error: Connection reset by peer]
rgallaispou has quit [Ping timeout: 480 seconds]
<Venemo>
Lynne: I still don't see it
<Venemo>
Lynne: what GPU do you use?
<Lynne>
7900XTX, give me a sec to try on a 6900XT
dliviu has quit [Ping timeout: 480 seconds]
<Venemo>
Lynne: I am also trying on 7900XTX
<Venemo>
sorry but the issue just doesn't happen here
<Venemo>
are you sure we are doing the same thing?
<tomeu>
it doesn't need to be this dialect, can be something different, but of equivalent functionality and level of abstraction
mszyprow has quit [Ping timeout: 480 seconds]
<gfxstrand>
tomeu: A bunch of those we already have.
<gfxstrand>
tomeu: The first question that comes to mind is how big is a tensor?
<gfxstrand>
NIR vectors are currently limited to the SPIR-V limit of 16
<gfxstrand>
With some limitations on exactly what sizes are allowed but those can probably be lifted.
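For reference, the vector widths NIR currently accepts, as I understand the check in Mesa's nir.h (worth re-checking against the source): 1 to 5 components, plus 8 and 16. The helper below is a hypothetical mirror of that rule, not Mesa code.

```c
#include <stdbool.h>

/* Hypothetical mirror of NIR's allowed vector widths. */
static bool vec_width_allowed(unsigned n)
{
	return (n >= 1 && n <= 5) || n == 8 || n == 16;
}
```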
<karolherbst>
~~question is for how long still~~
<tomeu>
they can be megabytes big, but I'm not sure a tensor can be mapped as a vector
<gfxstrand>
tomeu: Oh... Okay, that changes my mental model.
<gfxstrand>
So what is a tensor then? Is it an opaque object that's backed by memory somewhere?
<tomeu>
yep, with attributes such as dimensions (4 is common) and data type
<tomeu>
guess I should investigate a bit more what others are doing for generating machine code from MLIR, I just got really frustrated by having to reinvent NIR in my NPU driver
<tomeu>
there is a mlir-to-spirv translator out there, but I'm not sure what is the level of abstraction of the output
<tomeu>
ie. if convolution operations have been lowered to CL spirv or are still there
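The opaque-object model tomeu describes could be represented by a descriptor like this: the IR would carry a handle plus shape and element-type attributes, while the (possibly multi-megabyte) data lives in a separate buffer. The struct layout, the four-dimension cap and the dtype list are all assumptions.

```c
#include <stddef.h>
#include <stdint.h>

enum tensor_dtype { TENSOR_U8, TENSOR_I16, TENSOR_F16, TENSOR_F32 };

/* Hypothetical opaque tensor descriptor. */
struct tensor_desc {
	uint32_t rank;         /* 4 is common, e.g. N, H, W, C */
	uint32_t dims[4];
	enum tensor_dtype dtype;
	uint64_t buffer;       /* handle/address of the backing memory */
};

static size_t dtype_size(enum tensor_dtype t)
{
	switch (t) {
	case TENSOR_U8:  return 1;
	case TENSOR_I16: return 2;
	case TENSOR_F16: return 2;
	case TENSOR_F32: return 4;
	}
	return 0;
}

/* Size of the backing storage, assuming a dense, unpadded layout. */
static uint64_t tensor_bytes(const struct tensor_desc *t)
{
	uint64_t n = dtype_size(t->dtype);
	for (uint32_t i = 0; i < t->rank && i < 4; i++)
		n *= t->dims[i];
	return n;
}
```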
<Venemo>
Lynne: interesting. it would seem that there are 3 buffer->image and image->buffer copies and the 3rd copy seems to miss a part of the image
<gfxstrand>
tomeu: So, my gut feeling is that if you do it all with intrinsics, find yourself a suffix you're happy with and you can make as many as you'd like.
<tomeu>
ok, I will play with it after holidays and comment back
hansg has joined #dri-devel
<gfxstrand>
It's unclear to me how tensor ops would fit into NIR long-term.
<gfxstrand>
If it's a good match, we may want to add a new op type for tensor ops and make them more first-class.
<gfxstrand>
My biggest fear is that tensor NIR will end up looking so different from regular NIR that we might as well have different IRs.
<gfxstrand>
But I haven't thought about it hard enough for that fear to be an opinion. It's more of a "Hey, there's a mountain over there and I've heard rumors of dragons so, uh... watch out!"
<Lynne>
Venemo: luma plane looks fine, so it's the chroma planes
kzd has joined #dri-devel
Duke`` has joined #dri-devel
<tomeu>
yeah, I also see the tensor type as the main difficulty here
<gfxstrand>
If it remains an opaque thing, that's easy enough.
urja has quit [Ping timeout: 480 seconds]
urja has joined #dri-devel
mszyprow has joined #dri-devel
cmichael has quit [Remote host closed the connection]
macslayer has joined #dri-devel
cmichael has joined #dri-devel
Company has quit [Quit: Leaving]
cmichael has quit []
mvlad has joined #dri-devel
JohnnyonFlame has joined #dri-devel
<Venemo>
Lynne: only the 3rd thingy seems wrong
<Venemo>
but I don't yet see why
nashpa has joined #dri-devel
dliviu has quit [Ping timeout: 480 seconds]
<Lynne>
in case you're wondering why you couldn't replicate on 6.0: it doesn't use multiplane images (ever)
<FL4SHK[m]>
is LLVMPipe not going to be supported in the future?
koike has joined #dri-devel
<koike>
o/ I'm trying to run dim setup, but it fails to update the rerere cache and asks me to run git branch --set-upstream-to=<remote>/<branch> rerere-cache. I already removed the rerere-cache branch and worktree to see if that would fix it, but no luck. I'm new to the dim tool, so I was wondering if anyone could give me pointers here
<koike>
(never mind, looks like the branch didn't really get removed, it seems it worked now, sorry for the noise and thanks for :rubberduck: xD )
kts has quit []
Jscob has joined #dri-devel
urja has quit [Ping timeout: 480 seconds]
<jenatali>
FL4SHK: LLVMPipe runs on the CPU, not a GPU
<jenatali>
There exist drivers that use LLVM for generating GPU code. I think it's just radeonsi at this point. But it has nothing to do with LLVMPipe
<FL4SHK[m]>
Okay
<Venemo>
Lynne: how is the buffer uploaded to the GPU?
<kisak>
Intel has a couple fingers into llvm (OpenCL?)
<FL4SHK[m]>
How difficult would it be to develop a Mesa driver for a new GPU then?
<FL4SHK[m]>
I'm assuming it'd be hard...
Jscob has quit [Read error: Connection reset by peer]
<jenatali>
LLVM is used in frontends like rusticl, yeah. I dunno how much it's used for GPU backends, especially from Mesa
<jenatali>
FL4SHK: It depends
<gfxstrand>
Baseline is "hard". It only goes up from there depending on hardware and what APIs you want to support.
<gfxstrand>
I mean, multiple highschoolers have successfully written Mesa drivers, so...
<FL4SHK[m]>
Hm
<gfxstrand>
But also I've been head down on NVK for 1.5 years and we're just now starting to play games and I'm one of the best there is.
<idr>
Totally average, every day high school students...
<gfxstrand>
From Normal High
<FL4SHK[m]>
I see
<FL4SHK[m]>
I'll keep that in mind
<gfxstrand>
What are you wanting to make a driver for?
<FL4SHK[m]>
a custom FPGA-based GPU
<FL4SHK[m]>
I have part of the instruction set written up
<FL4SHK[m]>
my goal is to have a 500 MHz workstation
rasterman has quit [Quit: Gettin' stinky!]
<FL4SHK[m]>
which should be possible with the hardware I've got
<FL4SHK[m]>
A Xilinx ZCU104
<FL4SHK[m]>
I know it's a lot of work
urja has joined #dri-devel
vliaskov has quit [Ping timeout: 480 seconds]
<Lynne>
Venemo: memory map image on RAM to a vkbuffer, then vkbuffer->vkimage copy
<Lynne>
same but in reverse for downloads
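For the upload path Lynne describes (host memory -> VkBuffer -> VkImage), a tightly packed staging buffer for a 3-plane 8-bit 4:2:2 image has three regions of different sizes, since the chroma planes are half width (rounded up) at full height. Each plane would be copied with its own VkBufferImageCopy using the matching VK_IMAGE_ASPECT_PLANE_n_BIT aspect. The helper is hypothetical and assumes no row padding.

```c
#include <stdint.h>

struct plane_layout {
	uint64_t offset;
	uint32_t width, height, row_pitch;
};

/* Hypothetical: lay out Y, Cb, Cr back to back, 1 byte per sample.
 * For 4:2:2 the chroma planes are ceil(w / 2) x h. */
static void layout_422_3plane(uint32_t w, uint32_t h,
			      struct plane_layout p[3], uint64_t *total)
{
	uint32_t cw = (w + 1) / 2;
	uint64_t off = 0;
	for (int i = 0; i < 3; i++) {
		p[i].width = i == 0 ? w : cw;
		p[i].height = h;
		p[i].row_pitch = p[i].width; /* no row padding */
		p[i].offset = off;
		off += (uint64_t)p[i].row_pitch * p[i].height;
	}
	*total = off;
}
```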
<Venemo>
Lynne: is it possible that there is a sync bug in there somehow?
<Lynne>
validation passes
<Lynne>
we do a barrier before each copy too
<Venemo>
I'm not sure if that is relevant here. by the same logic I could say radv passes the cts
<Lynne>
(it doesn't?)
<Venemo>
it does
<Venemo>
or what do you mean?
<Lynne>
nothing
<Lynne>
disabling host-mapping and falling back to a RAM->vkbuffer + vkbuffer->vkimage copy doesn't help
<Venemo>
it is very curious that only some middle part of the image is missing and the rest is correct
<Lynne>
it's always the same part too, everywhere, so it's not a sync issue, I think
<Lynne>
it seems like it could be alignment related somehow, though not sure
<Venemo>
alignment of what?
<heat>
gfxstrand, was NVK considerably harder cuz nvidia? or do you reckon it didn't matter much?
<heat>
well, doesn't, it's still ongoing work ofc :)
<Venemo>
Lynne: what is very peculiar here is that it fails the same way even if I force the code to copy the image line-by-line
<gfxstrand>
heat: It's hard because we're going straight to "can play D3D11 and D3D12 games"
<gfxstrand>
If you just want enough of OpenGL ES 2 to get a desktop up and going it's significantly easier.
<FL4SHK[m]>
I'd like to be able to run some lower end emulators
<FL4SHK[m]>
eventually
<gfxstrand>
NVIDIA hardware is quite nice, actually. That's not at all the problem.
<karolherbst>
the tldr on nvidia is, that the hardware is designed for driver developers
<soreau>
is there a gl(es) driver for nvk-capable hw too, or you mean running $compositor on zink?
<Venemo>
soreau: there is nouveau like always has been
<gfxstrand>
There's a GL driver but Zink+NVK is already starting to outpace it
<soreau>
I see
<soreau>
well, don't forget there's a forest in the trees, somewhere..
<Lynne>
Venemo: <some> alignment, after all, if the image has nicer dimensions it looks fine
<heat>
gfxstrand, i thought it was a PITA to get docs on nvidia hw though? or did you folks solve that situation already?
<Lynne>
I chose that image because it's all odd-sized
<gfxstrand>
We have headers now
<gfxstrand>
Which is a big step up
<gfxstrand>
For the ISA, some folks have access to some docs and we have the PTX docs public which are often helpful.
<gfxstrand>
But developing any GPU driver involves a certain amount of R/E anyway
sukrutb has joined #dri-devel
Omax_ has quit [Ping timeout: 480 seconds]
iive has joined #dri-devel
Omax has joined #dri-devel
<Venemo>
Lynne: does it behave better with even-sized images?
nashpa has quit []
dliviu has joined #dri-devel
<FL4SHK[m]>
if the GPU is open source surely there's no R/E involved other than reading the code
hansg has quit [Quit: Leaving]
<karolherbst>
*doubt*
<FL4SHK[m]>
Doubt for what?
<karolherbst>
GPUs are quite complex
<idr>
FL4SHK[m]: GPUs are complex. Some of the RE is, "What happens if I do these things together in a way nobody really thought about?"
<karolherbst>
and reverse engineering isn't limited to binary blobs
<FL4SHK[m]>
ah
<FL4SHK[m]>
gotcha
<karolherbst>
but yeah.. you might have to figure out how your open source GPU behaves doing certain things as it might not be obvious from the code
<karolherbst>
maybe "debugging" would be the better term here? dunno :)
<FL4SHK[m]>
right
<heat>
i've gone through the intel GPU docs a fair bit... safe to say they don't tell you all the things you need to know
<FL4SHK[m]>
In my case I will be developing both the GPU and the driver
<karolherbst>
and usually: hw trumps the spec/code in any argument
<heat>
and between the thousands of pages of docs and the i915 kernel driver... yeah i gave up on that pretty quickly :)))
lynxeye has quit [Quit: Leaving.]
<gfxstrand>
Well, reverse engineering is just debugging something you don't have the capacity to change.
<airlied>
also a gpu is a lot more than just compute execution units
<gfxstrand>
So you're just replacing debugging something you can't change with debugging something you can.
<karolherbst>
:D fair
<airlied>
those are the fun pieces, but for a useful graphics gpu, you'd probably want texture units at least, and maybe hw blending
alanc has quit [Remote host closed the connection]
alanc has joined #dri-devel
<vsyrjala>
iirc some pirate once said: "hardware docs are more of what you'd call guidelines than an actual description of how the hardware really works"
<gfxstrand>
hehe
<gfxstrand>
Pretty much
<dj-death>
when it's not outright lies
<karolherbst>
hard to tell if anybody actually lies there
<gfxstrand>
That's the best part of not having docs. There's nothing to lie to me!
<FL4SHK[m]>
<airlied> "those are the fun pieces, but..." <- not sure there's much more I'll be including in mine
<karolherbst>
but the code to design the hw doesn't have to match it, for various reasons, including bugs
<dj-death>
yeah
<dj-death>
gfxstrand: nice little feeling not to have to trust anybody ;)
<karolherbst>
at least nvidia just leaves out the parts they don't want to share :D
<vsyrjala>
i feel docs are a bit of a double edged sword sometimes. inexperienced developers tend to blindly trust what they read there, and then nothing works correctly
<daniels>
just like the average CV, they're aspirational but unreliable
<karolherbst>
or they end up debugging their own code for days just until somebody tells them the docs are wrong or come to that conclusion on their own or in the worst case give up
urja has quit [Ping timeout: 480 seconds]
Leopold has quit [Remote host closed the connection]
urja has joined #dri-devel
i509vcb has joined #dri-devel
biju has quit []
mszyprow has joined #dri-devel
Leopold has joined #dri-devel
V has quit [Ping timeout: 480 seconds]
V has joined #dri-devel
mszyprow has quit [Ping timeout: 480 seconds]
Mangix_ has joined #dri-devel
Mangix has quit [Ping timeout: 480 seconds]
gouchi has joined #dri-devel
Mangix_ has quit [Ping timeout: 480 seconds]
azerov has quit []
<Lynne>
Venemo: yup
<Lynne>
it's only the width that matters, the height can be odd
<Lynne>
correction: it happens on 1024x1024 images too
<Lynne>
corruption seems to depend on dimensions but so far seems to be random
<Lynne>
the corruption on 1024x1024 does look like an incorrect stride issue for the chroma
<karolherbst>
okay.. the u_trace stuff has a use-after-free :')
<karolherbst>
apparently util_queue_finish doesn't properly clean up the thread started in util_queue_init
<karolherbst>
mhh maybe it's more of a u_queue issue then
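The distinction that matters here: in a typical worker-queue design, "finish" only waits for queued jobs, while the worker thread lives until an explicit destroy joins it, so freeing state the thread can still touch after only a finish invites a use-after-free. A minimal pthread sketch of that split follows; it is not Mesa's u_queue code, and as far as I know u_queue has a separate util_queue_destroy() for the teardown half. All names are hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>

#define QUEUE_CAP 64
typedef void (*job_fn)(void *arg);

struct work_queue {
	pthread_mutex_t lock;
	pthread_cond_t has_work; /* job added, or stop requested */
	pthread_cond_t idle;     /* queue drained */
	job_fn fn[QUEUE_CAP];
	void *arg[QUEUE_CAP];
	unsigned head, count;
	bool busy, stop;
	pthread_t thread;
};

static void *worker(void *p)
{
	struct work_queue *q = p;
	pthread_mutex_lock(&q->lock);
	for (;;) {
		while (!q->count && !q->stop)
			pthread_cond_wait(&q->has_work, &q->lock);
		if (!q->count)
			break; /* stop requested and nothing left */
		job_fn fn = q->fn[q->head];
		void *arg = q->arg[q->head];
		q->head = (q->head + 1) % QUEUE_CAP;
		q->count--;
		q->busy = true;
		pthread_mutex_unlock(&q->lock);
		fn(arg); /* run the job without holding the lock */
		pthread_mutex_lock(&q->lock);
		q->busy = false;
		if (!q->count)
			pthread_cond_broadcast(&q->idle);
	}
	pthread_mutex_unlock(&q->lock);
	return NULL;
}

static int queue_init(struct work_queue *q)
{
	*q = (struct work_queue){0};
	pthread_mutex_init(&q->lock, NULL);
	pthread_cond_init(&q->has_work, NULL);
	pthread_cond_init(&q->idle, NULL);
	return pthread_create(&q->thread, NULL, worker, q);
}

/* Returns false if the queue is full. */
static bool queue_add(struct work_queue *q, job_fn fn, void *arg)
{
	pthread_mutex_lock(&q->lock);
	bool ok = q->count < QUEUE_CAP;
	if (ok) {
		unsigned tail = (q->head + q->count) % QUEUE_CAP;
		q->fn[tail] = fn;
		q->arg[tail] = arg;
		q->count++;
		pthread_cond_signal(&q->has_work);
	}
	pthread_mutex_unlock(&q->lock);
	return ok;
}

/* "finish": wait until every queued job has run. The thread stays alive. */
static void queue_finish(struct work_queue *q)
{
	pthread_mutex_lock(&q->lock);
	while (q->count || q->busy)
		pthread_cond_wait(&q->idle, &q->lock);
	pthread_mutex_unlock(&q->lock);
}

/* "destroy": stop and join the worker; only then free what it touches. */
static void queue_destroy(struct work_queue *q)
{
	pthread_mutex_lock(&q->lock);
	q->stop = true;
	pthread_cond_broadcast(&q->has_work);
	pthread_mutex_unlock(&q->lock);
	pthread_join(q->thread, NULL);
	pthread_cond_destroy(&q->idle);
	pthread_cond_destroy(&q->has_work);
	pthread_mutex_destroy(&q->lock);
}

/* Example job: increments an int counter. */
static void count_job(void *arg) { ++*(int *)arg; }
```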
rsalvaterra has quit []
rsalvaterra has joined #dri-devel
sima has quit [Ping timeout: 480 seconds]
tursulin has quit [Ping timeout: 480 seconds]
Dr_Who has quit [Read error: Connection reset by peer]
fab has quit [Quit: fab]
<Venemo>
Lynne: so, if you take the same image, but add 1 pixel to each side, it will work?
i-garrison has quit [Remote host closed the connection]
i-garrison has joined #dri-devel
konstantin_ has joined #dri-devel
<Venemo>
Lynne: really weird. can you give me an image to reproduce with at 1024x1024?
konstantin has quit [Ping timeout: 480 seconds]
konstantin has joined #dri-devel
mvlad has quit [Remote host closed the connection]
yann-kaelig has quit [Ping timeout: 480 seconds]
konstantin_ has quit [Ping timeout: 480 seconds]
<Lynne>
better yet, I can teach you to make your own: "ffmpeg -i test.png -vf crop=w=1359:h=1791 -y test_cropped.png"
<Lynne>
you can use any larger image as an input and crop to whatever size you need, the ffmpeg command to upload+download always does a format conversion to yuv422
gouchi has quit [Remote host closed the connection]
<Venemo>
Lynne: by any chance, have you tried if this works on amdvlk?
yann-kaelig has joined #dri-devel
<Venemo>
it probably will
<Lynne>
err, no, I haven't, it's a pain to use since it hamfistedly overrides the default driver
<Venemo>
if it's any consolation, I've found 3 other bugs while chasing this, but none of them have anything to do with your use case sadly
azerov has joined #dri-devel
Duke`` has quit [Ping timeout: 480 seconds]
azerov has quit []
azerov has joined #dri-devel
yann-kaelig has quit [Ping timeout: 480 seconds]
yann-kaelig has joined #dri-devel
<Lynne>
bad news: it works just fine on amdvlk
<Venemo>
it's not bad news, it means whatever the problem is, it's solvable
<Venemo>
Lynne: can you help me understand how the 3 planes are combined into a single image that I can see in the file?
<Lynne>
err, turns out it works because amdvlk doesn't support vulkan's multiplane 422
<Venemo>
heh
a-865 has quit [Quit: ChatZilla 0.18b1 [SeaMonkey 2.53.18/20231107000034]]
<Lynne>
yeah, so after downloading, there's a conversion step to turn it into rgb to compress it with png and output it
<Lynne>
you can get the exact data out directly via .nut, but you'll just have to launch vlc with a cli arg to start in paused mode rather than play
<Lynne>
internally, the vkformat 1000156004 is used for the images, with the vkimage being non-disjoint, regularly allocated, optimally-tiled
a-865 has joined #dri-devel
azerov has quit []
azerov has joined #dri-devel
azerov has quit []
azerov has joined #dri-devel
<Venemo>
Lynne: I think I found the solution... who the hell would have thought that the size of a plane is not the same as the image extent... the issue seems to be gone if I use vk_format_get_plane_width/height
<Venemo>
give me a few minutes to update the branch
<jenatali>
Venemo: That's what subsampling does, though
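The fix Venemo describes comes down to deriving each plane's extent from the format instead of reusing the image extent. For a 3-plane 4:2:2 format such as VK_FORMAT_G8_B8_R8_3PLANE_422_UNORM (the format with value 1000156004 that Lynne mentioned), the chroma planes are half width rounded up and full height. `plane_extent_422` is a hypothetical stand-in for Mesa's vk_format_get_plane_width/height, not their implementation.

```c
#include <stdint.h>

struct extent2d { uint32_t width, height; };

/* Divide rounding up, as subsampled plane sizes do for odd dimensions. */
static uint32_t div_round_up(uint32_t v, uint32_t d)
{
	return (v + d - 1) / d;
}

/* Plane 0 is luma at full size; planes 1 and 2 are chroma, subsampled
 * horizontally only for 4:2:2. */
static struct extent2d plane_extent_422(struct extent2d image, unsigned plane)
{
	if (plane == 0)
		return image;
	return (struct extent2d){ div_round_up(image.width, 2), image.height };
}
```

For the 1359x1791 test image, the chroma planes come out at 680x1791; a copy region of 1359x1791 on plane 1 or 2 would describe roughly twice the plane's actual width.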
<Venemo>
I learn something new and interesting every day in this job
smaeul has quit [Quit: Down for maintenance...]
apinheiro has quit [Quit: Leaving]
ykaelig has joined #dri-devel
yann-kaelig has quit [Ping timeout: 480 seconds]
yann-kaelig has joined #dri-devel
ykaelig has quit [Ping timeout: 480 seconds]
ykaelig has joined #dri-devel
yann-kaelig has quit [Ping timeout: 480 seconds]
sukrutb has quit [Ping timeout: 480 seconds]
mszyprow has quit [Ping timeout: 480 seconds]
crabbedhaloablut has quit []
mszyprow has joined #dri-devel
msizanoen[m] has quit []
<Venemo>
Lynne: I've updated the MR now, can you pls check if the issue is gone?
mszyprow has quit [Read error: Connection reset by peer]