ChanServ changed the topic of #dri-devel to: <ajax> nothing involved with X should ever be unable to find a bar
dongwonk has joined #dri-devel
jhli has quit [Read error: Connection reset by peer]
<graphitemaster>
Oh my god the PCI-E bus is SLOW.
<HdkR>
Compared to memory, yes
<graphitemaster>
Mobile systems map system memory as uncached write-combining memory for the framebuffer and it's damn fast (9000 MiB/s!)
<graphitemaster>
Run the same code on a desktop: 20 MiB/s
<graphitemaster>
I'm so salty
<graphitemaster>
I didn't think it was that slow
<HdkR>
Small accesses over PCIe really fuck up your day
<graphitemaster>
I can round trip download a larger bit of data from Australia over my shitty internet connection over like 100 hops faster than I can shove some data to my GPU
<graphitemaster>
If you can barely shove more than 20 MiB/s over PCI-E why even bother with resizable BAR, you can't fill that 128 MiB of mapped BAR fast enough.
<graphitemaster>
This is real sad
<bnieuwenhuizen>
graphitemaster: reads only
<bnieuwenhuizen>
writes are much faster
<graphitemaster>
So if uncached framebuffer reads are so fast on mobile, why even have shadow framebuffer layer in DDX drivers
<bnieuwenhuizen>
also note that even uncached system memory is an issue and that doesn't go over PCIe for CPU accesses
<graphitemaster>
Been tweaking this crap like crazy
<HdkR>
I wish x86 devices were better at saturating memory on a single core. ARM devices are /really/ good at it
<graphitemaster>
I found the right combination of blood sacrifices to almost ready 20 GiB/s on my DDR4 Ryzen (ThreadRipper) machine. Go run it on a system with the exact same memory but a different CPU (newer Zen Ryzen) and it's slower by more than half.
<graphitemaster>
s/ready/reach/
<graphitemaster>
It's absolutely insane how fickle memory optimizations are on x86
<HdkR>
Meanwhile Apple M1 has a peak of 68.25GB/s and a basic memcpy without optimizing can hit 90% of that
<graphitemaster>
That's absolutely incredible.
<HdkR>
It really is
<graphitemaster>
The dumb aarch64 I'm testing can reach 30 GiB/s with a dumb `while (src != end) *dst++ = *src++;`
<bnieuwenhuizen>
any idea what the bottleneck typically is?
<bnieuwenhuizen>
load-store queues or is the memory model the main cause?
<HdkR>
graphitemaster: Unroll that about 4x and free speed boost :P
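[Sketch of the unrolling suggestion above, for illustration only. The function names `copy_naive` and `copy_unrolled4` and the choice of 64-bit elements are assumptions, not code from the discussion. Whether the unrolled form is actually faster depends on the compiler and CPU, since optimizing compilers often unroll or vectorize the simple loop on their own, so measure before relying on it.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The simple loop quoted in the discussion, wrapped in a function. */
void copy_naive(uint64_t *dst, const uint64_t *src, size_t count)
{
    const uint64_t *end = src + count;
    while (src != end)
        *dst++ = *src++;
}

/* The same copy with the body manually unrolled 4x. */
void copy_unrolled4(uint64_t *dst, const uint64_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);

    size_t i = 0;
    /* Main loop: four independent loads and stores per iteration. */
    for (; i + 4 <= count; i += 4) {
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    /* Tail: up to three leftover elements. */
    for (; i < count; i++)
        dst[i] = src[i];
}
```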
<HdkR>
I don't think TSO is the cause for x86 being slow here. Since if you throw more cores at the problem then it scales pretty well
<graphitemaster>
The cache system of x86 is basically designed to make all memory slow and all thread scaling non-linear
<bnieuwenhuizen>
well, given that the streaming memcpy is significantly faster for VRAM reads I expect TSO is an issue for uncached reads
<graphitemaster>
I mean it's just cache pollution basically.
<graphitemaster>
Anything larger than L2 should probably _always_ use non-temporal / streaming copies.
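[Rough sketch of the "streaming copy above some size" idea, assuming an x86 build with SSE2. `copy_maybe_streaming` and its `stream_threshold` parameter are hypothetical names; the right threshold (roughly the L2 size, per the comment above) has to be queried or tuned per CPU. On builds without SSE2 the sketch just falls back to `memcpy`.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Copy `bytes` bytes. At or above `stream_threshold`, and only if dst is
 * 16-byte aligned, use non-temporal stores so the destination does not
 * displace useful cache lines. Otherwise use plain memcpy.
 */
void copy_maybe_streaming(void *dst, const void *src, size_t bytes,
                          size_t stream_threshold)
{
    assert(dst != NULL && src != NULL);

#if defined(__SSE2__)
    if (bytes >= stream_threshold && (uintptr_t)dst % 16 == 0) {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        size_t vecs = bytes / 16;

        /* Unaligned load from src, non-temporal (aligned) store to dst. */
        for (size_t i = 0; i < vecs; i++)
            _mm_stream_si128(d + i, _mm_loadu_si128(s + i));

        /* Order the streaming stores before any later stores. */
        _mm_sfence();

        /* Copy the tail that does not fill a whole 16-byte vector. */
        size_t done = vecs * 16;
        memcpy((char *)dst + done, (const char *)src + done, bytes - done);
        return;
    }
#else
    (void)stream_threshold;
#endif
    memcpy(dst, src, bytes);
}
```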
<HdkR>
graphitemaster: Did you test these optimizations on a system that reports `Fast REP MOVSB`? Like Icelake+?
<graphitemaster>
Not yet, no. I don't own any Intels anymore
<graphitemaster>
Other than an old iGPU laptop.
<HdkR>
That was about half my system's memory bandwidth on my Ice Lake system
<graphitemaster>
Funny how correlated it is for single core memory performance
<HdkR>
hah, fun
<graphitemaster>
It's honestly impressive how bad Ryzen x86 is at copying and filling memory, especially considering the ridiculous memory clock of high end DDR4.
<graphitemaster>
We're being robbed.
<graphitemaster>
I want the bandwidth of GPUs XD
<graphitemaster>
The L2 and L3 caches on these things are so massive you could just put the whole working set of an application in them at this point unless your application is a web browser.
<graphitemaster>
Then just have really high latency memory like GPUs
<HdkR>
Apparently Zen3 has fast rep movsb. Which I've not tried :P
<HdkR>
Probably means Zen3 entirely is just better in this regard
<HdkR>
We just need to switch over to ARM and leave x86 behind :)
<graphitemaster>
I have a better idea, what if we just put an x86 instruction decoder on an ARM
<graphitemaster>
I know they have different memory and cache systems but it's probably possible with some tweaking.
<graphitemaster>
The hardware people will figure it out, they work magic.
<graphitemaster>
I have no clue how :P
<HdkR>
oof
YuGiOhJCJ has joined #dri-devel
<alyssa>
I like arm
alyssa has left #dri-devel [#dri-devel]
<HdkR>
Me too
jkrzyszt has quit [Ping timeout: 480 seconds]
pcercuei has quit [Quit: dodo]
boistordu has quit [Ping timeout: 480 seconds]
<graphitemaster>
I like aarch64.
<graphitemaster>
Absolutely hate all the ARM before that
<graphitemaster>
neon, vfp, the fifteen abis, the thousands of configurations and extensions, the thumb / non-thumb mode
<graphitemaster>
big.LITTLE, the fact cores can have different endianness
<HdkR>
AArch64 is its best form, the rest were just the larvae stage
<graphitemaster>
The fact endianness can be changed at runtime >_>
<HdkR>
Oh hey, there is going to be a period of big.little where some cores support AArch32, some don't
<HdkR>
So have fun with that
<graphitemaster>
I remember a buddy told me of the hardest bug he ever had to track down on a big.LITTLE was one where the cores had different cache-line sizes
<HdkR>
yep. Exynos fucked that up for everyone
<graphitemaster>
I was already upset at the mess that was development for the WiiU: a PPC console with a semi-programmable ARM9 CPU for the hand-held tablet, and the fact that endianness was different on both, cache-line size was different on both, and Nintendo in their infinite dumb wisdom used ILP32 on the console (long long being 64-bit), and ran the handheld in THUMB (16-bit mode) which treats everything but pointers as 16-bit >_>
<graphitemaster>
Very few devs did anything with that, just let the dumb thing stream video and go home.
<graphitemaster>
I swear when companies used to build an ARM they must've had something like kconfig where one dev spent five hours tweaking options for their custom built SoC.
<graphitemaster>
Glad to see some standard fabric from aarch64.
reductum has joined #dri-devel
Company has quit [Quit: Leaving]
camus has joined #dri-devel
heat has quit [Ping timeout: 480 seconds]
mbrost has quit [Ping timeout: 480 seconds]
mbrost has joined #dri-devel
yogesh_mohan has joined #dri-devel
Duke`` has joined #dri-devel
mattrope has quit [Read error: Connection reset by peer]
vivijim has quit [Read error: Connection reset by peer]
dliviu has quit [Ping timeout: 480 seconds]
dongwonk has quit [Remote host closed the connection]
dliviu has joined #dri-devel
mlankhorst has quit [Ping timeout: 480 seconds]
rasterman has quit [Quit: Gettin' stinky!]
gouchi has joined #dri-devel
hch12907 has joined #dri-devel
JohnnyonFlame has quit [Ping timeout: 480 seconds]
thellstrom has joined #dri-devel
pekkari has joined #dri-devel
pcercuei has joined #dri-devel
jkrzyszt has joined #dri-devel
pekkari has quit [Quit: Konversation terminated!]
mlankhorst has joined #dri-devel
thellstrom has quit [Remote host closed the connection]
thellstrom has joined #dri-devel
tzimmermann has quit [Quit: Leaving]
hch12907_ has joined #dri-devel
hch12907 has quit [Ping timeout: 480 seconds]
flacks has quit [Quit: Quitter]
flacks has joined #dri-devel
Company has joined #dri-devel
iive has joined #dri-devel
YuGiOhJCJ has quit [Quit: YuGiOhJCJ]
<emersion>
nice, compiling mesa now hangs my machine when compiling aco_print_ir.cpp
<HdkR>
Nice, running out of ram?
<emersion>
not sure, wasn't able to ssh in
<emersion>
but that sounds likely, works okay (even if very slow) from a clean state
tobiasjakobi has joined #dri-devel
<haasn>
Is there any way to find out what the "correct" drm fourcc code for a particular opengl image format is? For example, GL_RG8 maps to DRM_FORMAT_GR88 (not DRM_FORMAT_RG88) and I have no idea how I would have known that without looking at the driver source code
<bnieuwenhuizen>
haasn: driver code is sadly the only way I know of ...
<bnieuwenhuizen>
most of the diff seems to be GL/Vulkan specifying things in reversed order of fourcc
<bnieuwenhuizen>
(e.g. mostly in memory order, while fourcc mostly seems to assume little endian or so)
<haasn>
I suppose I just have to eglExportDMABUFImageQueryMESA() and see what fourcc it gives me..
<haasn>
I guess this whole mess is why vulkan decided to just abandon the concept of fourccs in its public API
<haasn>
I didn't realize GL formats had a specified memory order
<haasn>
wait that sentence makes no sense, of course they have a specified memory order, otherwise how would you upload them
<haasn>
anyway I suppose that makes sense, basically because DRM doesn't follow the usual convention of specifying things in memory order for non-packed formats, we need to use inverted format names for non-packed gl formats
<emersion>
and check endianness for packed formats
<haasn>
(fortunately my use case doesn't have the concept of packed formats)
<haasn>
(nor does OpenGL, afaict)
<haasn>
(or if it does, I don't support any on OpenGL yet)
<emersion>
opengl does, if you don't use UNSIGNED_BYTE
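[Small sketch of why GL_RG8 lines up with DRM_FORMAT_GR88, for illustration. The `FOURCC` macro mirrors the construction of `fourcc_code()` in drm_fourcc.h, and the GR88 layout used here is the one documented there ("[15:0] G:R 8:8 little endian"). The helper names `store_gr88`, `store_gl_rg8` and `check_gr88_matches_gl_rg8` are made up for this example.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same construction as fourcc_code() in drm_fourcc.h. */
#define FOURCC(a, b, c, d) \
    ((uint32_t)(a) | ((uint32_t)(b) << 8) | \
     ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

#define FMT_GR88 FOURCC('G', 'R', '8', '8') /* value of DRM_FORMAT_GR88 */
#define FMT_RG88 FOURCC('R', 'G', '8', '8') /* value of DRM_FORMAT_RG88 */

/*
 * DRM names describe a little-endian word from the most significant bits
 * down: GR88 is a 16-bit word with G in bits 15:8 and R in bits 7:0.
 */
void store_gr88(uint8_t out[2], uint8_t r, uint8_t g)
{
    uint16_t word = (uint16_t)(((uint16_t)g << 8) | r);
    out[0] = (uint8_t)(word & 0xff); /* little endian: low byte first */
    out[1] = (uint8_t)(word >> 8);
}

/* GL_RG8 with unsigned byte components is in memory order: R, then G. */
void store_gl_rg8(uint8_t out[2], uint8_t r, uint8_t g)
{
    out[0] = r;
    out[1] = g;
}

/* Both layouts put R at byte 0 and G at byte 1. */
void check_gr88_matches_gl_rg8(void)
{
    uint8_t drm[2], gl[2];
    store_gr88(drm, 0x11, 0x22);
    store_gl_rg8(gl, 0x11, 0x22);
    assert(memcmp(drm, gl, sizeof drm) == 0);
}
```

[For packed GL types the component order inside the word depends on host endianness, as noted above, so this byte-level argument only covers the byte-per-component case.]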
<glennk>
hmm, glext.h doesn't declare glDrawArraysInstancedARB for GL_ARB_instanced_arrays, which leads to piglit's dispatch code not resolving the function
<emersion>
does it declare the PROC?
<glennk>
no, it just declares the one for glVertexAttribDivisorARB
<emersion>
rip
<glennk>
it's a function that is aliased in at least 5 different extensions/core/es versions...
<emersion>
time to send a patch to khronos then i guess
Surkow|laptop has quit [Remote host closed the connection]
Surkow|laptop has joined #dri-devel
<karolherbst>
ehhh.... so maybe I misunderstood how interrupting ioctl works, but if I signal a process with SIGKILL and it's busy with e.g. calling dma_resv_wait_timeout with intr = true isn't the expectation that the ioctl would be canceled asap?
mbrost has joined #dri-devel
<ishitatsuyuki>
kernel threads/operations don't always get interrupted, although it sounds strange that a polling operation doesn't get interrupted
<karolherbst>
yeah.. my thoughts exactly
heat has joined #dri-devel
mbrost_ has joined #dri-devel
mbrost has quit [Read error: Connection reset by peer]
sravn has quit []
tobiasjakobi has quit [Remote host closed the connection]
gouchi has quit [Remote host closed the connection]
gouchi has joined #dri-devel
<karolherbst>
is there a way to debu dma_reservation objects?
<karolherbst>
*debug
<karolherbst>
I want to know who reserved one without unreserving it
<karolherbst>
or well..
<karolherbst>
just figuring out where it was reserved would help
slattann has joined #dri-devel
V has quit [Ping timeout: 480 seconds]
gpoo has quit [Ping timeout: 480 seconds]
mbrost_ has quit [Ping timeout: 480 seconds]
gpoo has joined #dri-devel
mlankhorst has quit [Remote host closed the connection]
mlankhorst has joined #dri-devel
i-garrison has quit []
Surkow|laptop has quit [Ping timeout: 480 seconds]
i-garrison has joined #dri-devel
Surkow|laptop has joined #dri-devel
jkrzyszt has quit [Ping timeout: 480 seconds]
JohnnyonFlame has joined #dri-devel
V has joined #dri-devel
V has quit [Ping timeout: 480 seconds]
tobiasjakobi has joined #dri-devel
slattann has quit []
gouchi has quit [Remote host closed the connection]