ChanServ changed the topic of #dri-devel to: <ajax> nothing involved with X should ever be unable to find a bar
<graphitemaster> I'm curious, what's the rationale for mesa_glthread being enabled per-application? Are threaded optimizations still non-stable and potentially application-breaking? The reason I ask is that glBufferSubData performance is consistently better on NV (proprietary) because of its threaded optimizations (__GL_THREADED_OPTIMIZATIONS defaults to 1). It's such a known fact in graphics circles that NV has fast sub-buffer uploads that
<graphitemaster> application developers (myself included) go out of their way to write vendor-specific paths for uploads, but it appears that mesa_glthread=true gets there too (but is not the default)
<graphitemaster> Tons of applications only use persistently mapped buffers on AMD and Intel via Mesa because it tends to be faster than glBufferSubData. That seems like low-hanging fruit: if you can just implement one in terms of the other, why not? At least it's comparable to NV performance in my tests - alternatively, threaded optimizations bridge that gap too.
<graphitemaster> There's a whole blog entry on buffer mapping patterns on the website here which more or less sniffs out the patterns games use, but I feel like there's some missing information here because this doesn't consider the implicit double-buffering that, well, double-buffered vsync affords. I know the NV driver does not issue draws immediately, this is deferred until the frame n+1 swap buffer
<graphitemaster> call, so it has a whole frame window to do the upload, which is moved onto the background thread. The way I read this in mesa is that the updates happen in lockstep with the frame.
<graphitemaster> I wonder, does GLTHREAD do the same thing then, move it to a background thread?
<alyssa> graphitemaster: I don't touch that code but AFAIK the short answer is "mesa_glthread helps games that don't use threading optimizations themselves, but hurts games that are well-written and do", so it's an allowlist of apps that are known to benefit rather than be hurt, performance-wise
<graphitemaster> alyssa, Even still, I wouldn't expect MapBuffer with COHERENT_BIT and memcpy (instead of glBufferSubData) to be faster, and yet it's consistently faster than the current glBufferSubData in mesa in my tests. May as well just implement glBufferSubData that way then.
<graphitemaster> *MapBufferRange
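The persistent-and-coherent streaming pattern being compared against glBufferSubData can be sketched as a simple bump allocator over the mapping. The GL calls are shown only in comments (the standard ARB_buffer_storage path); the mapping itself is stood in for by plain memory so the logic is self-contained, and all names here are illustrative, not Mesa or engine code:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* In real GL the backing pointer would come from:
 *   glBufferStorage(GL_ARRAY_BUFFER, size, NULL,
 *       GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
 *   base = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
 *       GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
 */
typedef struct {
    uint8_t *base;   /* stand-in for the persistent mapping */
    size_t   size;   /* total buffer size */
    size_t   head;   /* next free byte */
} stream_buffer;

/* Returns the byte offset to hand to the draw setup, or (size_t)-1 when
 * the buffer is full and the caller must fence-wait and wrap. */
size_t stream_upload(stream_buffer *sb, const void *data, size_t len)
{
    if (sb->head + len > sb->size)
        return (size_t)-1;                 /* out of space: wait + wrap */
    size_t offset = sb->head;
    memcpy(sb->base + offset, data, len);  /* coherent: no explicit flush */
    sb->head += len;
    return offset;
}
```

With COHERENT_BIT set, the memcpy is the entire per-upload cost on the CPU side; synchronization against in-flight draws is the application's job (fences), which is exactly the work glBufferSubData has to do internally.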
<alyssa> 🤷
<glennk> what hardware is this on graphitemaster?
<graphitemaster> glennk, My testing hardware is a rig with AMD RX 530, A laptop with Iris Pro Graphics P580, and my desktop with RTX 3070, every machine running latest Arch Linux with mesa-21.1.4-1 (though the desktop rig I can switch between nouveau + mesa and the proprietary drivers for testing with a glvnd and dlopen hack in my engine)
<graphitemaster> glBufferSubData is worse in mesa on all three machines and hardware configurations than MapBufferRange with PERSISTENT and COHERENT bits set.
<imirkin> glBufferSubData has to wait for that buffer to stop being used
<graphitemaster> proprietary NV GL's glBufferSubData outperforms all by a solid 60%
<graphitemaster> And this is without any fancy double buffering or offset within the buffer tricks.
<imirkin> they must buffer the data i guess?
<imirkin> instead of waiting
<graphitemaster> mesa_glthread=true runs better in my tests too.
<graphitemaster> But still nowhere near NV speeds.
<zmike> file a mesa ticket with a test case would be my recommendation
<zmike> drawoverhead has a similar case for this, so you might try modifying that to better represent what you're seeing
<graphitemaster> I know a few things [don't ask] that NV does for data buffering. I know they internally double-buffer the updates with respect to the swap buffer call that does vsync, and I know they lift those uploads off the main thread onto a background thread within their driver (setting __GL_THREADED_OPTIMIZATIONS=0 reduces performance of the upload). And this one is not within the ability of mesa, but NV has stream upload compression on 3000
<graphitemaster> series GPUs where they range-encode uploads with an adaptive PPM entropy encoder that removes a lot of 0s in the bitstream; hardware runs a compute shader that decodes and expands that into memory on chip.
<graphitemaster> That last part no one is doing but the proprietary driver so it's muddying my benches for sure. :|
<graphitemaster> Anyways that drawoverhead.c test seems to just be respecifying all the contents every draw, not actually testing glBufferSubData to existing allocated storage that you just replace.
<graphitemaster> Or ping/pong of the buffers on top of that, which is what most games/engines do, at least the ones I've looked at or worked on.
<zmike> yes, it's for buffer replacement profiling, not what you're describing
<zmike> hence why I said you could try modifying it
<graphitemaster> Ah, okay, my bad, sorry for misunderstanding :)
<zmike> np :)
<glennk> graphitemaster, that 3070, is it running pcie 3 or 4?
<graphitemaster> glennk, pcie 3 x16
<glennk> afaik radeon 530 is pcie 3 x8
<graphitemaster> Don't think that would affect upload performance for 30 MiB a frame worth of data :P
<graphitemaster> Which is 1m point sprite vertices (each vertex 32 bytes in size)
<graphitemaster> Which is my bench
<glennk> hmm, so you are replacing all the contents with a single call to subdata?
<graphitemaster> Yeah. The code looks more like: allocate a 16 KiB buffer initially with glBufferData (nullptr for initial contents), store that size, and then if the update fits, SubBuffer replace; if it doesn't, in a loop, golden-ratio resize the size, then glBufferData again to make a new backing storage for that
<graphitemaster> This is how our engine works for streaming buffers, the actual frontend double buffers on top of this as well.
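The grow-then-orphan policy just described can be sketched as pure logic. The helper name and the initial-capacity constant are hypothetical, not the engine's actual code; the caller would follow a growth with glBufferData(target, capacity, NULL, usage) to get fresh storage, and otherwise glBufferSubData into the existing store:

```c
#include <stddef.h>

/* Engine starts streaming buffers at 16 KiB (per the description above). */
#define STREAM_INITIAL_CAPACITY (16 * 1024)

/* Grow capacity by roughly the golden ratio (~1.618x) until the pending
 * update fits. Integer math: new = old + old * 0.618. */
size_t grow_capacity(size_t capacity, size_t needed)
{
    while (capacity < needed)
        capacity = capacity + (capacity * 618) / 1000;
    return capacity;
}
```

Golden-ratio growth trades a few more reallocations than doubling for tighter peak memory use, which matters when the frontend double-buffers these allocations on top.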
<glennk> and what are the usage flags for bufferdata?
<graphitemaster> GL_DYNAMIC_DRAW
<glennk> as an experiment, what happens if you use STREAM_DRAW on the radeon?
<graphitemaster> glennk, No performance difference between GL_STATIC_DRAW, GL_STREAM_DRAW, and GL_DYNAMIC_DRAW
<graphitemaster> What is interesting though is that there's a real performance difference on the AMD system with a compatibility GL context and a core profile (3.3) one
<graphitemaster> About 8-12% or so.
<dv_> core profile being the faster one?
<dv_> I dimly remember a weird case where some old desktop GPU actually ran faster with the compatibility context for unknown reasons
<graphitemaster> Yeah, on NV compat profiles tend to be faster in my tests. In this case it's core profile that is faster with AMD on mesa.
<glennk> graphitemaster, can you pastebin lspci -vvv for the amd card?
<glennk> on second thought, also for the nv card
<graphitemaster> Well that's weird, my lspci on the NV rig is spitting out pcilib: sysfs_read_vpd: read failed: Input/output error
<graphitemaster> NV is not going to be of much help for you though since it's not mesa, proprietary driver rn
<graphitemaster> But the other modules are there, I swap to nouveau when I need to test, I can do that if you want
<graphitemaster> Then rerun it, maybe the output is different.
<graphitemaster> I'm really concerned about that pcilib sysfs_read_vpd error though, it seems to happen randomly when I run lspci
<robclark> danvet: just for you, drm scheduler conversion and bonus drm_gem_object_put_locked() removal..
<graphitemaster> Each time it happens dmesg spits out a nice "[4036884.941404] atlantic 0000:42:00.0: invalid short VPD tag 00 at offset 1"
* graphitemaster hopes his LSI controller isn't going
<danvet> robclark, oh nice
<glennk> graphitemaster, some bogus entry in pci rom for the card
<glennk> graphitemaster, so the nv card isn't mapping all of vram, just the usual 256MB window, just wanted to verify that
<graphitemaster> I like the bar indices, 0, 1, 3, where did bar 2 go :(
<graphitemaster> region 0, 1, 3 too.
<glennk> random guess the audio subdevice uses 2
<graphitemaster> Hard to pastebin from the other machine since it's not on the internet but the output is ASUSTeK Computer Inc. Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] and it has two memories (one at c00000000, the other at d00000000) 64-bit prefetchable and the first one is 256 MiB, the other is 2 MiB, ditto for the bars below
<graphitemaster> The Intel machine, first 64-bit, non-prefetchable [size=16M], and the second one, 64-bit prefetchable, [size=256M] ... at de000000 and b0000000 respectively.
<graphitemaster> Oh the AMD also has another 256K memory (32-bit, non-prefetchable) but for some reason it printed after the i/o port line and expansion rom so I missed it.
<graphitemaster> So it appears like all the machines are just mapping 256 MiB of vram.
<glennk> yeah without rebar thats the maximum
<glennk> so attempting to answer your question about mesa_glthread, it basically just marshals the top GL dispatch layer to separate app and driver as much as possible
<graphitemaster> So it doesn't afford any additional pipelining of the upload then, it just relieves the GL thread of some work, and they still operate in lock-step frame-wise?
<glennk> what happens for BufferSubData is a bit driver dependent
<glennk> there's a generic codepath which just does a memcpy of the data in the marshaled stream (or mallocs the buffer if it's large enough)
<glennk> this path basically lets the app continue without waiting on the hardware, unless the marshal command buffer is full in which case it waits
<graphitemaster> I guess the main concern I have is if mesa encounters a draw call which sources said buffer, will it then stall waiting for the upload or is it smart enough to defer the draw call and maybe not stall at all since the contents upload in time for the actual immediate draw?
<graphitemaster> Like where does it wait if any for the upload (if it has to), is it waiting at the draw (client side), at swap buffers call (client side), or at the draw (server side)
<graphitemaster> And if you remove that data dependence with e.g. double buffering, can it shift that stall down the pipeline until the actual draw?
<glennk> so on specifically radeonsi, there's another driver thread which actually performs the kernel calls which talk to the hardware
<glennk> which is where the cpu side waits on the hardware happen
<glennk> so subdata with mesa_glthread on radeonsi, it basically creates a staging buffer equivalent to BufferData(WRITE_ONLY, CLIENT_STORAGE|WRITE) then keeps it mapped unsynchronized, and the marshaling memcpy's into that
<graphitemaster> When does the staging buffer hit the GPU though.
<graphitemaster> I mean that sounds like the approach a persistently mapped buffer in GL would more or less end up as too.
<glennk> so a copy from system memory staging to vram on radeon is done with CP DMA, so its whenever the hardware command stream gets to that
<glennk> which the more threads you add into the mix, the longer between the api call and the hardware processing it
<glennk> which leads me to ask: what is your app doing in between the subdata call and any draw call using that data?
<graphitemaster> Rendering a whole other frame :P
<graphitemaster> Our engine doesn't actually ever source the contents of an upload or an update to a resource within the same frame, it's always n+1
<graphitemaster> So I update a vertex buffer (as an example), and it won't be until next frame that this vertex buffer will be sourced for a draw call.
<graphitemaster> And our engine does all its work basically at the end of a frame too, since it has to target multiple APIs, there's no "work" done in between GL draw calls, it's just a blast of GL commands one after the other followed by a swap buffers
<graphitemaster> So from the driver's perspective it just gets hit with say 800 GL function calls all immediately at once and then swap.
<graphitemaster> Which probably doesn't give it much time to do anything :P
<glennk> is the buffer object itself used by other draw calls the same frame? ie object dependency not dependency on content subrange
<graphitemaster> No, different buffer in this case, ping/pong the GLuints, though there are some places in the engine where I do offsets within a buffer because it would be too much memory otherwise.
<graphitemaster> I don't think that makes much of a difference to be honest.
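The ping/pong scheme described here, writing into one buffer name while the previous frame's draws source the other, can be sketched as below. The struct and function names are illustrative; in real code the two names would come from glGenBuffers(2, ids):

```c
/* Two GL buffer names alternated by frame parity: frame N writes
 * ids[N & 1] while draws source the buffer written in frame N-1,
 * so no draw ever depends on this frame's upload. */
typedef struct {
    unsigned ids[2];   /* would be filled by glGenBuffers(2, ids) */
    int      frame;    /* current frame number */
} pingpong;

unsigned pp_write_target(const pingpong *pp) { return pp->ids[pp->frame & 1]; }
unsigned pp_draw_source(const pingpong *pp)  { return pp->ids[(pp->frame & 1) ^ 1]; }
void     pp_next_frame(pingpong *pp)         { pp->frame++; }
```

This is exactly the data-dependence removal discussed above: the driver never sees a subdata call and a draw touching the same buffer object in one frame.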
<graphitemaster> The engine also goes out of its way to only ever update buffers at offsets that are 16-byte aligned and with sizes that are a multiple of 16 too
<graphitemaster> Since that appears to make a big difference on NV proprietary on Windows and Linux.
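The 16-byte rounding mentioned here is simple mask arithmetic. A sketch with hypothetical helper names, assuming the bytes adjacent to the requested range already hold valid data (so widening the range re-uploads a few valid bytes rather than garbage):

```c
#include <stddef.h>

size_t align_down16(size_t x) { return x & ~(size_t)15; }
size_t align_up16(size_t x)   { return (x + 15) & ~(size_t)15; }

/* Widen [offset, offset+size) so both ends land on 16-byte boundaries;
 * the caller then passes the widened range to glBufferSubData. */
void widen_range16(size_t *offset, size_t *size)
{
    size_t end = align_up16(*offset + *size);
    *offset = align_down16(*offset);
    *size = end - *offset;
}
```

As glennk notes just below, matching the CPU cache line size (commonly 64 bytes) may be the more relevant alignment target; the same masks work with 63 in place of 15.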
<glennk> for alignment i would probably say match your cpu cache line size
<glennk> anything mapping vram directly on gcn/navi will have a base address that is 256 byte aligned
<glennk> not the issue here, but as a side note
<graphitemaster> I mean I don't think there's an issue here other than glBufferSubData is slower than persistently mapped buffer with COHERENT for same size uploads involving a memcpy.
<graphitemaster> I would expect persistently mapped buffers to be faster if you avoided a memcpy and produced directly into it even because saving a memcpy is saving work, but this is basically the same amount of work.
<graphitemaster> And it's not the case on NV at all where the roles are swapped.
<graphitemaster> So I just find it more fascinating than anything.
<graphitemaster> I remember having to optimize the upload path for streaming years ago for different hardware and drivers, I just would've thought this is sorted out by now :P
<FLHerne> graphitemaster: fwiw, anholt_ wrote the buffermapping page you mention fairly recently, there's a few comments on the MR here
<glennk> graphitemaster, coherent persistent buffers do the driver work at map time, then its the hardware snooping the updates + your app code synchronizing
<glennk> with subdata every call needs to check hey is this buffer in flight? if not, okay, map it and memcpy, otherwise dump this data in staging buffer and emit a blit
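The in-flight check glennk describes can be modeled as pure logic. Everything here is an illustrative stand-in, not radeonsi's actual structures or control flow:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    unsigned char *storage;   /* stand-in for the BO's CPU mapping */
    bool           busy;      /* stand-in for "referenced by in-flight CS" */
} model_bo;

typedef enum { UPLOAD_DIRECT, UPLOAD_STAGED_BLIT } upload_path;

/* If the buffer is idle, map + memcpy + unmap directly into place;
 * otherwise copy into a staging buffer now and let the GPU blit it
 * into the destination when the command stream reaches that point. */
upload_path subdata(model_bo *bo, size_t offset, const void *data, size_t len,
                    unsigned char *staging)
{
    if (!bo->busy) {
        memcpy(bo->storage + offset, data, len);
        return UPLOAD_DIRECT;
    }
    memcpy(staging, data, len);
    return UPLOAD_STAGED_BLIT;    /* caller records the blit */
}
```

The staged path is why subdata can't be "just a memcpy": the call must return while earlier draws still read the old contents, so the data has to land somewhere the GPU can pick it up later.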
<graphitemaster> I would expect `glBufferData` does the mapping, and `glBufferSubData` just does the same thing though
<graphitemaster> Why would SubData have to do anything other than a memcpy is what I want to know
<graphitemaster> I mean sure all the validation and what not, lets just ignore that for a moment
<glennk> there's not an option to ignore that for a conformant GL driver
<graphitemaster> I mean in the case of this discussion :P
<graphitemaster> It's not possible for a SubData update to be larger than the original backing allocation, requiring a reallocation, is it?
<graphitemaster> "size must define a range lying entirely within the buffer object's data store."
<glennk> consider an app calling subdata, draw, then subdata on an overlapping range, then draw
<graphitemaster> Right, I understand there's serialization that needs to occur in that case to prevent sourcing the content in a draw while it's being written to, and the same is true with coherent mapped buffers, but if you're not causing this (e.g. double buffering, or offsets within the buffer avoiding overlapping ranges) then surely the driver can be clever enough to see this and substitute a fast path
<graphitemaster> Anyways talk is cheap, I should probably spend my next weekend poking around mesa to see how things work.
<glennk> yeah its not rocket science when the source code is available
<graphitemaster> Oh I fully expect it to be, I've been led to believe drivers are magic for too long XD
<graphitemaster> I think I found my answer for radeon anyways
<graphitemaster> map + memcpy + unmap
<glennk> i think the path your app is hitting for radeonsi is discard old resource, map staging, memcpy, unmap, then blit to new
<graphitemaster> So for shits and giggles, suppose `radeonBufferData` did `radeon_bo_map` like it does when data != NULL, but just did not unmap it; radeonBufferSubData could then use it directly, but it looks like it needs to store it in obj->Mappings like the persistent mapping does, then I suppose you'd have to unmap it when referenced by a draw. I'm just spitballing ideas at nighttime without actually profiling or anything
<glennk> you are looking at the wrong driver :p
<graphitemaster> Oh
<glennk> thats the one for pre-shader radeons
<glennk> src/gallium/drivers/radeonsi
<graphitemaster> I assume si_buffer_subdata
<graphitemaster> Same thing though, it maps with si_buffer_transfer_map, does the memcpy, and unmaps with si_buffer_transfer_unmap
<glennk> the staging bits get decided in si_buffer_transfer_map
<graphitemaster> Right so my main question is, is there a way for this si_buffer_transfer_map, which also does the si_buffer_map once it works out all the usage bits to use for that mapping, to stay mapped as a pointer in the driver so when a si_buffer_subdata is called, it skips doing the map at all and just reuses that pointer?
<graphitemaster> Like I know there's a ton of what ifs here about syncronization and stuff, I just kind of want to know if it's theoretically possible
<graphitemaster> Basically to transparently manage a persistent mapping behind the scenes for subbuffer updates.
<glennk> well if you follow the rabbit hole into radeon_drm_bo.c
<graphitemaster> Not sure what referenced_by_cs is (command stream?), it then issues what looks like an immediate flush
<graphitemaster> When mapping for write
<graphitemaster> I do see a wait in there too, infinite one.
<graphitemaster> But yeah looks like eventually it calls radeon_bo_do_map which returns the existing mapping
<graphitemaster> Though that's after also acquiring a mutex
<graphitemaster> There's a lot of overhead to get a mapping reuse
<graphitemaster> And it still appears to flush in either case.
<glennk> i think your case should hit PIPE_MAP_UNSYNCHRONIZED
<graphitemaster> Humm, yeah and si_buffer_transfer_unmap doesn't actually unmap the buffer, it just signals si_buffer_do_flush_region by the looks of it, which then hits si_copy_buffer, and then that does the real copy with si_cp_dma_copy_buffer, the staging buffer stays persistently mapped.
<graphitemaster> So then that's a bit of a dead end.
<graphitemaster> Sorry for keeping you engaged on this goose chase. You've been incredibly patient and kind. I'm going to do a more proper deep dive next weekend I think on the actual AMD rig and see, maybe I'll whip together a proper testcase too you can carry upstream in that perf directory.
<graphitemaster> I'm just really fascinated with why this is the case and how I can bridge the gap here performance wise so no one has to keep writing different streaming upload code for different rigs and systems.
<graphitemaster> It's just too ridiculous to me, UE4 has 12 different streaming upload paths for OpenGL, 12.
<glennk> whats optimal for one bit of hardware is rarely so for another
<graphitemaster> Sure, and it's always been my opinion that the dumbest, most basic glBufferData+SubData path in the driver should try as hard as it can to be as fast as possible for any given hardware/driver.
<graphitemaster> Since that's always been the case at least with GL performance on NV in my experience.
<glennk> btw the compression thing on nv i think is only enabled when pcie link width is < 8x, ie thunderbolt
<graphitemaster> Seems kind of silly it wouldn't use it for streaming contents if it reduces memory bandwidth which is the main problem with streaming.
<graphitemaster> I have so much compute time left over I'd gladly trade all of it for like a 50% reduction in memory bandwidth
<glennk> a pcie 3 16x link does ~15GB/s, not a lot of cpu compressors that can output at that speed