ChanServ changed the topic of #wayland to: https://wayland.freedesktop.org | Discussion about the Wayland protocol and its implementations, plus libinput
<mclasen>
I was thinking of the xeyes one that went by recently
<nerdopolis>
Curiosity: What would be needed in Wayland or Mesa for display servers to survive the removal of the simpledrm device when the kernel replaces it with the real card?
<kennylevinsen>
many display servers already support new GPUs showing up, but at least wlroots/sway doesn't support changing the device that acts as primary renderer
<kennylevinsen>
so it would need to realize the transition is going on, throw away its current renderer and start over with the new one
<zamundaaa[m]>
nerdopolis: simpledrm -> real GPU driver shouldn't need changes in Wayland
<kennylevinsen>
yeah it's just display server logic
<zamundaaa[m]>
A real GPU driver going away, so that you can't use linux dmabuf anymore, that would be more tricky
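For context on "realizing the transition is going on": compositors typically learn about DRM devices appearing and disappearing from udev events on the drm subsystem. A minimal libudev sketch, illustrative only and not taken from any particular compositor:

```c
/* Watch for DRM devices coming and going via libudev. A real compositor
 * (wlroots, Mutter, KWin, ...) integrates this fd into its event loop. */
#include <poll.h>
#include <stdio.h>
#include <libudev.h>

int main(void)
{
    struct udev *udev = udev_new();
    struct udev_monitor *mon = udev_monitor_new_from_netlink(udev, "udev");

    udev_monitor_filter_add_match_subsystem_devtype(mon, "drm", NULL);
    udev_monitor_enable_receiving(mon);

    struct pollfd pfd = { .fd = udev_monitor_get_fd(mon), .events = POLLIN };
    for (;;) {
        if (poll(&pfd, 1, -1) <= 0)
            continue;
        struct udev_device *dev = udev_monitor_receive_device(mon);
        if (!dev)
            continue;

        const char *action = udev_device_get_action(dev); /* "add", "remove", "change" */
        const char *node = udev_device_get_devnode(dev);  /* e.g. /dev/dri/card1 */
        if (node)
            printf("%s: %s\n", action, node);
        /* On "remove" of the node backing the current renderer, the compositor
         * would tear the renderer down; on "add" it would probe the new device
         * and start over, as described above. */
        udev_device_unref(dev);
    }
}
```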
<nerdopolis>
Yeah, I guess the core protocol is fine, I should have not phrased it that way. (Unless if clients also need to change to be aware)
<kennylevinsen>
well clients might want to start using the new renderer
<kennylevinsen>
if they had previously started with a software renderer
<kennylevinsen>
whether intentionally or through llvmpipe
<nerdopolis>
That would make sense. But the ones that don't care, like say kdialog or some greeter, would be fine?
<karolherbst>
kennylevinsen: mhhh, that reminds me of how macos handles those things. There is an "I support switching over to a new renderer" opt-in flag for applications and the OS signals to applications that this is going to happen (because the GPU is going away or because of other reasons), so applications get explicitly told which device to use and when.
<karolherbst>
otherwise they all render on the discrete one (which has its own problems for other reasons)
<nerdopolis>
Does simpledrm even support dmabuf?
<Company>
switching GPUs is not really supported anywhere because it basically never happens
<Company>
so it's not worth spending time on
<Company>
same for gpu resets
<DemiMarie>
GPU resets absolutely happen
<DemiMarie>
I'm looking at AMD VAAPI here.
<DemiMarie>
And Vulkan video
<Company>
I write a Vulkan driver on AMD, I know that GPU resets do happen
<Company>
but I don't think I've ever had one in GTK's issue tracker for example
<DemiMarie>
I doubt GTK gets the bug reports
<DemiMarie>
Though I will happily write one for GTK not handling device loss.
<Company>
maybe - Mutter goes pretty crazy on GPU resets
<DemiMarie>
That should be fixed
<nerdopolis>
Switching GPUs is starting to be an issue, especially with simpledrm. It was reported against mutter first https://gitlab.gnome.org/GNOME/mutter/-/issues/2909 which was worked around with a timeout, and then sddm
<DemiMarie>
wlroots is working on it, as is Smithay
<Company>
I usually reboot when it happens
<DemiMarie>
KWin handles it already
<DemiMarie>
Company: that is horrible UX
<Company>
you're welcome to implement it, write tests to make sure it doesn't regress and fix all the issues with it
<nerdopolis>
The issue is that amdgpu takes a longer time to start up, so /dev/dri/card0 is actually simpledrm, and then the display server starts using it. Then when amdgpu (or other drivers) finishes simpledrm goes away and gets replaced with /dev/dri/card1
<DemiMarie>
Company: Or not use Mutter
<Company>
DemiMarie: yeah, if you commonly reset your gpu, that's probably not a bad idea - though I would suggest not resetting the gpu
<jadahl>
in mutter the intention is to handle it like a gpu reset, where everything graphical just starts from scratch. but the situation that simpledrm introduces is not what that is intended for, as simpledrm showing up for a short while and then getting replaced causes a broken bootup experience. so unless it can be handled kernel side, we might need to work around it by waiting a bit when we end up with simpledrm to
<jadahl>
see if anything more real shows up
<DemiMarie>
That said, it might be simpler for Mutter to crash intentionally when the GPU resets, so the user can log back in.
<jadahl>
because we don't want to start rendering with simpledrm, then switching to amdgpu
<jadahl>
(at bootup)
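A rough sketch of the workaround jadahl describes, with a hypothetical pick_gpu() helper: identify the driver behind a DRM node with libdrm's drmGetVersion() and, if all that exists is simpledrm, keep waiting up to a timeout for something more real to show up (a real implementation would react to udev events rather than sleep):

```c
/* Hypothetical "wait a bit if all we have is simpledrm" helper. */
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <xf86drm.h>

static bool node_is_simpledrm(const char *path)
{
    int fd = open(path, O_RDWR | O_CLOEXEC);
    if (fd < 0)
        return false;
    drmVersionPtr ver = drmGetVersion(fd);
    bool simple = ver && strcmp(ver->name, "simpledrm") == 0;
    drmFreeVersion(ver);
    close(fd);
    return simple;
}

/* Prefer any non-simpledrm card that shows up within timeout_ms, otherwise
 * fall back to whatever exists (likely simpledrm). */
static const char *pick_gpu(int timeout_ms)
{
    static const char *candidates[] = { "/dev/dri/card0", "/dev/dri/card1" };
    for (int waited = 0; waited <= timeout_ms; waited += 100) {
        for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++)
            if (access(candidates[i], F_OK) == 0 && !node_is_simpledrm(candidates[i]))
                return candidates[i];
        usleep(100 * 1000);
    }
    return access(candidates[0], F_OK) == 0 ? candidates[0] : NULL;
}

int main(void)
{
    const char *node = pick_gpu(8000); /* 8 s, the timeout mentioned in the discussion */
    printf("using %s\n", node ? node : "nothing");
    return 0;
}
```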
<DemiMarie>
Company: Do AMD GPUs support recovery from resets, or is it usually impossible on that hardware?
<Company>
jadahl: what do you do with all the wl_buffers you (no longer) have in Mutter? Tell every app to send a new one and wait until they send you one?
<nerdopolis>
jadahl: But what about systems that don't have driver support, and only support simpledrm? Are they going to be stuck with an 8 second timeout?
<jadahl>
Company: we don't really handle gpu resets, so now we don't do anything. in a branch we do a trick for wlshm buffers to have them redraw, but it doesn't handle switching dmabuf main device etc
<Company>
DemiMarie: I'm pretty sure it can be made to work somehow, because Windows seems to be able to do it (the OS, not the apps)
<jadahl>
nerdopolis: that is the annoying part, they'd get a slower boot experience because the kernel hasn't the slightest clue whether a gpu will ever show up after boot
<Company>
DemiMarie: also, when installing new drivers on Windows it tends to work
<karolherbst>
the one situation where changing the renderer makes sense is if you e.g. want to build your compositor in such a way that you have multiple rendering contexts per display/GPU, so you won't have to do the render on discrete GPU -> composite on integrated GPU -> scanout on discrete GPU round trip and can stay local to one GPU. So if you move a window from one GPU to another, the compositor _could_ ask the applications to switch the renderer as well to
<jadahl>
(even if it's connected already etc)
<karolherbst>
save on e.g. PCIe bandwidth, which is a significant bottleneck at higher resolutions
<Company>
jadahl: I was wondering about the dmabufs
<jadahl>
Company: the compositor would switch main device, and the clients would need to come up with new buffers
<DemiMarie>
karolherbst: how significant a bottleneck?
<karolherbst>
depends on a looot of things
<Company>
jadahl: right, so you'd potentially be left without buffers for surfaces for a (likely short) while
<karolherbst>
I was playing around with some PCIe link things in nouveau a few years back and I saw differences of over 25%
<karolherbst>
in fps numbers
<jadahl>
karolherbst: or gpu hotplug a beefy one
<karolherbst>
yeah. or that
<karolherbst>
but games usually don't support it, so they would just stick with one
<mclasen>
jadahl: you could give mutter some config to turn off the wait
<jadahl>
Company: indeed, one would need to wait for a little bit to avoid avoidable glitches
<karolherbst>
but the point is, that the round trip to the integrated GPU causes bottlenecks on the PCIe link
<karolherbst>
(but to fix this you'd probably have to rewrite almost all compositors)
<jadahl>
mclasen: how would one set that automatically ?
<mclasen>
you won't
<zamundaaa[m]>
Demi: I think Company meant to say they're writing a renderer, not a driver
<DemiMarie>
karolherbst: does it really take a full rewrite?
<mclasen>
but a user who cares about fast booting without a gpu could set it
<karolherbst>
not a full one
<karolherbst>
but like instead of having one rendering context for all displays, you need to be more dynamic
<zamundaaa[m]>
amdgpu does support recovering from GPU resets, though it's not completely 100% reliable
<DemiMarie>
zamundaaa: would you mind explaining further?
<karolherbst>
and that can cause quite significant reworks
<karolherbst>
and then decide where something should be rendered
<zamundaaa[m]>
Demi: in some situations, recovery just fails for some reason
<jadahl>
mclasen: sure. it's unfortunate this seems to be needed :(
<zamundaaa[m]>
I don't know the exact details
<DemiMarie>
I think the hard solution is the best option
<mclasen>
jadahl: yeah, after all these years, booting is still a problem :(
<DemiMarie>
At least if there is enough resources to do it.
<karolherbst>
but yeah.. the PCIe situation with eGPUs is even worse, because you usually don't have a x8/x16 connection, but x4
<Company>
random data: I lose ~10% performance by moving my GTK benchmark to the monitor that is connected to the iGPU
<Company>
which makes no sense because the screen updates only 60x per second anyway, not the 2000x that the benchmark updates itself
<Company>
but it drops from 2050fps to 1850fps
<Company>
probably overhead because the dGPU has to copy the frame to the iGPU
<karolherbst>
if rendering on the discrete GPU?
<Company>
yeah, GTK stays on the dGPU
<karolherbst>
but yeah.. if you have more load on the PCIe bus, command submission will also be slower probably
<Company>
it's just Mutter having to shuffle the buffer from the dGPU to the iGPU
<Company>
we have ~150k data per frame, at 2000fps that's 300MB/s
<karolherbst>
I've played around with changing the PCIe bus speed in nouveau when I did the reclocking work. On desktops none of that mattered much, single digit perf gains at most, but on a laptop it was absolutely brutal how much faster things went
<Company>
actually, probably more because that's just vertex buffer + texture data, not the commands
<Company>
this is on a desktop
<karolherbst>
oh I mean desktop as in single GPU
<Company>
Radeon 6500 dGPU and whatever is in the Ryzen 5700G as the iGPU
<Company>
but the speeds for reading data from the dGPU are slooooow anyway
<DemiMarie>
Even when using DMA?
<karolherbst>
it apparently also matters enough that OSes add features so that a dGPU can claim an entire display for itself, so no round trips whatsoever happen
<karolherbst>
DMA is still using PCIe
<karolherbst>
you can only push soo much data over PCIe
<DemiMarie>
Can't one usually push quite a few buffers?
<Company>
DemiMarie: I worked on dmabuf-to-cpu-memory stuff recently and it takes ~200ms to copy an 8kx8k image from the GPU
<karolherbst>
PCIe 4.0 x16 is like 32GiB/s
<Company>
note: *from* the GPU, not *to* the GPU
<karolherbst>
VRAM can be like.. 1TiB/s
<DemiMarie>
Company: 8K is way out of scope for now
<Company>
yeah, but that's still a lot less than I expected
<DemiMarie>
For my work at least
<Company>
PCIe does 32GB/s, this is more like 1GB/s
<DemiMarie>
Seems like a driver bug or hardware bug worth reporting.
<Company>
and a 30x difference in speeds is noticeable
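Back-of-the-envelope arithmetic behind those figures, assuming a 4-byte-per-pixel format (the numbers above were measured; this only checks the orders of magnitude):

```c
/* Rough arithmetic for the dmabuf readback numbers mentioned above. */
#include <stdio.h>

int main(void)
{
    double bytes = 8192.0 * 8192.0 * 4.0;    /* 8k x 8k at 4 bytes/pixel ~ 268 MB */
    double seconds = 0.2;                    /* ~200 ms observed for the copy     */
    double gb_per_s = bytes / seconds / 1e9; /* ~1.3 GB/s effective               */
    double pcie4_x16 = 32.0;                 /* ~32 GB/s theoretical peak         */

    printf("effective readback ~%.1f GB/s, ~%.0fx below PCIe 4.0 x16 peak\n",
           gb_per_s, pcie4_x16 / gb_per_s);
    return 0;
}
```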
<karolherbst>
but yeah.. if somebody wants to experiment with splitting compositing between all the GPUs/displays and make apps not have to round-trip to the iGPU, that would be a super interesting experiment to see
<DemiMarie>
I mean I want to initially round-trip via shm buffers, because it makes the validation logic so much simpler.
<Company>
karolherbst: before any of that, I need to support switching GPUs in GTK ;)
<karolherbst>
:')
<DemiMarie>
But that is because I am starting with software rendering as the baseline.
<karolherbst>
could be a new protocol where the compositor tells clients to change them, or where it simply causes the GL context to go "context_lost" or something, but yeah....
<karolherbst>
I'd really be interested in anybody investigating this area
<Company>
the linux-dmabuf tranches tell you which GPU to prefer, no?
<Company>
I mean, ideally, with a supporting compositor
<karolherbst>
I mean as in dynamically switching
<YaLTeR[m]>
cosmic-comp does the split rendering from what i understand (each GPU renders the outputs it's presenting), though i believe GPU selection happens via separate Wayland sockets where a given GPU is advertised
<karolherbst>
like maybe you disconnect your AC and the compositor forces all apps to go to the iGPU
<Company>
karolherbst: I'd expect to get new tranches
<karolherbst>
YaLTeR[m]: ohh interesting, I should check that out
<Company>
YaLTeR[m]: the problem with that is that I suspect the dGPU is still faster for that output, so just using the GPU that the monitor is connected to might not be what's best
<Company>
also: people connect their monitors to the wrong GPU all the time
<Company>
there's lots of reddit posts about that
<karolherbst>
yeah.. but the pcie round-trip overhead could be worse
<YaLTeR[m]>
it's actually not that hard to do with smithay infra (in general render with an arbitrary GPU). But it makes doing custom rendering stuff somewhat annoying
<YaLTeR[m]>
Company: a random half of the usb-C ports on my laptop connect to the igpu and the other half to the dgpu
<YaLTeR[m]>
certainly makes it convenient to test multi GPU issues :p
<karolherbst>
oh yeah.. my laptop is USB-C -> iGPU all normal connectors -> dGPU
<Company>
yay
<karolherbst>
there are apparently also laptops where you can flip it
<Company>
I only have my setup for testing
<karolherbst>
and then there are laptops which have eDP on both GPUs and you can move the internal display to the other GPU
<Company>
like what i did 10 minutes ago
<karolherbst>
at runtime
<YaLTeR[m]>
I have that too
<YaLTeR[m]>
Not at runtime tho I don't think
<karolherbst>
you can even make the transition look almost seamless if you use PSR while the transition is happening
<Company>
karolherbst: fwiw, changing GPUs in GTK would not be too hard to implement (at least with Vulkan) - but I've never seen a need for it
<karolherbst>
I know that some people were interested in getting the eDP GPU switch working
<karolherbst>
yeah.. it might not matter much for gtk apps. Maybe more for apps who also use GL/VK themselves for heavy rendering
<karolherbst>
and then AC -> move to dGPU, disconnect AC -> move to iGPU
<Company>
(there are GTK apps that do heavy rendering)
<karolherbst>
but my setup is already cursed and the dual 4K setup causes the iGPU heavy suffering
<karolherbst>
and apps just ain't at 60fps all the time
<Company>
that can easily be the app
<karolherbst>
(or gnome-shell even)
<Company>
because software rendering at 4k gets to its limits
<linkmauve>
vnd, Weston’s current version is 14, so 8 or 10 are very out of date and likely unsupported, you probably should upgrade that first. I don’t know if the current version supports plane offloading better on your SoC though.
<Company>
plus, software rendering has to fight the app for CPU time
<karolherbst>
sure, but it's hardware rendering here
<Company>
hardware rendering at 4k is fine - at least for GTK apps
<karolherbst>
also on a small intel GPU?
<Eighth_Doctor>
karolherbst: the framework's all-usb-c connections make testing USB-C to GPU pretty easy
<Company>
it should be
<karolherbst>
yeah... but it isn't always here :)
<Company>
not sure how small though
<Eighth_Doctor>
and oh my god it's so damn hard to find a good dock that works reliably
<karolherbst>
well it's not terrible
<karolherbst>
but definitely not smooth
<Eighth_Doctor>
a friend of mine and I went through 4 docks from different vendors and none of them worked as advertised because of different quirks with each one
<karolherbst>
but a lot of it is also gnome-shell, and sometimes also just GPU/CPU not clocking up quickly enough
<karolherbst>
but I suspect that's a different issue and not necessarily only perf related
<Company>
probably
<Company>
my Tigerlake at 4k gets around 600fps - so I'd expect an older GPU to halve that and a more demanding GTK app to halve that again
<nerdopolis>
I think with the simpledrm case it is somewhat harder in some ways as it's not a GPU reset, but /dev/dri/card0 just completely goes away instead
<karolherbst>
I think it's just all things together here
<Company>
if you then do full redraws on 2 monitors with it, you get close to 60fps
<karolherbst>
e.g. the gnome window overview puts the GPU at like 60% load, but is still not smooth
<Company>
I learned recently that that's usually too many flushes
<karolherbst>
but no idea what's going on there, nor did I check, nor do I think it's related to where things are rendered. Though I can imagine a laptop on AC doing it on the dGPU could speed things up
<karolherbst>
yeah.. could be
<Company>
solution: use Vulkan, there you can just never flush and have even more lag!
<karolherbst>
anyway.. I think it totally makes sense to experiment more with how this all works on dual GPU setups. The question is just how much does it actually matter
<karolherbst>
heh
<karolherbst>
like if a laptop could move entirely to the dGPU, including the desktop and all displays, it could make the experience smoother on insane setups (imagine like 8K displays)
<karolherbst>
and also driving the internal display via the dGPU
<Company>
the problem with that is that you guys screwed up the APIs so much
<karolherbst>
heh
<zamundaaa[m]>
karolherbst: I'm sometimes using an eGPU connected to a 5120x1440@120 display. Without triple buffering, the experience was *terrible*
<Company>
that app devs don't want to touch multi-gpu
<karolherbst>
right...
<Company>
Vulkan is much nicer there
<karolherbst>
yeah with GL it's a mess
<Company>
but everyone but me seems stuck on GL
<karolherbst>
that's why I was wondering if a wayland protocol could help here and the compositor signals via it what GPU to use
<karolherbst>
and the apps only get "you recreate your rendering context now and don't care about the details"
<karolherbst>
and then it magically uses the other GPU
<Company>
what people really want is the GLContext magically doing the right thing
<karolherbst>
yeah.. that's somewhat how it works on macos for like 15 years already
<karolherbst>
they get an event telling them to recreate their rendering stuff
<karolherbst>
and that's basically it
<Company>
that's somewhat complicated though, because that needs fbconfig negotiation and all that
<karolherbst>
(which also means they have to requery capabilities, and because of that and other reasons it's opt-in)
<karolherbst>
or maybe it's opt-out now, dunno
<Company>
too much work for too little benefit I think
<karolherbst>
probably
<karolherbst>
but as I said, if somebody wants to experiment with all of this and comes around with "look, this makes everything sooper dooper smooth, and games run at 20% more fps" that would certainly be a data point
<Company>
it would
<Company>
and I bet it would only work on 1 piece of hardware
<Company>
and a different laptop, probably from the same vendor, would get 20% slower with the same code
<karolherbst>
maybe
<karolherbst>
maybe it doesn't matter much at all
<Company>
I think it does matter on some setups
<karolherbst>
and then eGPUs never became a huge thing and dual GPU laptops are also icky enough that a lot of people avoid them
<Company>
because you want to go to the dgpu when not on battery but stay on the igpu on battery
<karolherbst>
yeah
<Company>
first I need to make gnome-maps use the GPU for rendering the map
<Company>
so I have something that can hammer the GPU
<Company>
then I'll look at switching between different ones
<jadahl>
karolherbst: there is already a 'main device' event that allows the compositor to communicate what gpu to use for non-scanout
<Company>
I don't think HDR conversions are enough
<karolherbst>
jadahl: that's for startup only or also dynamically at runtime?
<zamundaaa[m]>
In some configurations, especially with eGPUs, the difference can be far larger than 20%
<karolherbst>
yeah, I can imagine
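For reference, the "main device" event jadahl mentions is part of zwp_linux_dmabuf_feedback_v1, and the compositor may resend the feedback parameters at runtime, finishing each update with a done event. A client-side sketch, assuming the usual wayland-scanner-generated header name:

```c
/* Sketch of a client consuming linux-dmabuf feedback to learn which DRM
 * device the compositor prefers; names follow the protocol XML (v4+). */
#include <string.h>
#include <sys/types.h>
#include <wayland-client.h>
#include "linux-dmabuf-unstable-v1-client-protocol.h"

static void feedback_main_device(void *data,
                                 struct zwp_linux_dmabuf_feedback_v1 *feedback,
                                 struct wl_array *device)
{
    dev_t dev;
    memcpy(&dev, device->data, sizeof(dev));
    /* Resolve `dev` to a DRM node (e.g. via drmGetDeviceFromDevId) and, if it
     * differs from the device currently used for rendering, recreate the
     * EGL/Vulkan device and reallocate buffers from the new one. */
}

static void feedback_done(void *data, struct zwp_linux_dmabuf_feedback_v1 *feedback)
{
    /* All parameters of this feedback update have arrived; apply them now. */
}

/* Remaining events left as no-ops in this sketch. */
static void feedback_format_table(void *d, struct zwp_linux_dmabuf_feedback_v1 *f,
                                  int32_t fd, uint32_t size) {}
static void feedback_tranche_done(void *d, struct zwp_linux_dmabuf_feedback_v1 *f) {}
static void feedback_tranche_target_device(void *d, struct zwp_linux_dmabuf_feedback_v1 *f,
                                            struct wl_array *device) {}
static void feedback_tranche_formats(void *d, struct zwp_linux_dmabuf_feedback_v1 *f,
                                     struct wl_array *indices) {}
static void feedback_tranche_flags(void *d, struct zwp_linux_dmabuf_feedback_v1 *f,
                                   uint32_t flags) {}

const struct zwp_linux_dmabuf_feedback_v1_listener feedback_listener = {
    .done = feedback_done,
    .format_table = feedback_format_table,
    .main_device = feedback_main_device,
    .tranche_done = feedback_tranche_done,
    .tranche_target_device = feedback_tranche_target_device,
    .tranche_formats = feedback_tranche_formats,
    .tranche_flags = feedback_tranche_flags,
};

/* Usage, given a zwp_linux_dmabuf_v1 bound at version 4 or higher:
 *   struct zwp_linux_dmabuf_feedback_v1 *fb =
 *       zwp_linux_dmabuf_v1_get_default_feedback(dmabuf);
 *   zwp_linux_dmabuf_feedback_v1_add_listener(fb, &feedback_listener, NULL);
 */
```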
<nerdopolis>
Compositors might have to be changed to support the possibility of the primary GPU going away, correct?
<kennylevinsen>
there isn't a concept of a primary GPU on the system, but applications that depend on a GPU - the display server included - need to do something to handle it going away
<nerdopolis>
I'm still more thinking in the case of simpledrm going away I guess. during the transition when simpledrm goes away and the new GPU device initializes...
<Company>
usually there's 2 things involved: 1. bringing up the new GPU, and adapting to the changes in behavior (ie it may or may not support certain features) and 2. figuring out what to do with the data stored in the GPU's VRAM
<nerdopolis>
I think at least things using simpledrm should all be using software rendering, correct?
<jadahl>
nerdopolis: yes. i guess in theory one could have an accelerated render node with no display controller part, where one renders with acceleration, but displays it via simpledrm, but that is probably not a very common setup
<nerdopolis>
Probably not, simpledrm is only there if the real driver hasn't loaded yet, OR they have some obscure card that is not supported by the kernel at all. I mean I guess I never tested the bootvga driver being unsupported, with the secondary device having a valid mode setting driver...
<emersion>
it can happen
<emersion>
e.g. nouveau doesn't have support for a newer card
<emersion>
(yet)
<emersion>
in general, any case where kernel is old and hw is new
<nerdopolis>
Ah, that makes sense
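A sketch of how these cases can be told apart: enumerate DRM devices with libdrm and check which driver backs each primary node and whether a render node exists at all (simpledrm only exposes a card node). Illustrative only:

```c
/* List DRM devices, the kernel driver behind each primary node, and whether
 * a render node is available. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <xf86drm.h>

int main(void)
{
    drmDevicePtr devices[16];
    int n = drmGetDevices2(0, devices, 16);

    for (int i = 0; i < n; i++) {
        drmDevicePtr d = devices[i];
        const char *primary = (d->available_nodes & (1 << DRM_NODE_PRIMARY))
                                  ? d->nodes[DRM_NODE_PRIMARY] : NULL;
        const char *render = (d->available_nodes & (1 << DRM_NODE_RENDER))
                                  ? d->nodes[DRM_NODE_RENDER] : "(none)";

        const char *driver = "unknown"; /* e.g. "simpledrm", "amdgpu", "virtio_gpu" */
        drmVersionPtr ver = NULL;
        int fd = primary ? open(primary, O_RDWR | O_CLOEXEC) : -1;
        if (fd >= 0 && (ver = drmGetVersion(fd)))
            driver = ver->name;

        printf("%s (driver %s), render node: %s\n",
               primary ? primary : "(none)", driver, render);

        if (ver)
            drmFreeVersion(ver);
        if (fd >= 0)
            close(fd);
    }
    drmFreeDevices(devices, n);
    return 0;
}
```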
<MrCooper>
geez, you leave for a couple of hours of grocery shopping, and this channel explodes :)
<MrCooper>
Company: we get a fair number of bug reports about AMD GPU hangs against Xwayland (and probably other innocent projects), thanks to radeonsi questionably killing the process after a GPU reset if there's a non-robust GL context
<MrCooper>
Company DemiMarie: AMD GPU resets generally work fine, the issue is that most user space can't survive GPU resets yet
<Company>
I have no idea how you'd want to handle GPU resets in general
<DemiMarie>
Company: Recreate all GPU-side state.
<Company>
like, you'd need to guarantee there's no critical data on the GPU
<MrCooper>
Company: mutter should copy buffer data across PCIe only once per output frame, not per client frame
<DemiMarie>
Exactly
<MrCooper>
Company: in a nutshell, throw away the GPU context and create a new one
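To make "throw away the GPU context and create a new one" concrete, a minimal sketch using EGL_EXT_create_context_robustness plus GL_KHR_robustness; extension checks and the usual eglGetProcAddress plumbing are omitted, and this is not any particular toolkit's code:

```c
/* Create a robust GL context and poll for resets once per frame. */
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>

EGLContext create_robust_context(EGLDisplay dpy, EGLConfig cfg)
{
    const EGLint attribs[] = {
        EGL_CONTEXT_CLIENT_VERSION, 2,
        EGL_CONTEXT_OPENGL_ROBUST_ACCESS_EXT, EGL_TRUE,
        EGL_CONTEXT_OPENGL_RESET_NOTIFICATION_STRATEGY_EXT,
        EGL_LOSE_CONTEXT_ON_RESET_EXT,
        EGL_NONE,
    };
    return eglCreateContext(dpy, cfg, EGL_NO_CONTEXT, attribs);
}

/* Returns 1 if the caller must drop all GL objects, destroy the context and
 * start over with create_robust_context(); glGetGraphicsResetStatusKHR is
 * loaded via eglGetProcAddress by the caller. */
int check_for_reset(PFNGLGETGRAPHICSRESETSTATUSKHRPROC glGetGraphicsResetStatusKHR)
{
    switch (glGetGraphicsResetStatusKHR()) {
    case GL_NO_ERROR:
        return 0;
    case GL_GUILTY_CONTEXT_RESET_KHR:   /* our own work triggered the reset */
    case GL_INNOCENT_CONTEXT_RESET_KHR: /* someone else's work did          */
    case GL_UNKNOWN_CONTEXT_RESET_KHR:
    default:
        /* Everything GPU-side (textures, buffers, programs) is gone; the
         * application must recreate it from the CPU-side state it kept. */
        return 1;
    }
}
```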
<DemiMarie>
Having critical data on the GPU is a misdesign.
<Company>
is it?
<kennylevinsen>
Apps will have to rerender and submit new frames, compositor will need to rerender and have windows be black until it gets new frames...
<Company>
so, let's assume I have a drawing app - do I need to replicate the drawing on the CPU anticipating a reset?
<DemiMarie>
Company: no, you just rerender everything
<Company>
or can I do the drawing on the GPU until the user saves their document?
<Company>
I mean, for a compositor that's easy - you just tell all the apps to send you a new buffer
<Company>
because there's no critical data on the GPU
<Company>
but if you have part of the application's document on the GPU?
<DemiMarie>
Company: Don't drawing apps typically keep some state beyond the bitmap?
<MrCooper>
except there's no mechanism for that yet
<kennylevinsen>
even for an app I would expect that the state that drove rendering exists in system memory to allow a later rerender
<DemiMarie>
That would be a misdesign
<DemiMarie>
kennylevinsen: exactly
<Company>
kennylevinsen: especially with compute becoming more common, I'd expect that to not be the case
<zamundaaa[m]>
MrCooper: when a GPU reset happens, apps that are GPU accelerated will know on their own to reallocate
<zamundaaa[m]>
Company: in some cases, data could get lost, yes
<DemiMarie>
Company: then rerun the compute job
<kennylevinsen>
DemiMarie: I don't think it's appropriate to call it a misdesign per se, there could be uses where the caveat of state being lost on GPU reset is acceptable
<zamundaaa[m]>
Just like with the application or PC crashing for any other reason, apps should do regular saving / backups
<kennylevinsen>
I just don't expect that to generally be the case
<MrCooper>
zamundaaa[m]: the app needs to actively handle it, the vast majority don't
<DemiMarie>
Generally, you should preserve the inputs of what went into the computation until the output is safely in CPU memory or on disk
<kennylevinsen>
"safely in CPU memory" heh
<DemiMarie>
Which is a bug in most apps
<zamundaaa[m]>
MrCooper: yes, but in that case, requesting a new buffer is useless anyways
<DemiMarie>
kennylevinsen: you get what I mean
<MrCooper>
that's a separate issue
<DemiMarie>
So yes, GTK should be able to recover from GPU resets.
<zamundaaa[m]>
how so? If the app handles the GPU reset, it can just submit a new buffer to the compositor after recovering from one
<MrCooper>
zamundaaa[m]: if the compositor recovers from the reset after a client, it might not be able to use the last attached client buffer anymore, in which case it would need to ask for a new one
<zamundaaa[m]>
The only reason some compositors need to request new buffers from apps is that they release wl_shm buffers after uploading the content to the GPU
<zamundaaa[m]>
MrCooper: right, if the compositor wants to explicitly avoid using possibly tainted buffers
<zamundaaa[m]>
or rather, buffers with garbage content
<MrCooper>
that's not the issue, it's not being able to access the contents anymore
<zamundaaa[m]>
MrCooper: it can access the contents just fine
<MrCooper>
not sure it's really needed though, keeping around the dma-buf fds might be enough
<zamundaaa[m]>
Not the original one, if the buffer is from before the GPU reset, but it can still read from the buffers and get something as the result
<DemiMarie>
zamundaaa: is that true on all GPUs?
<DemiMarie>
I would not be surprised if that just caused another GPU fault.
<Company>
my main problem is that GTK wants to use immutable objects that do not change once created - and a GPU reset changes those objects
<Company>
so now you need a method to recover from immutable objects mutating
<Company>
which is kinda like wl_buffer
<DemiMarie>
Company: Can you make each object transparently recreate its own GPU state?
<Company>
which is suddenly no longer immutable either because the GPU just decided it's bad now
<DemiMarie>
Or recreate everything from the API objects?
<Company>
DemiMarie: not if it's a GL texture object
<Company>
and no idea about dmabuf texture objects
<DemiMarie>
Company: those are not immutable
<Company>
and even if I could recreate them, they'd suddenly have new sync points
<DemiMarie>
anything that is on the GPU is mutable
<kennylevinsen>
the world would be much nicer if GPUs didn't reset :)
<Company>
DemiMarie: those are immutable in GTK per API contract - just like wl_buffers
<DemiMarie>
Company: that seems like an API bug then
<llyyr>
you dont need to deal with gpu resets if you don't do hardware acceleration
<DemiMarie>
Apps need to recreate GPU buffers if needed
<psykose>
you don't need to deal with software if you don't have hardware yea
<DemiMarie>
kennylevinsen: I believe Intel GPUs come close. They guarantee that a fault will not affect non-faulting contexts unless there are kernel driver or firmware bugs.
<zamundaaa[m]>
Demi: I *think* that it's a guarantee drivers with robust memory access have to make
<Company>
DemiMarie: that's the question - you can decide that things are mutable, but then suddenly everything becomes mutable and you have a huge amount of code to write
<DemiMarie>
Company: that seems like the price of hardware acceleration to me
<kennylevinsen>
DemiMarie: I imagine the cause of resets is generally such bugs, so not sure how helpful that guarantee is
<kennylevinsen>
but amdgpu does have above-average reset occurrence
<Company>
DemiMarie: same thing about mmap() - the kernel could just mutate your memory and send you a signal so you need to recreate it - why not?
<DemiMarie>
kennylevinsen: On many GPUs userspace bugs can bring down the whole GPU.
<DemiMarie>
Company: because CPUs provide proper software fault containment
<Company>
DemiMarie: I don't think that's a useful design though - I think a useful design is one where the kernel doesn't randomly fuck with memory
<Company>
DemiMarie: so make it happen on the GPU
<DemiMarie>
Company: Complain to the driver writers and hardware vendors, not me.
<Company>
I am
<DemiMarie>
Via which channels.
<DemiMarie>
?
<Company>
but I think it's fine if I just write my code assuming those things can't happen and wait for hardware to fix their stuff
<kennylevinsen>
Company: currently, resets happen Quite Often™ on consumer hardware
<Company>
instead of designing an overly complex API working around that misdesign
<kennylevinsen>
so you probably have to expect them for now
<DemiMarie>
I can say that on some GPUs, you may be able to get that guarantee at a performance penalty, because no more than one context will be able to use the GPU at a time.
<Company>
kennylevinsen: not really - people complain way more about other things
<DemiMarie>
kennylevinsen: how often do you see them on Intel?
<Company>
there's also this tendency of the lower layer developers to just punt all their errors to the higher layers and then blame those devs for not handling them
<Company>
which is also not helpful
<kennylevinsen>
there is indeed an issue with hardware issues getting pushed to software, but we tend to get stuck dealing with the hardware we got as our users have it
<Company>
"the application should just handle it" is a very good excuse
<Company>
my favorite example of that is still malloc()
<kennylevinsen>
DemiMarie: I'm not a hardware reliability database - anecdotally, I have only seen a few i915 resets, but have had periods on amdgpu where opening chrome or vscode would cause a reset within 10-30 minutes, which was painful before sway handled resets
<DemiMarie>
Company: GPU hardware makes fault containment much harder than CPU hardware does.
<DemiMarie>
kennylevinsen: that makes sense
<Company>
on Intel, a DEVICE_LOST because of my Vulkan skills doesn't reset the whole GPU
<Company>
on AMD, a DEVICE_LOST makes me reboot
<DemiMarie>
Company: that's what I expect
<kennylevinsen>
here, DEVICE_LOST just causes apps that don't handle context resets to exit
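The Vulkan-side equivalent is watching for VK_ERROR_DEVICE_LOST from submit/present calls and rebuilding the logical device; a minimal sketch, not GTK's or any compositor's actual code:

```c
/* React to VK_ERROR_DEVICE_LOST by recreating the logical device. All
 * device-level objects (pipelines, images, buffers, swapchains) must be
 * recreated afterwards from CPU-side state. */
#include <vulkan/vulkan.h>

int submit_frame(VkQueue queue, const VkSubmitInfo *submit, VkFence fence)
{
    VkResult res = vkQueueSubmit(queue, 1, submit, fence);
    if (res == VK_ERROR_DEVICE_LOST)
        return -1; /* caller must run the recovery path below */
    return res == VK_SUCCESS ? 0 : -1;
}

VkResult recover_device(VkPhysicalDevice phys, VkDevice *device,
                        const VkDeviceCreateInfo *create_info)
{
    /* The old VkDevice and everything created from it is unusable now. */
    vkDestroyDevice(*device, NULL);
    *device = VK_NULL_HANDLE;

    /* Recreating on the same VkPhysicalDevice is usually enough after a reset;
     * if the physical device itself went away (hot-unplug), the instance would
     * have to re-enumerate physical devices first. */
    return vkCreateDevice(phys, create_info, NULL, device);
}
```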
<zamundaaa[m]>
kennylevinsen: about "amdgpu does have above-average reset occurrence", not so fun fact: amdgpu GPU resets are currently the third most common crash reason we get reported for plasmashell
<kennylevinsen>
but to amd's credit, resets have reduced and they also appear to resort to context loss less often?
<kennylevinsen>
zamundaaa[m]: dang
<Company>
dunno, I write code that doesn't lose devices
<Company>
I don't want to reboot ;)
<zamundaaa[m]>
kennylevinsen: in my experience, GPU resets are at least recovered from correctly lately
<zamundaaa[m]>
While plasmashell may crash, KWin recovers, and some other apps do as well
<DemiMarie>
zamundaaa: what are the first two?
<kennylevinsen>
zamundaaa[m]: it was a huge user experience improvement when sway grew support for handling context loss
<zamundaaa[m]>
Xwayland's the bigger problem
<zamundaaa[m]>
Demi: something Neon specific, and something X11 specific
<kennylevinsen>
hmm yeah, losing xwayland is more jarring even if relaunched
<DemiMarie>
Company: in theory, I agree that GPUs should be more robust. In practice, GTK should deal with device loss if it doesn't want bad UX.
<zamundaaa[m]>
kennylevinsen: it's worse. Sometimes KWin hangs in some xcb function when Xwayland kicks the bucket
<DemiMarie>
zamundaaa: Neon?
<kennylevinsen>
oof
<Company>
DemiMarie: I think it's not important enough for me to care about - and I expect it to get less important over time
<zamundaaa[m]>
Demi: KDE Neon had too old Pipewire, which had some bug or something. I don't know the whole story, but it should be solved as users migrate to the next update
<Company>
DemiMarie: but if someone wants to write patches improving things - go ahead
<Company>
same thing with malloc() btw - Gnome still aborts on malloc failure just like it did 20 years ago. I'm sure that could be improved but nobody has bothered yet
<DemiMarie>
zamundaaa: AMD is working on process isolation, which will hopefully make things better, but it will be off by default unless distros decide otherwise.
<Company>
AMD turns everything off that may make fps go down
<nerdopolis>
I feel like the case where the main driver is slow to load, the login manager's greeter display server starts on simpledrm, and then gets stuck in limbo when the kernel kicks it out, is starting to be more common too
<nerdopolis>
kwin handles it the best because of kwin_wayland_wrapper (but only Qt applications so far)
<nerdopolis>
other display servers like Weston hang when I boot with modprobe.blacklist=virtio_gpu to have them start on simpledrm and then modprobe virtio-gpu
<kennylevinsen>
amdgpu being as slow to load as it is should also really be fixed...
<jadahl>
kennylevinsen: I'd also like a generic "i'm gonna start trying to load a gfx driver now" signal so one could make the userspace wait conditional on that
<jadahl>
but it seems non-trivial to make such a thing possible
<MrCooper>
Company: malloc never fails, for a very good approximation of "never"
<DemiMarie>
kennylevinsen: what makes it so slow?
<kennylevinsen>
you'd have to ask amdgpu devs that question
<kennylevinsen>
firmware loading perhaps?
<Company>
MrCooper: that took a while though - 25 years ago when glib started aborting, malloc() did fail way more
<MrCooper>
DemiMarie: one big issue ATM is that the amdgpu kernel module is humongous, so just loading it and applying code relocations takes a long time
<MrCooper>
Company: I'll have to take your word for it, can't remember ever seeing it fail in the 25 years I've been using Linux
<MrCooper>
of course one can make it fail by disabling overcommit, that likely results in a very bad experience though
<DemiMarie>
MrCooper: why is the kernel module so large? Is it because of the huge amount of copy-and-pasted code between versions?
<Company>
MrCooper: when I worked on GStreamer in the early 2000s, I saw that happen sometimes
<Company>
also because multimedia back then took lots of memory
<MrCooper>
DemiMarie: mostly because it supports a huge variety of HW
<Company>
*lots of memory relative to system memory
<DemiMarie>
Also I wonder if in some cases the initramfs will only have simpledrm, with the hardware-specific drivers only available once the root filesystem loads.
<nerdopolis>
I think there is one distro that actually does that with the initramfs, I could be wrong
<nerdopolis>
DemiMarie: And I think it does it if the volume is not encrypted or something, so it's not all installs. I think it's ubuntu, but don't quote me on that