cmarcelo has quit [Remote host closed the connection]
rpigott has quit [Read error: Connection reset by peer]
pitust has quit [Read error: Connection reset by peer]
ella-0 has quit [Remote host closed the connection]
sumoon has quit [Remote host closed the connection]
rosefromthedead has quit [Remote host closed the connection]
kuruczgy has quit [Remote host closed the connection]
kennylevinsen has quit [Remote host closed the connection]
ifreund has quit [Remote host closed the connection]
mainiomano has quit [Remote host closed the connection]
kchibisov has quit [Remote host closed the connection]
cmarcelo has joined #dri-devel
kennylevinsen has joined #dri-devel
kuruczgy has joined #dri-devel
mainiomano has joined #dri-devel
ella-0 has joined #dri-devel
rosefromthedead has joined #dri-devel
sumoon has joined #dri-devel
kchibisov has joined #dri-devel
ifreund has joined #dri-devel
rpigott has joined #dri-devel
pitust has joined #dri-devel
tursulin has joined #dri-devel
tanty has quit [Ping timeout: 480 seconds]
<pq>
tzimmermann, what do you think of using fbdev UAPI to drive keyboard RGB leds? :-p
<tzimmermann>
pq, wat? go away!
<pq>
lol
<tzimmermann>
wasn't the discussion about auxdisplay?
<pq>
yeah, I saw fbdev code in the auxdisplay driver mentioned.
* ccr
nukes RGB leds from the orbit
<pq>
is there another UAPI for auxdisplay, too?
<pq>
I couldn't tell if cfag12864b.c had any UAPI in it, but cfag12864bfb.c seems to use fbdev things? Are they parts of the same driver, or two separate drivers for the same thing?
<tzimmermann>
auxdisplay is "all the rest" that didn't fit anywhere else AFAICT
<pq>
just wondering and stirring the pot, no big deal for me :-)
<tzimmermann>
i've just glanced over that discussion. OMG
<tzimmermann>
please let us not treat keyboard leds like regular displays
<pq>
:-D
<pq>
btw. kernel docs say: "The cfag12864bfb describes a framebuffer device (/dev/fbX)."
<tzimmermann>
fbdev and drm should be reserved for display that show the user's console or desktop
jkrzyszt has joined #dri-devel
<tzimmermann>
but not some status information or blinky features
<tzimmermann>
pq, indeed. some of the auxdisplay HW seems to be some kind of led device. so there's an fbdev device for it. whether that makes sense is questionable
tanty has joined #dri-devel
<tzimmermann>
i think jani made a good point about handling these leds in the input subsys
lemonzest has quit [Quit: WeeChat 4.2.1]
bolson has quit [Remote host closed the connection]
lemonzest has joined #dri-devel
bmodem has quit [Ping timeout: 480 seconds]
shankaru has quit [Remote host closed the connection]
ninjaaaaa has quit [Read error: Connection reset by peer]
simondnnsn has quit [Read error: Connection reset by peer]
simondnnsn has joined #dri-devel
ninjaaaaa has joined #dri-devel
Leopold has quit [Ping timeout: 480 seconds]
Leopold has joined #dri-devel
apinheiro has joined #dri-devel
bmodem has joined #dri-devel
simondnnsn has quit [Ping timeout: 480 seconds]
surajkandpal has quit [Ping timeout: 480 seconds]
simondnnsn has joined #dri-devel
flynnjiang has quit [Ping timeout: 480 seconds]
rasterman has joined #dri-devel
kts has joined #dri-devel
Leopold has quit [Remote host closed the connection]
Leopold has joined #dri-devel
Leopold has quit [Remote host closed the connection]
Leopold_ has joined #dri-devel
bmodem has quit [Ping timeout: 480 seconds]
kts has quit [Remote host closed the connection]
DodoGTA has quit [Remote host closed the connection]
DodoGTA has joined #dri-devel
Calandracas has quit [Remote host closed the connection]
kts has joined #dri-devel
Calandracas has joined #dri-devel
kts_ has joined #dri-devel
kts_ has quit [Remote host closed the connection]
fireburn has quit []
fireburn has joined #dri-devel
kts has quit [Ping timeout: 480 seconds]
YuGiOhJCJ has quit [Remote host closed the connection]
YuGiOhJCJ has joined #dri-devel
aravind has quit [Ping timeout: 480 seconds]
yyds has joined #dri-devel
linusw has joined #dri-devel
kts has joined #dri-devel
kts has quit [Remote host closed the connection]
kts has joined #dri-devel
YuGiOhJCJ has quit [Quit: YuGiOhJCJ]
tanty has quit [Quit: Ciao!]
tanty has joined #dri-devel
DodoGTA has quit [Quit: DodoGTA]
heat has joined #dri-devel
DodoGTA has joined #dri-devel
Jeremy_Rand_Talos_ has joined #dri-devel
DodoGTA has quit [Quit: DodoGTA]
DodoGTA has joined #dri-devel
Jeremy_Rand_Talos has quit [Ping timeout: 480 seconds]
<mareko>
karolherbst: if it's useful, radeonsi could do SVM where CPU pointer == GPU pointer
<mareko>
karolherbst: we can implement pipe_screen::resource_from_user_memory to do that by default with the amdgpu kernel driver, or based on a flag
Dr_Who has joined #dri-devel
rgallaispou has left #dri-devel [#dri-devel]
ninjaaaaa has quit [Ping timeout: 480 seconds]
simondnnsn has quit [Ping timeout: 480 seconds]
<karolherbst>
mareko: yeah.. that's how I plan to implement non-system SVM
kzd has joined #dri-devel
<karolherbst>
I have a prototype based on iris, but it blew up the application's VM
<karolherbst>
kinda need to find some time and properly think it all through
KetilJohnsen has quit [Ping timeout: 480 seconds]
<karolherbst>
the biggest issue is just how to synchronize the VMs on both sides properly
<karolherbst>
like.. if the driver allocates a bo, which could be used for global memory, it probably also needs to `mmap` at the same location on the CPU side. Or like mmap on the CPU first and then just place the bo at the same location on the GPU side
<karolherbst>
and my plan was to add a "SVM" flag to pipe_resource flags so the driver knows it's a SVM thing or so
<karolherbst>
sadly, the story for discrete GPUs is way more complex, because you obviously don't want to operate on host memory that's just mapped on both sides, and I didn't even get to the point where I'd do memory migration
Calandracas has quit [Remote host closed the connection]
<DemiMarie>
robclark: if kernel submission does not protect the GPU or its firmware in any way, then userspace submission is an improvement!
Haaninjo has joined #dri-devel
<robclark>
no, I think more likely it is just a false sense of security, tbh
jrelvas has joined #dri-devel
jrelvas has quit [Remote host closed the connection]
<DemiMarie>
robclark: I see! Do GPU vendors generally do a decent job at writing firmware?
bolson has joined #dri-devel
<DemiMarie>
robclark: how hard will it be to proxy the doorbells? It is strictly unsafe to pass MMIO to a VM under Intel unless the MMIO behaves like memory, in that reads return the just-read value and both reads and writes complete in bounded time.
<robclark>
well, there is a pretty wide range of what can be called firmware, ranging from things that have some sort of RTOS to things that are somewhat more limited
<robclark>
but on-gpu escapes are much rarer than the more mundane UAF-type bugs
macromorgan_ has joined #dri-devel
macromorgan_ has quit [Remote host closed the connection]
Calandracas has joined #dri-devel
<mareko>
karolherbst: what I meant is that we can assign any GPU address to any buffer if the address range is unused, and the whole address range used by the CPU is always unused because our GPU allocations choose addresses that CPU allocations wouldn't use, and that's for SVM. For resource_from_user_memory, we can use the CPU pointer as the requested GPU address for the buffer, which is the most trivial case.
<mareko>
resource_create is more involved because you would have to pass the desired GPU address to it.
<robclark>
hmm, I'm not entirely familiar w/ the doorbell issue.. I would have expected it to work because that is basically how sr-iov works (although maybe not on past/current devices, idk)
<karolherbst>
mareko: like.. if I'd allocate a pipe_resource and would map it, the mapped address would also need to be the same as seen on the GPU
<karolherbst>
but
<karolherbst>
if the driver can promise that addresses of GPU BOs are either reserved or won't be able to be used by CPU allocators, that might be good enough. There is then the question of how synchronization would work between the host and the GPU
macromorgan has quit [Ping timeout: 480 seconds]
<mareko>
like I said, our GPU BOs use addresses that CPU allocators wouldn't use
<karolherbst>
but it also highly depends if we are talking about system SVM or not here. For non system SVM the allocations are explicit. For system SVM any CPU pointer needs to be valid also for the GPU
<karolherbst>
like.. wouldn't or won't use?
<mareko>
won't
<karolherbst>
who is managing the VM for radeonsi btw? Is that the kernel or is it done in userspace?
macromorgan has joined #dri-devel
<mareko>
our GPU VM design is that all CPU addresses that the process can use are currently never used by GPU allocations
<mareko>
so the kernel could mirror the whole process address space
<mareko>
into the GPU
<karolherbst>
okay
<karolherbst>
yeah, that sounds more like it's designed to implement system SVM things :)
<mareko>
it seems, but no
<mareko>
amdkfd does that mirroring, while amdgpu requires explicit VM map calls
<karolherbst>
so I guess host allocations still need to be "imported" via userptrs or something
<karolherbst>
mareko: I think I just have two questions then: 1. if I map a pipe_resource, can the mapped pointer be the GPU address valid for the CPU? and 2. Could I allocate a `pipe_resource` in a way that it's placed at a given address?
<karolherbst>
like..
<karolherbst>
given address as "on the CPU side there is an allocation I want to mirror inside VRAM"
<karolherbst>
there is a USM extension which allows for very explicit placement and migration, so I might want to have the ability to move memory between system RAM and VRAM, but placed at the same address on both sides
<mareko>
it's about page table mirroring, VRAM or GTT placement doesn't matter
<mareko>
2 is trivial, you can choose the GPU address for any created pipe_resource
<mareko>
and any imported pipe_resource
<karolherbst>
okay, and it can also be an address which already exists on the CPU's side?
<mareko>
yes
<karolherbst>
okay, yeah, that should be good enough then
<mareko>
I don't know about 1 since we assign GPU addresses that the CPU process wouldn't use, so I don't know if mmap can even use them
<karolherbst>
1 is a hard requirement by CL sadly. I could do the reverse way: allocate on the host and then userptr import it, but... how would I get the memory to be migrated into VRAM?
<mareko>
you wouldn't
<mareko>
1 is only dependent on mmap being usable, not on the driver
<karolherbst>
yeah :) that's the problem and why I'd like to allocate a pipe_resource instead and just make sure the address to which it gets mapped is the same. `mmap` does allow you to specify where you want to map something though
<karolherbst>
but it's not guaranteed to succeed afaik
<karolherbst>
but that's something I could play around with
<mareko>
if you use a normal BO and you access it with a CPU and the BO is in invisible VRAM, it will cause a CPU page fault and the kernel will migrate it to GTT and keep it there
<karolherbst>
could I force the migration without having to access it? Or would I have to touch every page? Or just one page?
<karolherbst>
like.. if I can read at offset 0x0 and it would migrate the entire allocation that's good enough
<karolherbst>
though I don't really need that, as memory migration is just a hint on the API level
<karolherbst>
explicit migration I mean
<mareko>
the migration is forced by touching the page with a CPU, and it migrates the whole buffer
<karolherbst>
okay
<mareko>
recent CPUs and BIOSes allow all VRAM to be visible
<karolherbst>
so the only thing to figure out would be the mapping thing then. But yeah, a driver guaranteeing that BOs won't overlap with CPU allocations is indeed a big help
<karolherbst>
or rather, with mappings in general
<MrCooper>
note that any CPU reads from VRAM will throw you off a performance cliff
<karolherbst>
yeah, that's why I want to be able to have allocations on both sides at the same address
<karolherbst>
so I can do explicit migrations
tzimmermann has quit [Quit: Leaving]
anujp has joined #dri-devel
yyds has quit [Remote host closed the connection]
junaid has joined #dri-devel
junaid has quit [Remote host closed the connection]
<CounterPillow>
didn't know Phoronix even banned people
<CounterPillow>
>What would need to happen would be for the media player to be able to ask the compositor if it can just hand it the raw YUV video data. If the compositor supports that and uses display planes to handle it, then the media player can just share the YUV images rather than RGB, cutting out the GFX work in the media player.
<CounterPillow>
mpv already has a VO for this (dmabuf_wayland)
<karolherbst>
impressive
<CounterPillow>
it's a fairly low amount of code last I checked
<CounterPillow>
obviously, it will never be the default, because both compositors and hardware are too spotty with their implementations, and it likely uses a lower quality scaler than mpv's current defaults, and also iirc it currently requires hwdec which is also unlikely to be turned on by default judging by how often AMD manages to find ways to break it
<MrCooper>
CounterPillow: "banned" in quotes, he's been posting as "avis" ever since the birdie account was banned, doesn't even try to hide it's him, but nothing happens
<CounterPillow>
heh
<karolherbst>
anyway.. if people are under the impression that users are wasting time or are in other ways disrespectful, we can certainly discuss this, but I haven't really seen much from birdie on gitlab besides maybe wasting time or making some out-of-place remarks
<MrCooper>
even so, a Phoronix "ban" is some kind of achievement I guess
<MrCooper>
karolherbst: yeah I was joking, let's hope he's not just getting warmed up though
<karolherbst>
nah
<CounterPillow>
yeah it's probably better not to shittalk people here even if they deserve it
<karolherbst>
birdie is being active on the gitlab for years now, so I guess it's fine
<karolherbst>
well.. "active"
<MrCooper>
CounterPillow: if he doesn't want to get called out for what he's posting on the Phoronix forums, he can always stop, we'd all be better off for it
<CounterPillow>
Personally I simply do not read places with a high frequency of bad posts
<Lynne>
9a00a360ad8bf0e32d41a8d4b4610833d137bb59 causes segfaults on wayland
<Lynne>
is pelloux here to discuss? I'd rather not send a revert MR
tursulin has quit [Ping timeout: 480 seconds]
<MrCooper>
pepp: ^
<pepp>
Lynne: annoying. Do you have more details?
<Lynne>
mpv and firefox crash instantly, in libvulkan_radeon
<Lynne>
running on sway with the vulkan backend
<Lynne>
wlroots generally causes programs to do a swapchain rebuild twice in a quick succession on init, maybe it's related to this?
ity has quit [Remote host closed the connection]
ity has joined #dri-devel
<pepp>
Lynne: I guess it's missing a "if (chain->wsi_wl_surface)" check
<DemiMarie>
Regarding SVM: SVM can be perfectly compatible with virtio-GPU because while the guest userspace program doesn’t make any explicit requests to make memory accessible to the GPU, the guest kernel can make these requests to the host.
<DemiMarie>
CounterPillow: is hardware decoding unreliable under desktop Linux?
<CounterPillow>
yes
<DemiMarie>
CounterPillow: why?
<CounterPillow>
bugs
<DemiMarie>
in what?
<CounterPillow>
the driver
<Lynne>
pepp: can confirm that fixes it
<DemiMarie>
Is this because of distributions shipping old versions of Mesa?
jkrzyszt has quit [Ping timeout: 480 seconds]
<CounterPillow>
no
<CounterPillow>
new bugs are added all the time
<DemiMarie>
What makes it more reliable under Windows/macOS/ChromeOS/Android/etc?
<pepp>
Lynne: thx, I'll open a MR soon
<CounterPillow>
never said it was more reliable there, but for Android/macOS/ChromeOS it's definitely more engineering resources invested
<DemiMarie>
I thought ChromeOS just used upstream drivers.
<CounterPillow>
Not always true, and most importantly they usually do not ship AMD hardware as far as I know
<DemiMarie>
Are Intel’s drivers more reliable?
<CounterPillow>
I don't know since I don't use Intel, but they sure seem to be judging by the number of mpv bugs opened concerning vaapi misbehaving
<mareko>
karolherbst: radeonsi can change BO placement, but if there is not enough memory, it's only a hint
Dark-Show has joined #dri-devel
<mattst88>
CounterPillow: we have AMD Chromebooks nowadays, FYI
<mattst88>
they're using the video encode/decode drivers in Mesa
<CounterPillow>
Boy I sure do hope they pre-validate all input then because you can get 100% repeatable GPU resets by feeding AMD's VCN corrupt H.264 streams
Company has quit [Read error: Connection reset by peer]
<mattst88>
I don't work on the AMD stuff directly, but from what I've heard the video driver stability is not great
<mattst88>
e.g. we split out radv into a separate package that can be updated independently of the radeonsi driver (which is great) and the video driver (which is not great, AFAIK)
<robclark>
_eventually_ we will have some gitlab ci for amd video.. IIRC it is still blocked on some deqp-runner MR
<CounterPillow>
That'd be great, especially if it tests all still relevant generations of VCN (one of the more frustrating parts is reporting a bug and then being told that the AMD engineers don't have that hardware to reproduce it on)
<robclark>
it would be re-using the existing gitlab ci farms... I know we've sent some amd chromebooks to the collabora farm, but I doubt it is exhaustive.
<CounterPillow>
:(
<robclark>
someone sufficiently motivated and w/ enough hw is ofc welcome to host their own ci farms to expand the hw coverage
<CounterPillow>
I am planning on setting up a lava lab eventually but it does feel a bit silly that AMD's driver team does not have access to AMD's hardware
<robclark>
🤷
davispuh has joined #dri-devel
<DemiMarie>
CounterPillow: is it mostly old hardware?
<robclark>
it would ofc be nice if hw vendors ran or sponsored ci farms.. but it can quickly turn into a large project depending on how far back in # of gens you go
<CounterPillow>
In my case, AMD Picasso isn't *that* old, and no, mpv has just had a bug filed caused by a 7900 XT's hardware decoder which is the current gen
<CounterPillow>
I've seen the "sorry we don't have that hardware" response for issues reported on 6xxx series cards, i.e. previous gen
<DemiMarie>
Oh dear
<DemiMarie>
Seems like they only support the most recent hardware generation.
<CounterPillow>
it's not a matter of policy, I don't think
<robclark>
abhinav__: I think daniels is migrating drm-misc to gitlab so that shell accounts will no longer be needed
<abhinav__>
robclark yes, that's why I wanted to check whether the old process of applying for committer access still holds true or what the new method would be .... as only existing committers will be migrated to gitlab, not new ones
<robclark>
I think just the last step changes, to gitlab permissions instead of account creation.. hmm, and I guess you just need to configure your ssh pub key in gitlab. Otherwise the process should be the same
<abhinav__>
robclark got it, Yes I have already uploaded my pub keys to my gitlab account ...
ity has quit [Ping timeout: 480 seconds]
<daniels>
yeah, it would just be gitlab permissions so you don't need to fill out most of that form
<daniels>
robclark: afaik we only have stoney
<abhinav__>
daniels got it, so approvals will still happen on that form i assume though?
<daniels>
mripard wanted to try out the 'access request' button, but don't worry, we'll give you access :) and it should be moved early next week
kts has quit [Ping timeout: 480 seconds]
ity has joined #dri-devel
<abhinav__>
daniels thanks :)
heat has quit [Remote host closed the connection]
kts has joined #dri-devel
heat has joined #dri-devel
ity has quit [Remote host closed the connection]
<mareko>
karolherbst: we could also make radeonsi use amdkfd to get system SVM, but it would need a new winsys
ity has joined #dri-devel
<karolherbst>
yeah... system SVM makes implementing all this stuff way easier, but I don't have anything which actually requires it
<karolherbst>
normal SVM is used by SYCL or chipStar (HIP on CL), so that's why it's relatively important to support at some point
simon-perretta-img has quit [Ping timeout: 480 seconds]
simon-perretta-img has joined #dri-devel
<agd5f>
CounterPillow, we have access to all generations of hardware in general at least at engineering board level. OEM specific platforms or boards are a different matter.
<CounterPillow>
agd5f: then it is strange to me that I've seen "I don't have access to this hardware" as a response to something multiple times, referring to non-OEM models.
<CounterPillow>
There seems to be some breakdown in communication
<agd5f>
CounterPillow, do you have an example?
<CounterPillow>
No, I don't have links from 6 months ago handy
<agd5f>
CounterPillow, not every engineer has every board, but as a team, we have a hardware library where you can get the boards
simon-perretta-img has quit [Ping timeout: 480 seconds]
<agd5f>
CounterPillow, that said, we have remote developers and it's not always feasible to send them one of every board so sometimes we need to reach out to someone in the office to repro issues, etc. which can take time
<agd5f>
CounterPillow, Thong is an AMD employee and he never said he didn't have access to the hardware.
oneforall2 has quit [Remote host closed the connection]
simon-perretta-img has quit [Ping timeout: 480 seconds]
<DemiMarie>
Do AMD kernel drivers have problems recovering from GPU resets?
oneforall2 has joined #dri-devel
<CounterPillow>
yes
<agd5f>
DemiMarie, they can, depending on the nature of the hang and hardware involved
<DemiMarie>
What is the reason for this?
<DemiMarie>
agd5f: will this be fixed in the future?
<CounterPillow>
I've never seen amdgpu recover successfully on either picasso or a 7900 XT or the zen4 igpu
<llyyr>
DemiMarie: just leave a h264 video playing with vaapi for 5-6 hours, you'll almost definitely get a reset within that time on a RDNA 2/3 gpu
<llyyr>
a reset that it doesn't recover from, that is
<agd5f>
DemiMarie, It's mostly older hardware. newer stuff should be in pretty good shape
<DemiMarie>
For context: this makes supporting AMD GPUs in virtualization use-cases (with virtio-GPU native contexts) significantly less appealing.
<DemiMarie>
agd5f: is it possible to just do a full GPU reset, wipe out everything in VRAM, give everyone a context lost error, and continue?
<agd5f>
DemiMarie, yes
<agd5f>
but most userspace doesn't handle context lost so even if the kernel resets everything, userspace is left in a bad state
<DemiMarie>
agd5f: does that mean that the zen4 iGPU not recovering is a bug?
<airlied>
daniels: Linus has merged my tree, i can give you a week :-)
<DemiMarie>
agd5f: I see, so that is a bug in all sorts of userspace programs?
<ccr>
uhh.
* DemiMarie
wonders if non-robust contexts should cause SIGABRT when a context loss happens
<agd5f>
DemiMarie, right. On other OSes, the desktop environment is robust aware and if it sees a context lost, it rebuilds its state, creates a new context and continues
<DemiMarie>
agd5f: I guess that means that bugs should be reported against various Wayland compositors.
<DemiMarie>
agd5f: what is the status of LeftoverLocals mitigations for AMD GPUs?
<agd5f>
DemiMarie, in progress
<DemiMarie>
agd5f: does the hardware make it quite difficult?
<DemiMarie>
IIRC Google shipped something for ChromeOS.
<DemiMarie>
Will there be a way to enforce the mitigations at the kernel driver level?
<agd5f>
DemiMarie, I don't think I'm at liberty to discuss the details at this point
shiva has joined #dri-devel
<DemiMarie>
agd5f: will the details be made available in the future?
<agd5f>
yes
<DemiMarie>
Context: I’m going to be working on GPU acceleration for Qubes OS and working LeftoverLocals protection, preferably at the kernel driver or firmware level, is a hard requirement there.
<DemiMarie>
The reason the location of the mitigations matters is that the userspace driver will be running in the guest, which is not trusted.
simon-perretta-img has joined #dri-devel
<zamundaaa[m]>
<CounterPillow> "I've never seen amdgpu recover..." <- I've seen it recover correctly lots of times with a 6800XT, and also once with a 7900XTX (only reset that has happened on it so far, triggered by Doom Eternal)
<zamundaaa[m]>
You just need to use the one compositor that supports recovering from GPU resets :)
<zmike>
weston ?
<zamundaaa[m]>
KWin
<daniels>
obviously weston uses the gpu so perfectly that we never need to recover
<CounterPillow>
zamundaaa[m]: I use KWin, and I don't think the problem was the compositor considering dmesg kept getting spammed with amdgpu trying to reset
<zamundaaa[m]>
I have seen a reset loop happen once before as well, on the 6800 XT. I thought that was fixed though, it hasn't happened in a while
kts has quit [Ping timeout: 480 seconds]
<agd5f>
I have my doubts as to whether this stuff will ever work very reliably on consumer level Linux in general just due to the nature of the ecosystem. There are tons of distros and they all use slightly different combinations and versions of components and no one can reasonably test all of those, plus all of the OSVs and IHVs focus the vast majority of their testing on their enterprise offerings.
shiva has quit []
<CounterPillow>
Ah, the good ol' "Linux is too diverse to support" excuse when it's surprisingly always your component that crashes.
Duke`` has quit [Remote host closed the connection]
Duke`` has joined #dri-devel
Marcand has quit [Ping timeout: 480 seconds]
<DemiMarie>
agd5f: The only things that should matter here are the KMD version and the firmware version
<DemiMarie>
And the hardware itself, obviously.
ungeskriptet has joined #dri-devel
<zamundaaa[m]>
Mesa can also matter. Until recently, reset handling in RadeonSi was borked
<agd5f>
DemiMarie, and the compositor version and the mesa version and the LLVM version
<CounterPillow>
The long-haired Linux smellies are simply asking too much of us when we have to make sure we don't have bugs in our firmware, our kernel driver, and our userspace driver
<zamundaaa[m]>
agd5f: KWin has supported GPU resets for a loooong time
<zamundaaa[m]>
And it's been ~90% functional almost always. In the remaining cases it would just crash, which is still better than a hang
<DemiMarie>
zamundaaa: What is the consequence of that? Applications not being able to deal with `VK_ERROR_DEVICE_LOST`/`GL_CONTEXT_LOST` reliably?
<zamundaaa[m]>
There were two issues, one was that RadeonSi never reported the GPU reset as being over
<DemiMarie>
agd5f: userspace should not determine whether the KMD can reset the GPU successfully
<DemiMarie>
agd5f: So LLVM problems can be solved by either having Mesa bundle LLVM, or by having Mesa stop using LLVM to generate AMD GPU code.
<zamundaaa[m]>
The other one was related to shared contexts, and meant that after re-creating OpenGL state, KWin would still get the context reset reported by RadeonSi on the new context, despite everything being fine
<agd5f>
CounterPillow, there are combinations of components that work great and others that do not. Say what you will about windows or android, it's a lot easier to test once and verify that it will work everywhere. It's not feasible to test every combination of driver, firmware, rest of kernel, UMD, LLVM, compositor, etc.
<agd5f>
DemiMarie, and we should also bundle kernel and mesa and firmware into one repo as well if we really want to get solid
<zamundaaa[m]>
agd5f: GPU reset handling on the application side is luckily very simple, so there isn't a lot of variation
<zamundaaa[m]>
It's pretty much just if (hasResetHappened()) recreateEglContexts()
<DemiMarie>
agd5f: from my PoV, the obvious solution to this is fuzzing
<agd5f>
I'm not talking about GPU reset specifically, just general GPU stack stability. Like you can have a good combination of KMD and firmware, but if the UMD is bad, you'll just keep getting resets
<DemiMarie>
agd5f: what should distros do?
<DemiMarie>
always take the latest kernel and latest Mesa?
<CounterPillow>
not ship AMD code since they're the only ones with this recurrent quality problem
<LaserEyess>
you don't need to test every combination of driver, firmware, software. Test the upstream kernel and upstream mesa, and pick a DE to test, it doesn't matter
<zamundaaa[m]>
CounterPillow: comments like that really don't help
ungeskriptet is now known as Guest688
ungeskriptet has joined #dri-devel
<agd5f>
LaserEyess, sure until distro X decides to pull in a new firmware or stick with an older mesa release, then you have an untested combination
<LaserEyess>
but that's not your problem, and if said distro is doing that then, well, they're doing something wrong
<agd5f>
LaserEyess, but that is what users use
<LaserEyess>
well I'm addressing the point of, for example, amdgpu bugs that are reproducible on drm-tip, or the stable linux kernel, or one of linus's -rc's
<CounterPillow>
Does breaking older user space with newer firmware count as an uapi break or is it fine because it's in firmware?
<tleydxdy>
who are these "users"?
<tleydxdy>
I doubt firmware change should affect anything beyond kmd
<tleydxdy>
if it did it's a kmd issue
Guest688 has quit [Ping timeout: 480 seconds]
ungeskriptet is now known as Guest689
ungeskriptet has joined #dri-devel
ungeskriptet has quit []
ungeskriptet has joined #dri-devel
<DemiMarie>
tleydxdy: those users are people using distros like Debian stable
Guest689 has quit [Ping timeout: 480 seconds]
<tleydxdy>
shouldn't they get support from debian?
<tleydxdy>
like amd is not in the position to do anything
<tleydxdy>
I would think the "users" for amd would be the upstream projects
<tleydxdy>
in that case there's only one support target: "tip of tree"
<DemiMarie>
but that has no humans actually using it, except for dev
<robclark>
DemiMarie: for LL bnieuwenhuizen made a mitigation that clears lmem in mesa.. configured via driconf. This is what we are shipping w/ CrOS but others are free to use it until we get something better from amd
<tleydxdy>
like the other reports help catch bugs that's good, but it's unrealistic to fully support them
<tleydxdy>
I can spin up a distro tmr that only ships known bad configs from vendor X and X would need to support my users?
<DemiMarie>
tleydxdy: “you have to be running this development version to get help” is not reasonable to expect from end-users
<tleydxdy>
yes, but amd also can't ship packages to debian stable
<tleydxdy>
so debian stable need to fix the issue
<tleydxdy>
not amd directly
<tleydxdy>
and if the fix is in upstream, they can backport
<tleydxdy>
"try latest upstream" is a reasonable ask if you are reporting an issue to upstream
<LaserEyess>
DemiMarie: the distro is the user, for example ubuntu. When you get a bug on ubuntu, you report it to their issue tracker, and a developer there should be your primary PoC. That developer should be the one coming to AMD if it's an AMD bug, and that developer should be able to run a development system
<LaserEyess>
in fact people pay canonical for that service
ungeskriptet has quit [Quit: Ping timeout (120 seconds)]
ungeskriptet has joined #dri-devel
<tleydxdy>
I mean if you are paying money that's a different story, whoever took your money should make sure you get fixed
<tleydxdy>
if you pay amd a contract then sure, run hanamontana os and get direct support
<LaserEyess>
I"m talking about the support contracts that many linux vendors offer
<LaserEyess>
amd does not offer support for those, the linux vendors do
<LaserEyess>
even free distros have bug trackers
<LaserEyess>
it's the same thing, just with volunteer time and not a contract
<DemiMarie>
tleydxdy: “latest released kernel and Mesa” would be something that is realistic to expect at least some users to run
<DemiMarie>
“tip of tree” isn’t
<DemiMarie>
not least because IIUC neither Linux nor Mesa actually recommend running it
junaid has quit [Remote host closed the connection]
<tleydxdy>
well, tip of tree might be a poor choice of words for me. but I was pretty sure e.g. linux would want you to try linux-next at least
<tleydxdy>
if you report bug directly to there
ungeskriptet has quit [Quit: Ping timeout (120 seconds)]
ungeskriptet has joined #dri-devel
<tleydxdy>
in any case I don't think hardware vendors should concern themselves with anything other than the latest upstream projects (i.e. direct consumer of their code) when it comes to test coverage. unless they got support contracts that mandate otherwise of course
<agd5f>
ROCm is super stable running RHEL with our packaged drivers. In that case we can make sure you are using a well validated combination of firmwares, driver code, and core OS components because both AMD and RH test the hell out of it. fedora, less so.
<tleydxdy>
yeah, give money to rhel might be the end lesson here
<DemiMarie>
agd5f: From what I have seen, Intel is stable on Fedora, too.
nukelet has joined #dri-devel
<DemiMarie>
What this sounds like to me is that the various interfaces are unstable.
<tleydxdy>
I sure hope fedora gets tested otherwise would't every rhel update be a QA hell?
<DemiMarie>
tleydxdy: that kind of stuff is why Linux on the desktop has a bad reputation
<DemiMarie>
tleydxdy: to me, “latest upstream projects” means “latest release version”
<DemiMarie>
robclark: is this race-free? In other words, is it guaranteed that GPU preemption can’t happen before that command stream finishes?
<robclark>
it clears at the end of each shader, so as long as there isn't mid-shader preemption it should be ok
Duke`` has quit [Ping timeout: 480 seconds]
ungeskriptet has quit [Quit: Ping timeout (120 seconds)]
ungeskriptet has joined #dri-devel
<DemiMarie>
Is mid-shader preemption guaranteed not to happen?
ungeskriptet has quit []
ungeskriptet has joined #dri-devel
<robclark>
better question for someone from amd but I wouldn't expect mid-shader preemption
<robclark>
ie. seems like it would be a hard thing to implement in hw
<agd5f>
DemiMarie, on AMD hardware mid-shader preemption is only supported on the user queues used by ROCm. Kernel managed queues are not preempted
<DemiMarie>
agd5f: is one reason that ROCm queues can be preempted that they do not have access to fixed-function blocks?
<agd5f>
only compute queues support mid-shader preemption. GFX is always at draw boundaries
<agd5f>
due to fixed function hardware
mvlad has quit [Remote host closed the connection]
ungeskriptet has quit [Quit: Ping timeout (120 seconds)]
ungeskriptet has joined #dri-devel
jsa has quit []
ungeskriptet has quit [Quit: Ping timeout (120 seconds)]
ungeskriptet has joined #dri-devel
Marcand has joined #dri-devel
<DemiMarie>
I see.
<DemiMarie>
Hopefully future hardware will support preemption of fixed-function units.
<DemiMarie>
Right now it seems that GFX is a second-class citizen when it comes to robustness.