ChanServ changed the topic of #dri-devel to: <ajax> nothing involved with X should ever be unable to find a bar
<HdkR>
kiroma: It covers a lot of things, so it needs to be large. But running subtests can help reduce the workload when you're poking around
tobiasjakobi has joined #dri-devel
Danct12 has joined #dri-devel
aravind has joined #dri-devel
pcercuei has quit [Quit: dodo]
tobiasjakobi has quit []
glennk has quit [Ping timeout: 480 seconds]
Kayden has quit [Quit: -> home]
aravind has quit [Ping timeout: 480 seconds]
aravind has joined #dri-devel
flynnjiang has joined #dri-devel
columbarius has joined #dri-devel
<kiroma>
All sanity tests were skipped bar one which failed because `Test requires OpenGL compat Shader Language version 1.1, but only 0.0 is available`
<kiroma>
I... haven't even run devenv yet? I'm not sure why this is failing.
co1umbarius has quit [Ping timeout: 480 seconds]
kts has joined #dri-devel
kts has quit []
yyds has joined #dri-devel
<kiroma>
Oh wow I was missing waffle-utils, and somehow python was succeeding in executing something that didn't exist
yuq825 has joined #dri-devel
Jeremy_Rand_Talos__ has quit [Remote host closed the connection]
Jeremy_Rand_Talos__ has joined #dri-devel
crabbedhaloablut has quit []
flynnjiang has quit [Read error: Connection reset by peer]
kts has joined #dri-devel
YuGiOhJCJ has joined #dri-devel
kts has quit [Ping timeout: 480 seconds]
Kayden has joined #dri-devel
flynnjiang has joined #dri-devel
luben has quit [Ping timeout: 480 seconds]
YuGiOhJCJ has quit [Remote host closed the connection]
YuGiOhJCJ has joined #dri-devel
oneforall2 has quit [Remote host closed the connection]
oneforall2 has joined #dri-devel
kiroma has quit [Quit: Konversation terminated!]
heat_ has quit [Ping timeout: 480 seconds]
lplc_ has joined #dri-devel
lplc has quit [Ping timeout: 480 seconds]
Company has joined #dri-devel
asriel has joined #dri-devel
bmodem has joined #dri-devel
Jeremy_Rand_Talos_ has joined #dri-devel
Jeremy_Rand_Talos__ has quit [Remote host closed the connection]
ascent12_ has joined #dri-devel
luben has joined #dri-devel
ascent12 has quit [Ping timeout: 480 seconds]
bmodem has quit [Quit: bmodem]
bmodem has joined #dri-devel
flynnjiang has quit [Quit: flynnjiang]
luben has quit [Ping timeout: 480 seconds]
yyds has quit [Remote host closed the connection]
larunbe has quit [Ping timeout: 480 seconds]
Duke` has joined #dri-devel
yyds has joined #dri-devel
i-garrison has quit []
i-garrison has joined #dri-devel
glennk has joined #dri-devel
tzimmermann has joined #dri-devel
gage has joined #dri-devel
fab has joined #dri-devel
lina has joined #dri-devel
co1umbarius has joined #dri-devel
macslayer has quit [Remote host closed the connection]
fab has quit [Ping timeout: 480 seconds]
columbarius has quit [Ping timeout: 480 seconds]
Daanct12 has joined #dri-devel
kzd has quit [Ping timeout: 480 seconds]
sima has joined #dri-devel
yyds has quit [Remote host closed the connection]
yyds has joined #dri-devel
Ahuj has joined #dri-devel
KetilJohnsen has quit []
tursulin has joined #dri-devel
crabbedhaloablut has joined #dri-devel
jfalempe_ has left #dri-devel [#dri-devel]
jfalempe has joined #dri-devel
<jfalempe>
tzimmermann, there was a mistake in my latency patch for mgag200, I will send a v2 soon. (the vmap parameter used to be a void *, and is now an iosys_map).
<jfalempe>
for the conditional, should I keep it under CONFIG_PREEMPT_RT, add a module parameter, or just do it unconditionally, since the performance impact is minor (less than 1% in my testing)?
i509vcb has quit [Quit: Connection closed for inactivity]
aravind has quit [Ping timeout: 480 seconds]
rasterman has joined #dri-devel
aravind has joined #dri-devel
<tzimmermann>
jfalempe, sorry. i've not been around much in the last weeks
<tzimmermann>
i've just read your reply to my review of that patch.
<jfalempe>
tzimmermann, ok np. There was not much reaction from rt-kernel either.
<jani>
hey all, I'm trying to move some intel docs from drm/intel wiki to a sphinx project. any objections to having the repo under drm/intel-docs in gitlab?
<jfalempe>
tzimmermann, but we have some users that are stuck on older kernel because of this, and I've to do something about it.
<sima>
jani, want me to create something and make you maintainer? or all the usual intel suspects?
<sima>
we unfortunately can't import access lists from another repo in gitlab, only from another group :-/
<sima>
jani, plan B would be to create the intel group, put the docs in there and use that intel group to mass-add people everywhere ...
<jani>
sima: idk, might want to go with separate permissions for docs anyway
<sima>
jani, aye
<sima>
so want me to create drm/intel-docs and hand it to you?
jsa has joined #dri-devel
<tzimmermann>
jfalempe, but i think you slightly missed the point of my reply. it's not a question whether the matrox is slow. that affected server is IMHO mis-designed for an RT system. even on cold caches, the system should work. one usually gets the jitter out by assuming worst-case execution times, or doing tricks like cache coloring. apparently neither is the case here. so if we paper over the matrox issue today, that problem
<tzimmermann>
later comes back within the file system or the memory allocator, or whatever else requires pages. i'm not entirely opposed to the patch, but i'd really want to hear some experienced RT devs' opinion on that problem first.
<jani>
sima: yeah, I'd like that if it's okay with everyone *waves hand across the channel*
<sima>
jani, intel-doc or intel-docs?
<tzimmermann>
and how do they deal with system management mode?
<jfalempe>
tzimmermann, those servers have been running for years with RT loads without issue. Only upgrading the Matrox driver causes them problems.
<sima>
also looks like you can be owner of a repo now too, not just a group?
<sima>
that feels new ...
* jani
bows
<jani>
thanks!
<jfalempe>
tzimmermann, for the system management mode, I'm not sure what it is, and how it can impact RT tasks.
<sima>
jani, ah looks like the various delete permissions (for issues, mr and project itself) were put into the owner role and hence you can also be owner for projects now directly
<sima>
I guess that's useful for code of conduct enforcement ...
<kode54>
please put back the space bar heating
<tzimmermann>
jfalempe, every few milliseconds, or so, your x86 cpu will put aside all work and run some internal tasks for housekeeping. that's system management mode. the os cannot predict or avoid that. it makes RT on x86 sort of complicated.
<sima>
jani, dolphin, rodrigovivi, tursulin upgraded you all to owner for drm/intel so you can delete stuff
<jfalempe>
tzimmermann, I think they use isolated CPUs, so they have no other IRQ/task running on those CPU cores. Maybe they can make sure the SMM will not run on those cores too?
<sima>
agd5f, hwentlan__ upgrade you + christian for drm/amd to owner so you can delete stuff (like issues violating CoC)
<tzimmermann>
jfalempe, it's an NMI. the point of SMM is that you cannot avoid it from the OS. maybe the architecture devs know more details
<sima>
ivyl, Adrinael done the same for igt for all maintainers (no idea about the irc nicks of all the newer people)
<sima>
robclark, seanpaul_ abhinav__ you're also upgraded to owner for drm/msm so you can delete issues/mr if necessary
<javierm>
tzimmermann: I don't know how they deal with SMM but until recently, RT completely disabled EFI runtime services, see d9f283ae71af ("efi: Disable runtime services on RT")
<javierm>
and a031651ff214 ("efi: Allow to enable EFI runtime services by default on RT")
<sima>
karolherbst, same for drm/nouveau (do we need that or can we just make drm/misc happen ...)
<javierm>
tzimmermann: since jfalempe's patch is restricted to RT, I don't see why it couldn't be merged if it fixes a real issue on that platform
<jfalempe>
tzimmermann, I will ask, and see how they managed that.
<tzimmermann>
on the argument of "it has worked for years": a correct RT system guarantees to make its deadlines for the RT tasks, no matter what the best-effort tasks do. it's not a kernel issue. it's an issue of system design.
<sima>
jfalempe, the real fix is the printk locking rework rt people are working on
<sima>
John Ogness is the main contact, #linux-rt here for status
<tzimmermann>
sima, thank you
<sima>
atm printk is a gigantic lock and it's absolute suck
<sima>
unless it's not printk being slow, but the hw being really funny
<tzimmermann>
right, that's what i heard
<karolherbst>
sima: good question...
<jfalempe>
tzimmermann, yes I think it's an issue with the way the Matrox is connected to the system. But I don't have a clear answer. it's the only component able to break the RT tasks it seems.
<karolherbst>
I think if drm/misc accepts MRs we'd be fine moving to drm/misc entirely
<sima>
jfalempe, could you try without that patch, fbcon fully disabled but fbdev enabled and use some userspace fbdev program to write into the framebuffer?
<jfalempe>
tzimmermann, but even if it's a hardware bug that needs a software workaround, I think our drivers are full of that.
<sima>
that should side-step any fbcon locking issues and allow us to purely observe anything funny by the hw
<sima>
fbtest or something should be able to write stuff to the fbdev mmap
<tzimmermann>
jfalempe, your patch does not touch the hardware. AFAICT you're flushing the pages of the GEM buffer in system memory
<javierm>
jfalempe, sima, tzimmermann: maybe a dmi match table to do the flush?
<javierm>
it's a workaround yes, but at least will be constrained to a single platform
<sima>
javierm, imo first make sure we don't paper over a fundamental sw issue somewhere else with this
<javierm>
sima: fair
<jfalempe>
tzimmermann, yes, that's the weird thing about it.
<sima>
because this looks extremely funny at best :-/
<sima>
jfalempe, does a program which has a multiple of the cpu cache size allocated and just thrashes that in a loop also break things?
<tzimmermann>
sima, that's exactly my point. it doesn't look like a kernel bug. it looks like a system design issue that just manifests in the mgag200 driver
<sima>
at appropriately low priority ofc
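[Editor's sketch] The experiment sima describes above (allocate a multiple of the CPU cache size and thrash it in a loop at low priority) could look roughly like the following. The 64-byte cache line and the 32 MiB last-level cache size are assumptions, not values from the affected server, and `thrash` and `run_forever` are hypothetical names. Python's interpreter overhead makes this a gentle thrasher; it only illustrates the access pattern.

```python
import os

CACHE_LINE = 64                      # bytes per cache line; assumed, typical for x86
ASSUMED_LLC_BYTES = 32 * 1024 * 1024  # assumed last-level cache size; adjust per machine


def thrash(buf, passes=1, line=CACHE_LINE):
    """Write one byte in every cache line of buf, `passes` times over.

    Returns the number of cache-line-sized strides that were written.
    """
    touched = 0
    for _ in range(passes):
        for off in range(0, len(buf), line):
            buf[off] = (buf[off] + 1) & 0xFF
            touched += 1
    return touched


def run_forever(multiple=4):
    """Thrash a buffer `multiple` times larger than the assumed cache, at low priority."""
    os.nice(19)  # lowest priority for an ordinary (non-RT) task
    buf = bytearray(multiple * ASSUMED_LLC_BYTES)
    while True:
        thrash(buf)
```

Calling `run_forever()` on a non-isolated core while the RT workload runs would show whether plain cache pressure, with no graphics driver involved, is enough to make the RT tasks miss deadlines.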
<jfalempe>
sima, flooding fbcon with the patch, is working great.
<sima>
tzimmermann, yeah smells a bit like cpu cache thrashing
<jfalempe>
like doing cat /dev/urandom | base64 in the fbcon terminal.
<tzimmermann>
sima, this. i think the system draws to that buffer and flushing the cache simply cleans up the dirty cachelines
<sima>
and as long as the damage helper gets run often enough so that only ever a small part of the shadow fb is loaded into cpu cache the hack works
<sima>
but if you do a full screen clear then the cpu cache is busted again and flushing it all out won't help
<tzimmermann>
and as GEM BOs are large, it's easy to thrash the whole cache
<sima>
yeah
<sima>
if that's the case then disabling fbcon would be the fix
<sima>
because that's the only way to make sure fbcon doesn't thrash the cpu cache at a bad point
kts has joined #dri-devel
<jfalempe>
sima, flooding fbcon does a full redraw, and a full framebuffer flush.
<sima>
jfalempe, yeah but does that break your w/a?
<tzimmermann>
hence my comment that a well-designed RT system should not behave like that
<jfalempe>
sima, no it's working in this case.
<sima>
huh
<sima>
this just went straight to wtf territory ...
<tzimmermann>
interesting
<sima>
I think even more reasons to test with userspace mmap and see whether that makes any difference
<sima>
and also whether just thrashing cpu caches in general is the issue or not
<tzimmermann>
jfalempe, can you test full-screen updates with a random-access pattern?
<jfalempe>
sima, ok, I will ask for more tests, since I don't have access to these servers.
<tzimmermann>
such as filling pixels in random locations on the screen
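[Editor's sketch] A minimal way to produce the random-access pattern tzimmermann asks for above, as opposed to the top-to-bottom writes that console output generates. `scatter_pixels` is a hypothetical helper; it works on any seekable binary file object, so the width, height and stride passed in are assumptions the caller has to supply for the real device.

```python
import io
import random


def scatter_pixels(fb, width, height, stride, bytes_per_pixel=4, count=1000, seed=None):
    """Write `count` random-colored pixels at random (x, y) positions.

    fb     -- seekable binary file object (a framebuffer device or a test buffer)
    stride -- bytes per scanline, which may be larger than width * bytes_per_pixel
    """
    rng = random.Random(seed)
    for _ in range(count):
        x = rng.randrange(width)
        y = rng.randrange(height)
        fb.seek(y * stride + x * bytes_per_pixel)
        fb.write(bytes(rng.getrandbits(8) for _ in range(bytes_per_pixel)))
    return count


# Dry run against an in-memory 16x8 "framebuffer" with 4 bytes per pixel.
fake_fb = io.BytesIO(bytes(16 * 8 * 4))
scatter_pixels(fake_fb, width=16, height=8, stride=64, count=50, seed=1)
```

Against real hardware the same function could be handed `open("/dev/fb0", "r+b", buffering=0)`, with the geometry read from `/sys/class/graphics/fb0/virtual_size`, `stride` and `bits_per_pixel`; whether those match the setup on the affected servers is not known from the discussion.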
<tzimmermann>
jfalempe, doing linear access might trigger HW-internal optimizations
<jfalempe>
what they have done is filling the terminal with " cat /dev/urandom | base64", that should be close to random pixels ?
<tzimmermann>
jfalempe, no you're still writing the framebuffer memory from top to bottom
<jfalempe>
hum, ok you want random damage in the framebuffer ?
<tzimmermann>
what i mean is to really access random pixels
<tzimmermann>
or at least random characters
<sima>
jfalempe, oh was that just on the console? I thought the issue was printk
<tzimmermann>
yes, to avoid linear access.
<jfalempe>
sima, yes using the console also leads to this problem.
<sima>
hm ...
<sima>
otoh you can still run into console_lock contention
<jfalempe>
even the blinking cursor is enough to make the RT tasks fail (even if that's a very small number of pixels).
<sima>
jfalempe, direct fbdev mmap with fbtest or similar would still be interesting, since that bypasses console_lock
<sima>
jfalempe, huh
<sima>
jfalempe, might be good to jump over to #linux-rt and ask for debug ideas/tools there too
<sima>
maybe after we've figured out whether it's related to console_lock in any way or not
<sima>
since a few cachelines for redrawing the cursor really shouldn't make anything else hit a deadline
<sima>
unless the deadline is way too close already
<jfalempe>
the thing is the RT task is running on a dedicated CPU core, there is almost no linux kernel code running on it. But for some reason other CPUs are affected by the framebuffer draw.
<tzimmermann>
jfalempe, as you've noted yesterday, we're doing quite a bit of vmap/vunmap in the kernel address space. IDK maybe that has an impact on RT as well
<sima>
tzimmermann, yeah but it doesn't seem to be the lack of vmap/vunmap, just the flushing that makes a difference ...
<jfalempe>
tzimmermann, surprisingly that was beneficial for the RT tasks,
<tzimmermann>
jfalempe, but it's not the RT task that does the print, right?
<sima>
jfalempe, small userspace tool which simulates the cursor drawing access pattern would also be interesting ...
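[Editor's sketch] One possible shape for the small userspace tool sima mentions above: it repeatedly toggles a tiny block of pixels, roughly like a blinking text cursor, and touches nothing else. The 8x2 pixel block, 4 bytes per pixel and the half-second interval are assumptions about what fbcon's cursor looks like, and `blink_cursor` is a hypothetical name.

```python
import io
import time


def blink_cursor(fb, x, y, stride, blinks, cell_w=8, cell_h=2,
                 bytes_per_pixel=4, interval=0.5):
    """Toggle a cell_w x cell_h pixel block at (x, y) between white and black.

    Returns the total number of bytes written across all blinks.
    """
    row_bytes = cell_w * bytes_per_pixel
    for i in range(blinks):
        fill = b"\xff" if i % 2 == 0 else b"\x00"
        for row in range(cell_h):
            fb.seek((y + row) * stride + x * bytes_per_pixel)
            fb.write(fill * row_bytes)
        fb.flush()
        time.sleep(interval)
    return blinks * cell_h * row_bytes


# Dry run on an in-memory buffer: 16 pixels wide, 4 rows, no delay.
fake_fb = io.BytesIO(bytes(64 * 4))
blink_cursor(fake_fb, x=0, y=0, stride=64, blinks=3, interval=0)
```

Run against the fbdev node with fbcon disabled, this would reproduce only the memory access pattern of the cursor, without console_lock or printk in the path.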
<jfalempe>
tzimmermann, yes it's on another core, the print can't run on the RT core.
<sima>
yeah if the rt task does any printk it's game over with the current console locking
<jfalempe>
I don't think it's an issue with locking, it's mostly the cache or external bus, that can affect other cores like this.
<tzimmermann>
jfalempe, is that a NUMA system? do some of the CPU cores share some of the memory bus or L2/L3 caches?
<tzimmermann>
that could be a cause for interference
<sima>
jfalempe, the cursor is 2 cachelines redrawn once per second ...
<sima>
or maybe 4 if it's crossing over
<sima>
so 256 bytes
<sima>
if that's enough, then something very funny is going on ...
<sima>
that's like a few % at most of a modern cpu's L1
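[Editor's sketch] sima's back-of-the-envelope numbers above can be checked with a few lines. The 64-byte cache line and the 32 KiB L1 data cache are assumptions (common on x86, not confirmed for the affected server); the cache line counts are taken from the discussion.

```python
CACHE_LINE = 64        # bytes per cache line; assumed
L1D_BYTES = 32 * 1024  # L1 data cache size; assumed


def cursor_footprint(lines, line_size=CACHE_LINE):
    """Bytes of cache touched when the cursor redraw dirties `lines` cache lines."""
    return lines * line_size


def share_of_cache(nbytes, cache_bytes=L1D_BYTES):
    """Footprint as a percentage of the cache size."""
    return 100.0 * nbytes / cache_bytes


worst = cursor_footprint(4)  # 4 lines when the cursor crosses a line boundary
print(worst, round(share_of_cache(worst), 2))  # prints: 256 0.78
```

Under these assumptions the once-per-second cursor redraw dirties under 1% of one core's L1, which is why it should not be able to push another core past a deadline through cache pressure alone.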
aravind has quit [Remote host closed the connection]
aravind has joined #dri-devel
<sima>
jfalempe, btw if you don't have fbtest or similar handy on that server then just writing directly into the fbdev /dev node should work too
<jfalempe>
tzimmermann, I didn't find which server they are using on logs. I think it's some standard one.
<sima>
don't need to mmap
<jfalempe>
sima, I think at some point I asked them to write directly to /dev/fb0
<sima>
so 1. completely disable fbcon in .config and 2. write stuff to /dev/fb/0 to simulate what fbcon would do
<sima>
jfalempe, yeah but need to make sure fbcon is completely out of the picture, otherwise it's not very interesting experiment
<jfalempe>
sima, ok, I'm trying to summarize that, and that will take a few days before having the answer.
<jfalempe>
Thanks tzimmermann and sima, I hope this will shed some light on this issue.
<javierm>
jfalempe: git://git.kernel.org/pub/scm/linux/kernel/git/geert/fbtest.git has tests for random access AFAIK
oszi has left #dri-devel [#dri-devel]
aravind has quit [Ping timeout: 480 seconds]
<jfalempe>
javierm, thanks, I will see if we can make use of it.
AnuthaDev has quit []
pixelcluster has quit [Ping timeout: 480 seconds]
yuq825 has quit []
gage has quit [Remote host closed the connection]
bmodem has quit [Ping timeout: 480 seconds]
<zamundaaa[m]>
MrCooper: interesting. Is there anything that can be done to avoid that with KMS though, short of fixing the kernel? It's not like we can disable implicit sync for atomic commits, right?
ptrc has quit [Remote host closed the connection]
ptrc has joined #dri-devel
yyds has quit [Remote host closed the connection]
<emersion>
does IN_FENCE_FD disable implicit sync?
glennk has quit [Ping timeout: 480 seconds]
<emersion>
GL has exts to disable implicit sync but not supported by Mesa sadly
yyds has joined #dri-devel
yyds has quit [Remote host closed the connection]
yyds has joined #dri-devel
vliaskov has joined #dri-devel
<jani>
sima: I'm positively surprised that having set up the gitlab CI in my personal repo, just pushing it to the new one made it all work
<jani>
sima: even though most of it was just cargo culted :D
kts_ has joined #dri-devel
kts has quit [Ping timeout: 480 seconds]
kts_ has quit []
glennk has joined #dri-devel
<MrCooper>
zamundaaa[m]: yes, IN_FENCE_FD
djbw has quit [Read error: Connection reset by peer]
<MrCooper>
the issue reporter successfully tested a proof-of-concept patch for that
<emersion>
nice
<zamundaaa[m]>
Cool. I'll hook that up in KWin too then
swick[m] has joined #dri-devel
Jeremy_Rand_Talos_ has quit [Remote host closed the connection]
Jeremy_Rand_Talos_ has joined #dri-devel
<swick[m]>
emersion: btw, thanks for pushing dma-buf heaps. I really think they are how we can solve the generic allocation stuff in the long term...
rasterman has quit [Quit: Gettin' stinky!]
<javierm>
emersion: thanks a lot for your r-b. I wasn't sure if I got all the terminology correct :)
<enunes>
emersion: one thought I have in mind with the dma_heap solution is, we might still need to carry the current solution/workaround in mesa for a while even if we land that right? otherwise the driver will stop working in kernel versions before the one which has that
<emersion>
yes, we will
AnuthaDev has quit []
<enunes>
too bad we still won't be able to get rid of it