macromorgan has quit [Read error: Connection reset by peer]
CME has joined #dri-devel
karolherbst has quit [Ping timeout: 480 seconds]
khfeng has joined #dri-devel
vivijim has quit [Remote host closed the connection]
jcline has quit [Quit: Bye.]
Tito1337 has joined #dri-devel
Tito1337 has quit [Remote host closed the connection]
mbrost has joined #dri-devel
gpoo has joined #dri-devel
sdutt has quit [Remote host closed the connection]
blue__penquin has joined #dri-devel
danvet has joined #dri-devel
noahhsmith[m|gr] has joined #dri-devel
noahhsmith[m|gr] has quit [Remote host closed the connection]
Duke`` has joined #dri-devel
DavidMartin[m] has joined #dri-devel
DavidMartin[m] has quit [Remote host closed the connection]
gpoo has quit [Ping timeout: 480 seconds]
pixelgeek has joined #dri-devel
pixelgeek has quit [Remote host closed the connection]
shankaru1 has joined #dri-devel
dviola has joined #dri-devel
thellstrom1 has joined #dri-devel
thellstrom has quit [Remote host closed the connection]
thellstrom1 has quit [Ping timeout: 480 seconds]
Duke`` has quit [Ping timeout: 480 seconds]
itoral has joined #dri-devel
curro has quit [Ping timeout: 480 seconds]
frieder has joined #dri-devel
dviola has quit [Quit: WeeChat 3.1]
bluestang has joined #dri-devel
bluestang has quit [Remote host closed the connection]
jfb4 has joined #dri-devel
jfb4 has quit [Remote host closed the connection]
blue__penquin has quit [Remote host closed the connection]
blue__penquin has joined #dri-devel
aissen_ has quit []
aissen has joined #dri-devel
pekkari has joined #dri-devel
pnowack has joined #dri-devel
mlankhorst has joined #dri-devel
idr_ has joined #dri-devel
idr has quit [Remote host closed the connection]
profit_ has joined #dri-devel
profit_ has quit [Remote host closed the connection]
rasterman has joined #dri-devel
yk has quit [Remote host closed the connection]
lemonzest has joined #dri-devel
lplc has joined #dri-devel
xp4ns3 has joined #dri-devel
pcercuei has joined #dri-devel
mbrost has quit [Remote host closed the connection]
yk has joined #dri-devel
rgallaispou has quit [Remote host closed the connection]
karolherbst has joined #dri-devel
rgallaispou has joined #dri-devel
blue__penquin has quit [Remote host closed the connection]
blue__penquin has joined #dri-devel
tzimmermann has joined #dri-devel
matt_c has joined #dri-devel
matt_c has quit [autokilled: Suspected spammer. Mail support@oftc.net with questions (2021-06-08 10:15:39)]
hch12907_ has joined #dri-devel
hch12907 has quit [Ping timeout: 480 seconds]
thellstrom has joined #dri-devel
jshmlr has joined #dri-devel
jshmlr has quit [Remote host closed the connection]
blue__penquin has quit [Remote host closed the connection]
blue__penquin has joined #dri-devel
gpoo has joined #dri-devel
macromorgan_ has quit [Remote host closed the connection]
macromorgan has joined #dri-devel
hch12907_ is now known as hch12907
kiero has joined #dri-devel
kiero has quit [Remote host closed the connection]
<mareko>
do modifiers have a way to select optimal tiling for rotated display?
<emersion>
mareko, no
<emersion>
the only way to know right now is to allocate a buffer with all of the plane's modifiers, then do an atomic test-only commit to see if the modifier is supported for rotated planes
<emersion>
and if not, prune the modifier and repeat
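(For reference: a minimal sketch of the allocate / test-only / prune loop emersion describes, using libdrm's atomic API and GBM. pick_rotated_modifier() and create_fb_for_bo() are hypothetical helpers, the property IDs are assumed to have been looked up already, and a real commit may also need CRTC_ID and the SRC_*/CRTC_* coordinates if the plane isn't already configured.)

    #include <gbm.h>
    #include <xf86drm.h>
    #include <xf86drmMode.h>
    #include <drm_fourcc.h>

    /* Try each remaining modifier: allocate a buffer with it, then ask the
     * kernel (TEST_ONLY, nothing hits the screen) whether the plane accepts
     * it while rotated.  On failure, prune the modifier and move on. */
    static uint64_t pick_rotated_modifier(int fd, struct gbm_device *gbm,
                                          uint32_t width, uint32_t height,
                                          uint32_t plane_id, uint32_t prop_fb_id,
                                          uint32_t prop_rotation,
                                          const uint64_t *mods, unsigned n_mods)
    {
        for (unsigned i = 0; i < n_mods; i++) {
            struct gbm_bo *bo = gbm_bo_create_with_modifiers(gbm, width, height,
                                                             GBM_FORMAT_XRGB8888,
                                                             &mods[i], 1);
            if (!bo)
                continue;

            uint32_t fb_id = create_fb_for_bo(fd, bo); /* drmModeAddFB2WithModifiers() */

            drmModeAtomicReq *req = drmModeAtomicAlloc();
            drmModeAtomicAddProperty(req, plane_id, prop_fb_id, fb_id);
            drmModeAtomicAddProperty(req, plane_id, prop_rotation, DRM_MODE_ROTATE_90);
            int ret = drmModeAtomicCommit(fd, req, DRM_MODE_ATOMIC_TEST_ONLY, NULL);
            drmModeAtomicFree(req);

            drmModeRmFB(fd, fb_id);
            gbm_bo_destroy(bo);

            if (ret == 0)
                return mods[i];   /* supported for the rotated plane */
        }
        return DRM_FORMAT_MOD_INVALID;   /* every modifier was pruned */
    }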
heluecht[m] has joined #dri-devel
heluecht[m] has quit [Remote host closed the connection]
<mareko>
emersion: can we not ask drm whether the display is rotated?
<emersion>
mareko: i don't understand that question
<emersion>
user-space decides whether to rotate a plane or not
<emersion>
maybe you can tell more about what you're trying to achieve?
<mareko>
emersion: set optimal tiling for rotated buffers
<emersion>
mareko: okay, so the question is not about finding a *supported* modifier for rotated buffers, but finding the *optimal* one?
<mareko>
yes
<emersion>
because on AMD hw, the optimal modifier is not the same if the buffer is going to be rotated via KMS?
<mareko>
there might also be tiling modes that are supported when non-rotated, but unsupported when rotated
<emersion>
these are two different questions
<mareko>
let's focus on the latter question for now
<emersion>
so optimal modifier
<mareko>
I guess the answer is that it's unsupported
<emersion>
there's no way to do this right now. GBM would need to know that the buffer will be rotated via KMS
<emersion>
however that would leak display engine details into GBM
<emersion>
which isn't what we want in the long run
<emersion>
so GBM would need to know that the buffer's consumer prefers a different modifier than what radeonsi prefers
<mareko>
or Mesa can use rotated tiling if width < height
<emersion>
hm. i don't know if heuristics like this are a good idea. daniels?
<emersion>
fwiw, the "find a supported modifier" problem is somewhat easier to solve
<daniels>
emersion: but you already know what I'm going to say :(
<emersion>
daniels, tranches?
<emersion>
mareko, do amd modifiers tell whether the tiling is rotated?
flibitijibibo has quit [Remote host closed the connection]
<pq>
sounds like we would need KMS to give us a set of supported modifiers based on plane configuration
flibitijibibo has joined #dri-devel
<pq>
...in tranches
<emersion>
… and with buffer constraints as well!
<pq>
"optimal" is a hard problem, because there is no component that would be aware of both display and rendering preferences simultaneously and be able to reason about the trade-offs.
<mareko>
emersion: we don't currently expose rotated tiling, but yes, the rotated flag can be extracted from the modifier uint64_t
<bnieuwenhuizen>
mareko: I think the width < height thing isn't going to work for displays that natively are tall
<emersion>
right. the only component in the middle of render and display is the compositor, and it shouldn't have driver-specific logic
<bnieuwenhuizen>
nor for overlays
<emersion>
yeah, for overlays you'll often guess wrong
mp has joined #dri-devel
<pq>
it seems impossible to keep display separated from rendering while needing an optimal solution, so where would we compromise?
mp has quit [Remote host closed the connection]
<pq>
Did even the Unix Device Memory Allocator plans cater for optimal?
<emersion>
is it even possible to take a good decision on e.g. split render/display SoCs?
<pq>
if you define your goals carefully, I'm sure there exists an optimal solution, but...
<pq>
it may end up being holistic, e.g. minimize whole-system power consumption
<emersion>
eh
Lightkey has quit [Ping timeout: 480 seconds]
<pq>
we have to start by defining the problem
<emersion>
i guess we should solve "find a supported buffer" first
<mareko>
there are 2 ways to rotate: 1) while compositing (not viable for fullscreen apps/video) 2) in display hw (any fullscreen app/video or overlay); the hw might require different modifiers for rotated and non-rotated because of the walking pattern thrashing TLB etc.
<emersion>
mareko: i wasn't aware of the "require" part. i thought any non-linear buffer could be rotated by KMS
<emersion>
is it "require" or "prefer"?
<mareko>
require
<mareko>
it depends on the hw, e.g. you might have enough bandwidth/TLB for 4 non-rotated displays, but if you rotate, that number might drop to max 2 displays if you have incorrect tiling
<emersion>
to solve the "require" part, either do the KMS test-only dance described earlier, or add a KMS API to return a list of modfiiers for a given KMS configuration
<emersion>
ah, interesting
<mareko>
if you connect more displays, it will flicker
<emersion>
to solve the "prefer" part, organize that modifier list in preference tranches
<mareko>
low power devices might require optimal tiling for 1 display
<emersion>
instead of saying "the rotated plane supports modifiers [A, B, C, D]", the driver would say "the rotated plane supports modifiers [A, B] but if you really can't use those it also supports [C, D]"
<emersion>
on top of all of that, add other buffer constraints like alignment etc
<emersion>
then you end up with quite a few lines of code to type to introduce all of the new uAPI
<emersion>
i hope we can work on this step by step
<emersion>
first add a new uAPI to return a list of modifiers for a given KMS configuration
<emersion>
scratch that
<emersion>
first add a new uAPI to check if a given KMS configuration would work, without having to allocate a buffer
<emersion>
then add on top of this new uAPI to return a list of modifiers if the configuration _can_ work
<emersion>
then also return buffer constraints
<emersion>
then also organize all of the returned data into preference tranches
Lightkey has joined #dri-devel
<pq>
mareko, you said "it will flicker". Is the driver not properly rejecting KMS configurations that do not work?
thelounge6753161 has joined #dri-devel
thelounge6753161 has quit [Remote host closed the connection]
<pq>
emersion, we are also supposed to have the global modifier pruning algorithm in compositors with KMS atomic test, and we don't. That would probably go a long way at least in making sure outputs can be lit.
<pq>
it's probably not feasible for fishing out workable plane configurations though, but if the use case is fullscreen apps, might be enough
xp4ns3 has quit []
<emersion>
yeah, per-device modifier pruning would help for the "find something supported but not optimal" part
<pq>
intel has roughly the same problem with its Y_TILING IIRC, right? And that's not even rotated.
<emersion>
yea
blue__penquin has quit [Remote host closed the connection]
<pq>
since mareko said "required", it sounded to me like the problem simplified to "something supported" rather than "optimal from many supported combinations"
<pq>
I guess the first step is to get that modifier pruning going in userspace, before thinking about adding more UAPI.
<emersion>
well
<emersion>
i don't really like it
<emersion>
let's say, i have no plans to implement it
neonking has joined #dri-devel
<pq>
emersion, is having to allocate a buffer to test the biggest problem or something else?
itoral has quit [Remote host closed the connection]
<emersion>
i guess even if we have the enhanced uAPI, we'll still have to have per-device stuff going on
<emersion>
having to allocate a buffer is enough to make me not motivated to fix it :P
<MrCooper>
mareko: why is rotating while compositing not viable for fullscreen apps/video? IME there's only a small performance hit, though I suppose it might be significant for energy consumption
thellstrom1 has joined #dri-devel
thellstrom has quit [Remote host closed the connection]
vivijim has joined #dri-devel
sdutt has joined #dri-devel
sdutt has quit []
sdutt has joined #dri-devel
vivijim has quit [Remote host closed the connection]
vivijim has joined #dri-devel
thellstrom1 has quit [Ping timeout: 480 seconds]
shankaru1 has quit []
fool has joined #dri-devel
fool has quit [Remote host closed the connection]
jcline has joined #dri-devel
<mareko>
MrCooper: I'm assuming the compositor is idle when displaying fullscreen apps/video
<pq>
it's still doing KMS commits every frame at the very least
<MrCooper>
that doesn't answer my question :) yes it can make the difference between the compositor drawing one quad or nothing, but why would the former make what "not viable"?
pekkari has quit []
<alyssa>
@Vulkan spec ninjas: Is it acceptable for an implementation to advertise VK_FORMAT_FEATURE_{SRC,DST}_BIT for linearTiling but not optimalTiling?
<dj-death>
alyssa: there are required features & linear/optimal tilings, so need to check exactly what feature
<dj-death>
and what format
<kusma>
alyssa: Depends on the format, yeah...
<alyssa>
^TRANSFER_{SRC,DST}_BIT
<alyssa>
dj-death: Right, I see the table but can't tell if it's for the union or the intersection of linear/optimal features
<kusma>
Sure you don't mean VK_FORMAT_FEATURE_BLIT_{SRC,DST}_BIT?
<kusma>
alyssa:
<kusma>
For instance, the compressed formats require VK_FORMAT_FEATURE_BLIT_SRC_BIT for optimalTiling if the feature bit is exposed.
<kusma>
OK, I somehow missed that in the vk header :P
<alyssa>
which has nightmarish interactions with AFBC
<dj-death>
alyssa: how are you going to upload to optimal memory then?
<kusma>
Yeah, that one is probably OK...
<alyssa>
dj-death: er.. glTexSubImage2D....? o:)
<alyssa>
in our GL driver it's handled internally as a blit from a linear staging buffer
<kusma>
alyssa: TRANSFER_{SRC,DST}_BIT is what allows doing that, though...
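(A minimal sketch of that staging upload, i.e. what TRANSFER_DST_BIT on optimalTiling gates; cmd, staging_buf, image, width and height are assumed to already exist, with the texel data written into the host-visible staging buffer and the image already transitioned to TRANSFER_DST_OPTIMAL.)

    /* Copy tightly-packed texels from a linear, host-visible staging buffer
     * into the optimal-tiled image. */
    VkBufferImageCopy region = {
        .bufferOffset = 0,
        .bufferRowLength = 0,          /* 0 = tightly packed */
        .bufferImageHeight = 0,
        .imageSubresource = {
            .aspectMask = VK_IMAGE_ASPECT_COLOR_BIT,
            .mipLevel = 0,
            .baseArrayLayer = 0,
            .layerCount = 1,
        },
        .imageOffset = { 0, 0, 0 },
        .imageExtent = { width, height, 1 },
    };
    vkCmdCopyBufferToImage(cmd, staging_buf, image,
                           VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 1, &region);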
<alyssa>
hrm.
<alyssa>
One real ugly case is CopyImage between AFBC textures of different formats
<kusma>
"Formats that are required to support VK_FORMAT_FEATURE_SAMPLED_IMAGE_BIT must also support VK_FORMAT_FEATURE_TRANSFER_SRC_BIT and VK_FORMAT_FEATURE_TRANSFER_DST_BIT"
<alyssa>
Nod..
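(For context, the per-tiling split being discussed comes from vkGetPhysicalDeviceFormatProperties: linearTilingFeatures and optimalTilingFeatures are independent bitmasks, so an implementation reports each tiling separately. A small sketch; physical_device and format are assumed to come from the caller.)

    #include <stdbool.h>
    #include <vulkan/vulkan.h>

    /* Returns whether `format` can be a transfer src and dst for the given tiling. */
    static bool supports_transfer(VkPhysicalDevice physical_device, VkFormat format,
                                  VkImageTiling tiling)
    {
        VkFormatProperties props;
        vkGetPhysicalDeviceFormatProperties(physical_device, format, &props);

        VkFormatFeatureFlags f = (tiling == VK_IMAGE_TILING_LINEAR)
                                     ? props.linearTilingFeatures
                                     : props.optimalTilingFeatures;

        return (f & VK_FORMAT_FEATURE_TRANSFER_SRC_BIT) &&
               (f & VK_FORMAT_FEATURE_TRANSFER_DST_BIT);
    }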
jhuizy has joined #dri-devel
jhuizy has quit [Remote host closed the connection]
<alyssa>
Hm.
<alyssa>
Note to self: CRC is broken when interacting with imageStore()
blaudioslave has joined #dri-devel
blaudioslave has quit [autokilled: Suspected spammer. Mail support@oftc.net with questions (2021-06-08 14:13:41)]
<bnieuwenhuizen>
MrCooper: power usage between the GFX part of the GPU being on and off can be very significant
<danvet>
MrCooper, kernel orders pageflip vs setcrtc on atomic drivers already
<danvet>
they complete in the order you've done the ioctl calls
adjtm has joined #dri-devel
<mareko>
MrCooper: battery life
<mareko>
rotating via a blit is viable until it's not
SpiritOfSummer has joined #dri-devel
SpiritOfSummer has quit [autokilled: Suspected spammer. Mail support@oftc.net with questions (2021-06-08 15:59:08)]
frieder has quit [Remote host closed the connection]
xp4ns3 has joined #dri-devel
<MrCooper>
danvet: I could swear I was able to reproduce the race even on Intel
<MrCooper>
maybe I misremember
<MrCooper>
bnieuwenhuizen mareko: OK, makes sense for fullscreen video where GFX can be off completely; "fullscreen apps/video" just sounded more general :)
<danvet>
MrCooper, you mean 1. page_flip ioctl 2. setcrtc ioctl?
<MrCooper>
yep
<danvet>
and then the page flip completes before the setcrtc and we end up scanning out the wrong plane?
<danvet>
that should be impossible with atomic
<MrCooper>
or maybe it was 1. page flip 2. VT switch
<danvet>
could very well have been busted on all legacy drivers
<danvet>
but I thought we had various "stall for pending page flip" in our crtc disable hooks
<danvet>
but then legacy helpers were funky
<danvet>
MrCooper, should be the same, fbcon/next compositor just does a setcrtc
tweaks has quit [autokilled: Suspected spammer. Mail support@oftc.net with questions (2021-06-08 16:23:19)]
<zmike>
anholt: you have any more comments on !11134 or can it marge
<anholt>
added a comment
jernej has joined #dri-devel
Ben64 has joined #dri-devel
Ben64 has quit [Remote host closed the connection]
Danct12 has quit [Quit: Quitting]
Danct12 has joined #dri-devel
idr_ is now known as idr
gouchi has joined #dri-devel
alanc has quit [Remote host closed the connection]
alanc has joined #dri-devel
<daniels>
mareko: so KMS doesn't tell us whether or not a given modifier is usable together with rotation ... but it also doesn't tell us which dimensions/scaling/etc are suitable for rotation, or whether a given modifier might decrease global availability (Intel Y-tiling vs. FIFO capacity, Rockchip being able to decode AFBC on any plane but only one per CRTC, etc)
mceier has joined #dri-devel
jiggie has joined #dri-devel
jiggie has quit [Remote host closed the connection]
<bl4ckb0ne>
i think i pulled the headers from Vulkan-Headers for 179
<jekstrand>
I don't know why the SDK header and hpp are in there
<jekstrand>
They're not built by VulkanDocs
<jekstrand>
And we don't need either of them
<bl4ckb0ne>
probably a mistake from my part
<jekstrand>
no worries
<jekstrand>
I should really check my update_vulkan_headers script into the tree
<zmike>
what if you just made a cron job to trigger an MR updating the headers every monday? 🤔
<alyssa>
zmike: CI would fail every other monday, though.
<bl4ckb0ne>
huh weird the khronos copyright went back to 2020
thellstrom has joined #dri-devel
<zmike>
jekstrand: btw I ran that failure case you got today about a billion (imperial units) times and it doesn't seem to me that there's any possible way it crashes outside of some kind of spectacular system failure
<jekstrand>
bl4ckb0ne: That's because there's something wrong with your header update
<bl4ckb0ne>
yup
<jekstrand>
bl4ckb0ne: Did you pull master or main?
<bl4ckb0ne>
pulled master instead of main
<bl4ckb0ne>
muscle memory
<jekstrand>
That'll do it
libv_ is now known as libv
thellstrom1 has joined #dri-devel
thellstrom has quit [Remote host closed the connection]
iive has joined #dri-devel
Danct12 has joined #dri-devel
marex_ has joined #dri-devel
marex has quit [Read error: Connection reset by peer]
<bl4ckb0ne>
waiting for CI to finish now
<jekstrand>
bl4ckb0ne: Acked. Feel free to add my tag and marge. Looks good this time.
* jekstrand
rebases VK_EXT_global_priority_query on top
<bl4ckb0ne>
the `Part-of` is added automatically right?
<jekstrand>
yup
<jekstrand>
Marge does that
<jekstrand>
Also, you don't need to wait for CI. Marge will rebase, add the Part-of, run CI and then merge.
<jekstrand>
If CI fails, Marge won't merged.
<jekstrand>
*merge
<bl4ckb0ne>
ill let you assign it to marge, I don't have the rights
<jekstrand>
Ok
<jekstrand>
Let me know when you've re-pushed with my A-B tag
<bl4ckb0ne>
thanks
<bl4ckb0ne>
already did
<jekstrand>
cool
<jekstrand>
bl4ckb0ne: One more problem: XML is still at 179.
<bl4ckb0ne>
oh it is
<bl4ckb0ne>
gitlab was hiding the xml diff
<bl4ckb0ne>
updated it
<jekstrand>
k
<jekstrand>
assigned marge
<bl4ckb0ne>
thanks!
<jekstrand>
bl4ckb0ne: yw.
<jekstrand>
bl4ckb0ne: Out of curiosity, any particular reason why you want 179+?
<bnieuwenhuizen>
obviously both based on a different interpretation of the spec :P
<jekstrand>
bnieuwenhuizen: Do they both pass CTS?
<jekstrand>
bnieuwenhuizen: Feel free to tell me my interpretation is wrong. :)
gouchi has quit [Remote host closed the connection]
<bnieuwenhuizen>
jekstrand: haven't explicitly tested but I believe they both would
<bnieuwenhuizen>
disagreement is whether the driver should also test permissions best effort or if it is really just a "yup these priorities are different"
<anholt>
would love to get yuv import covered on iris with !11193
mbrost has joined #dri-devel
<jekstrand>
bnieuwenhuizen: The intention of the spec is that the client shouldn't even try any priorities that aren't retrieved from the query.
mbrost has quit [Remote host closed the connection]
Plagman has quit []
Plagman has joined #dri-devel
<jekstrand>
It may still get INITIALIZATION_FAILED but only due to a genuine permissions issue and not a "We don't support that priority ever" issue.
mattrope has joined #dri-devel
<jekstrand>
bnieuwenhuizen: There's a part of me that's inclined to do the most useless implementation possible. Just return all the priorities.
<jekstrand>
bnieuwenhuizen: But my current implementation does the useful thing since i915 currently doesn't provide any way besides boosting to get a higher priority on a render node after the fact.
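(A sketch of the client-side flow jekstrand describes — query first, then only request an advertised priority — assuming the VK_EXT_global_priority_query / VK_EXT_global_priority struct names as of mid-2021; physical_device is assumed and error handling is elided.)

    /* Ask queue family 0 which global priorities it advertises. */
    VkQueueFamilyGlobalPriorityPropertiesEXT prio_props = {
        .sType = VK_STRUCTURE_TYPE_QUEUE_FAMILY_GLOBAL_PRIORITY_PROPERTIES_EXT,
    };
    VkQueueFamilyProperties2 family_props = {
        .sType = VK_STRUCTURE_TYPE_QUEUE_FAMILY_PROPERTIES_2,
        .pNext = &prio_props,
    };
    uint32_t count = 1;
    vkGetPhysicalDeviceQueueFamilyProperties2(physical_device, &count, &family_props);

    /* Only request a priority that came back from the query.  Even then,
     * VK_ERROR_INITIALIZATION_FAILED remains possible, but only for genuine
     * permission problems, not "never supported" priorities. */
    VkDeviceQueueGlobalPriorityCreateInfoEXT prio_info = {
        .sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_GLOBAL_PRIORITY_CREATE_INFO_EXT,
        .globalPriority = prio_props.priorities[0],   /* one of priorityCount entries */
    };
    VkDeviceQueueCreateInfo queue_info = {
        .sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,
        .pNext = &prio_info,
        .queueFamilyIndex = 0,
        .queueCount = 1,
        .pQueuePriorities = (float[]){ 1.0f },
    };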
choozy26 has joined #dri-devel
choozy26 has quit [Remote host closed the connection]
Duke`` has quit [Ping timeout: 480 seconds]
<bnieuwenhuizen>
jekstrand: yes, so the two competing implementations we have for RADV are the dumb one and the useful one :P
<jekstrand>
bnieuwenhuizen: Which do you like better?
<bnieuwenhuizen>
I'm like meh
<bnieuwenhuizen>
I confirmed the one user is ok with both
<bnieuwenhuizen>
so half inclined to go dumb just because it is less code
<jekstrand>
That and it forces people to think about what they're doing and not make assumptions.
<jekstrand>
Assuming, of course, that those people care about Linux.
<jekstrand>
But since "those people" are the Android core team.... I'd like to think they do. :)
Daanct12 has quit [Quit: Quitting]
ngcortes has quit [Remote host closed the connection]
karolherbst has quit [Quit: Konversation terminated!]
karolherbst has joined #dri-devel
* dcbaker
cries in we have `Test` in `interpreter.py` and `testSerialisation` in `backend.py`
aswar002 has joined #dri-devel
<jekstrand>
anholt: Is gitlab CI running any Intel Vulkan?
<jekstrand>
anholt: I pushed a Vulkan-only patch and it failed CI:
<airlied>
I think alyssa and zmike lead the conga line
<zmike>
no, no, I stopped being annoyed with it long ago
<jekstrand>
zmike: Did you just stop merging patches?
<zmike>
yea kinda
<zmike>
I try every day or two
<zmike>
sometimes less
<alyssa>
airlied: Merging hundreds of patches a month does that, yes.
<daniels>
jekstrand: the machine tanked at some point during a dEQP run
ngcortes has joined #dri-devel
<alyssa>
jekstrand: Is that an option? Should I try that?
<daniels>
jekstrand: it's failed 2 jobs of the last 200, so I would guess that there's some instability in mainline (or hardware e.g. battery dipping below critical point) that makes it disappear
<daniels>
airlied: either be annoyed with it silently or please at least help out with #3437 (if not the actual issues themselves) so we can all get it stable
<jekstrand>
and... now it's failing name resolution
<daniels>
jekstrand: that's been noted for fix
<daniels>
every single place which touches network has retry for this reason, apart from piglit's internal pull and ci-fairy's result upload
<daniels>
both are going to be fixed in short order
<jekstrand>
daniels: Given that every single item in #3437 is currently checked off except the weird glcpp one no one understands, "help with #3437" isn't a very useful response.
<zmike>
I think he meant in reporting issues
<daniels>
jekstrand: I mean at least dump it into the comments; the summary is laughably out of date for sure, but that's because of the tragedy of the commons, which will not be solved by more tragedy of the commons?
<zmike>
though my proposal to make a new 3437 still stands, as gitlab issues don't seem to have been designed towards having that many comments and it takes like a full minute for that page to load
<daniels>
there's no comment for piglit/ci-fairy not doing retry on network fail, but that has been noted and put on the list for the CI fairies to deal with this week
<airlied>
daniels: hey I think my employer is providing enough support for me to complain out loud
<daniels>
zmike: hmm, 1min is really pathological - it's definitely not quick for me by now, but far from 1min
<daniels>
airlied: *to your employer :P
<zmike>
shrug
<alyssa>
1min sounds right
<airlied>
I expect when CI reaches some form of stability we'll just add a bunch more hw and destabilise it again
<daniels>
I don't mind a new #3437, or trying to see if clicking through resolving all the issues makes it pull quicker on the frontend
<alyssa>
airlied: this
<airlied>
maybe we should find a line where we say enough
<alyssa>
and .. more APIs
<daniels>
airlied: so far that's been the sine wave, yeah
<daniels>
if you want to say enough, subscribe to the CI label for MRs and start drawing the lines
<alyssa>
airlied: ~~and more drivers for the same hw/API~~ oh wait you're crocus COI whoops
<jekstrand>
alyssa: :P
<alyssa>
oh and more test suites for the same system (deqp vs piglit vs khronos cts)
<alyssa>
"hardware x APIs x test suites" has really awful combinatorics.
<airlied>
yeah the 10m holy grail seems to have been dropped also
<airlied>
it's more like an hour if CI doesn't timeout somewhere
<airlied>
ah well back to writing more code to make it take longer :-P
<daniels>
way untrue.
<robclark>
alyssa: tbf the whole "spend a day or two bisecting all the new deqp regressions after you've been away for a week or two" is not the lesser of two evils here ;-)
<alyssa>
robclark: that's /still/ a risk, unfortunately..
<jekstrand>
When CI is working, it takes 10m or so and you get all the tests on all the hardware. It's great.
<jekstrand>
It's not 1hr
<alyssa>
Putting aside any details about this CI,
<alyssa>
"dEQP-GLES2, dEQP-GLES3, dEQP-GLES31, KHR-GLES2, KHR-GLES3, KHR-GLES31, piglit quickgl, piglit cl, dEQP-VK" x "Mali G52, Mali G72, Mali T860, Mali T760"
<airlied>
I suppose it's probably an hour waiting for marge to timeout on someone else's run
<jekstrand>
But when you have machines which are falling off their network or drivers with flaky tests, less great.
<alyssa>
is just the hardware I have to personally care about
<alyssa>
that's really hard!
<daniels>
airlied: an hour being the normal is complete bollocks
<airlied>
daniels: I don't think there is a normal, but 10m runs are rare
<airlied>
esp for anything that hits the hw
<airlied>
but mostly I suppose I've been stuck in marge-bot queues with problems
<daniels>
sure, that's because 10min has crept out to about 15min (which probably needs to be rebalanced), and 10min was never the goal for "you assign to marge -> it is merged"; it was the goal for the long pole of the last stage of hw testing, so +5min for your actual build
<daniels>
3 weeks of the last 5 have been catastrophically bad, due to a530 test badness + rpi hw badness + fdo storage issues + NM/PipeWire suddenly getting very enthusiastic about very long runtimes, but even with all those, it's still nowhere _near_ an hour on average, even at the very worst peak around European lunchtime
<jekstrand>
Part of the problem is, like politics, IT, and many other things, we only really know it exists when it's failing. The rest of the time, we assign Marge and forget about it. It's hard to get an actual perspective on it thanks to human psychology.
iive has quit []
<airlied>
jekstrand: indeed, as in when I check back 30mins later and it's all merged I forget about it
<airlied>
when I check back 30m later and it's behind marge saying CI took too long I register it
<airlied>
maybe we just need hw to spend longer in staging areas before going live or be quicker to yank it out completely if it starts flaking
<daniels>
jekstrand: luckily it is really easy to get an actual perspective on it by using the various scripts and snippets people have posted to query the pipelines and make your own analyses on fail rates (doing manual filters for when it catches legit fails, which is way more than you think) or mean/median/mode end-to-end time, or ...
<daniels>
airlied: we do both
<zmike>
here's a random q: would it be possible to set like a 30min total time threshold on a job and kill it if it exceeds that?
<daniels>
new hw lives in manual runs for a bit, and it also gets smashed to bits by manually-triggered runs, and then the people responsible for it sit on the results and look for patterns
<zmike>
I think that'd at least cut down congestion for flake jobs
<daniels>
it also gets yanked by people being yelled at about failures, mostly automatically but
<daniels>
zmike: right now it's 60m, I think 30m is p reasonable for a per-job timeout
<daniels>
would A-b a MR to tune it down
<zmike>
tbh I feel like 20 min should be reasonable, but idk what actual total times are like
<daniels>
well, it's a balancing act
<daniels>
when you have 10-12min runtimes, 20min runtimes puts you at the risk for spurious failure when you kill it at 99% because it's been griefed by load
<daniels>
then someone has to reassign it to marge and this is further proof CI is awful :P
<zmike>
yea that's why I figured 30 should be safe
<daniels>
seems p reasonable
<daniels>
60 covered a multitude of sins to begin with, but we're getting aggressive enough with the long tail of fail that 30 should work
mattrope has quit [Remote host closed the connection]
<daniels>
anyway, you're all Mesa developers, if you want something changed in Mesa then you're just as free to float MRs to change it as anyone else ...
<zmike>
DONE
<alyssa>
zmike: reviewed.
<zmike>
oof and ci exploded already
<alyssa>
How does the kernel handle developer volume vs regressions?
<daniels>
it doesn't
<robclark>
badly?
<alyssa>
right... my audio drivers can attest to that :(
<daniels>
y
<alyssa>
speaking of is rk3399-gru-sound still broken? yes, very much so.
neonking has quit [Remote host closed the connection]
<airlied>
alyssa: the kernel just YOLOs
<daniels>
merged on a whim / no regressions / no central dictation of development priority
<daniels>
pick any two
<airlied>
I pick pikachu and squirtle
<alyssa>
pika pika!
<bl4ckb0ne>
thanks
<daniels>
airlied: glx@glx-wait-msc,Timeout
mattrope has joined #dri-devel
<robclark>
it could perhaps be useful to have a step between manual pipelines and must-pass-to-merge.. ie. for new or less stable hw, etc.. run the CI job but don't block the CI.. that would at least give some more testing, and a chance for someone to look at the result and decide "yeah, you actually broke that hw, I'm reverting your MR"
<jekstrand>
If machines are falling off the network, I'm not sure that helps. If we have flaky drivers, then, yeah, they shouldn't be in must-pass CI.
<robclark>
well, pass or fail is sum total of the flakes, but yeah if test infra issues then that isn't a justification to revert an MR
<daniels>
jekstrand: in the gap between those two absolutes is the internet
<daniels>
jekstrand: unclear whether it's USB ethernet badness, or cable badness, or switch badness, or just the internet being painful; my guess is on both #1 and #4
<jekstrand>
daniels: Yeah, having the entire internet in the middle doesn't help.
<airlied>
but also usb ethernet
<daniels>
airlied: Chromebooks
<daniels>
would you prefer wifi? :P
<bnieuwenhuizen>
ethernet over serial :P
<robclark>
we can use LTE now :-P
<daniels>
...
<jekstrand>
But the fact that I've seen 3 "fall off the network" fails from the same class of machine in an hour likely means it's not just the internet, generally.
<alyssa>
If a machine doesn't get to userland (network failure, boot failure, ....), methinks it should be skipped instead of blocking CI
<jekstrand>
robclark: Don't use the 5G. Didn't you hear? It gives you coronavirus.
<robclark>
:-P
<daniels>
alyssa: it _does_ get to userland
<alyssa>
jekstrand: If you're double vaccinated you can use the 5G, the CDC said so.
<airlied>
daniels: it would be lols if wifi was more stable than the usb ethernet
<daniels>
alyssa: it fails long after it's got an IP through DHCP, pulled the rootfs, etc etc; at some much later stage, a random request fails
<daniels>
alyssa: it already does get silently retried behind the scenes if it fails to make it as far as executing tests
<alyssa>
daniels: Sure, let me amend that -- the/only/ reason pre-merge should fail is if a job actually reported that there are a test regression.
<daniels>
airlied: you've used Intel wifi, right?
* airlied
is doing crocus testing over wifi because I don't have enough ethernet cables :-P
<daniels>
alyssa: I'm on the fence about that
<alyssa>
Arguably CI reports should really be a tristate, "definitely passes", "definitely fails", and "inconclusive"
<daniels>
sure
<daniels>
but what do you do with inconclusive?
<alyssa>
Right now we're mapping inconclusive to fail and it's causing burnout.
<jekstrand>
daniels: Rather wifi? That depends on the USB ethernet. If it's the one in my USB-C dock, it routinely falls off the bus if I try to do something heavy like, say, rsync a kernel build.
<airlied>
you map inconclusive to reassign to marge-bot :-P
<airlied>
and watch it loop until someone notices
<alyssa>
I would say report a CI warning in gitlab, but marge bot means that nobody actually looks at CI results unless the job fails.
<daniels>
airlied: or, instead of wasting rebuild cycles and everyone's time, you keep on working your way through the causes of spurious fails like people have been, and you do things like insert retries into network ops
<daniels>
alyssa: I don't see how mapping inconclusive to success is any less frustrating
<daniels>
alyssa: as you say, no-one will ever look at or care about anything unless it's visibly in their face
<jekstrand>
airlied: I might have bought that once or twice and have a 16-port switch on my desk.....
<alyssa>
daniels: It's visibly in the /wrong/ person's face
<airlied>
jekstrand: my 16 port switch needs more ports, though I suppose I could put crocus machines on a 100mb 8-port
<alyssa>
The intersection of "understands arcane NIR details" and "understands arcane LAVA details" is... er, Emma
<daniels>
alyssa: so Mali boards flake 30% of the time and no-one really cares, then jekstrand marges a core NIR change which genuinely breaks Panfrost, it flakes its way through to happy inconclusive success, and the next time you try to merge a non-functional-change Panfrost change, it gets rejected because surprise, NIR is broken
<daniels>
(smash retry until it passes)
<jekstrand>
airlied: I'm pretty sure Amazon can fix that one for you too. :)
<alyssa>
So why, when there's an infrastructure problem triggered in a random NIR MR, is the person getting hurt by that the NIR author and not the LAVA one?
* jekstrand
would hever merge a change which breaks panfrost. All my patches are perfect!
<alyssa>
jekstrand: luv you
<airlied>
jekstrand: I'm trying to get it to fix my no cherryview problem first :-P
* jekstrand
waits for craftyguy to show up and murder him.
<alyssa>
daniels: Are there any Panfrost driver flakes extant?
<daniels>
alyssa: if there are spurious fails, someone whose fault it isn't gets hurt
anarsoul has quit [Ping timeout: 480 seconds]
<alyssa>
or @ anyone, if you see Panfrost flake (and not the LAVA farm infrastructure issues, genuine spurious fail in the dEQP report), please tell me so I can deal with it, and I'm sorry in advance if this happens, that's on me.
<daniels>
either that's the immediately proximate person (who can retry for a less-magic-8-ball result), or it's some distant person in the future who has less chance of obtaining an actual answer
mcan06[m|gr] has joined #dri-devel
<daniels>
alyssa: I mean likewise, please do be telling the infra people if there are infra failures so we can deal with it, and we're sorry in advance if that happens, that's on us :)
mcan06[m|gr] has quit [Remote host closed the connection]
<jekstrand>
Well, most of the fails I'm seeing today are APLs falling off the internet.
<jekstrand>
Which I guess counts as infra
<jekstrand>
Not sure whose infra or what the infra problem is.
<jekstrand>
For all I know, someone's cat has been chewing on the USB adapter
<alyssa>
Meow.
<daniels>
jekstrand: ours, undefined network, if it's showing up frequently enough to be a blocking issue for you then please assign Marge an MR which disables those jobs with my R-b
<kisak>
I've had more internet failures from rabbits eatting fiber optic line than anything else
<daniels>
and we'll bring it back when we're confident that we've bottomed out whatever issue it has been
<alyssa>
!11246 flaked for the 4th time today.
<alyssa>
ths time it's iris-apl-egl
<alyssa>
disabling that job brb
* jekstrand
is trying to figure out how to disable jobs
<alyssa>
Prepend the job name with a dot
<daniels>
^
<alyssa>
MR submitted.
<daniels>
the fact that 100% of the APL failures have been network errors at the very end of the job, and that it's isolated to APL rather than any of the machines in the same rack/room/building, makes me think that there's some kind of USB autosuspend badness going on
<jekstrand>
makes sense
<jekstrand>
alyssa: MR?
<alyssa>
!11255
<alyssa>
I personally saw -egl fail but dropped all the APL jobs for good measure if it's infra
<jekstrand>
I've seen -gles3 fail too
<daniels>
if it's infra, it's not going to be API-sensitive
<jekstrand>
yup
<alyssa>
daniels: I guess where we're coming from is just the combinatorics. If a given machine fails once a year, but it takes a day to deal with in total, given we have hundreds of machines that means CI is broken almost always.
<alyssa>
(Hundreds? I've never counted but I imagine across all the different labs and gitlab runners it adds up.)
<alyssa>
(Maybe 100? Numbers are hard. The point still stands.)
<daniels>
again 'almost always' is a million miles away from the actual numbers
<alyssa>
So we're very wary of the strategy of just fixing more problems because the root cause isn't a particular network failure, it's that at the scale we do CI there will /always/ be problems
<daniels>
I know
<alyssa>
Would it help if I keep a log from a dev point of view of MRs I merge and the outcomes?
<jekstrand>
I think marge already keeps that log for us
<daniels>
yep, and you can also pick up the scripts already posted to do some graphing and analysis if you want to get fancy with it
<daniels>
(as long as you have a manual filter for jobs which are legit fails)
<alyssa>
that's the rub..
anarsoul has joined #dri-devel
<alyssa>
Looking at my recent merged MRs labeled with NIR --
<alyssa>
!11199 went in one try
anarsoul has quit [Remote host closed the connection]
<alyssa>
!10411 2 tries, dEQP-GLES31 flake on Panfrost which tomeu/bbrezillon found, related to indirect draw stuff
<alyssa>
!10022 one try
anarsoul has joined #dri-devel
<alyssa>
!10601 2 tries but not obvious if that was flake or fail
<alyssa>
!10578 2 fails
<alyssa>
er
<alyssa>
er 1 try, sorry mixed up
<alyssa>
!10391 1 try.
<alyssa>
ok, so that's nearly as bad as it feels.
<alyssa>
although...
andrey-konovalov has quit [Ping timeout: 480 seconds]
aswar002 has quit [Ping timeout: 480 seconds]
<alyssa>
!10022 was over an hour in the queue before marge pushed any commits
<alyssa>
!10601 was about 45 minutes in the queue before getting a commit pushed for one of the two tries
<robclark>
there are defn times when a lot of MRs are submitted in short order, and those 10-15mins add up