ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
co1umbarius has joined #freedesktop
columbarius has quit [Ping timeout: 480 seconds]
DragoonAethis has quit [Quit: hej-hej!]
DragoonAethis has joined #freedesktop
marcheu_ has joined #freedesktop
lkundrak has quit [Remote host closed the connection]
mslusarz has quit [Remote host closed the connection]
glisse_ has quit [Remote host closed the connection]
glisse has joined #freedesktop
marcheu has quit [Ping timeout: 480 seconds]
lkundrak has joined #freedesktop
mslusarz has joined #freedesktop
karolherbst_ has joined #freedesktop
karolherbst has quit [Ping timeout: 480 seconds]
lkundrak_ has joined #freedesktop
lkundrak has quit [Read error: Connection reset by peer]
mslusarz has quit [Read error: Connection reset by peer]
mslusarz has joined #freedesktop
glisse has quit [Read error: Connection reset by peer]
glisse has joined #freedesktop
marcheu_ has quit [Remote host closed the connection]
marcheu has joined #freedesktop
Leopold_ has joined #freedesktop
damian has quit [Read error: Connection reset by peer]
<bentiss>
the switch to wireguard-native involves a short downtime while we reboot all the machines
<bentiss>
I'd also like to upgrade rook to 1.11, it has better metrics gathering, but I'd like to do this once the flapping has stopped
<ishitatsuyuki>
looks like I'm a developer on mesa/mesa, but not actually a member of the group "mesa", so I don't have CI-OK. should I ask to get added to the group "mesa" instead?
<bentiss>
ishitatsuyuki: either in mesa as guest at least, or in mesa/ci-ok, yes
<eric_engestrom>
robclark, MrCooper, bentiss: I was talking about that with someone a couple of days ago because I thought the same, but the problem with having the farm toggle in another repo is that it's great for disabling, but not for re-enabling: if some test got broken in the meantime, re-enabling the farm ends up breaking random unrelated MRs, so re-enabling needs to go through an MR
<eric_engestrom>
our conclusion was that disabling a farm should be a direct push; that will make the MR currently being merged fail, but it would probably have failed anyway since the farm is down, and this way it doesn't have to wait for the whole queue to be either processed or manually flushed. Re-enabling a farm needs to be done in an MR so that anything that broke in the meantime gets fixed before it's turned back on
<MrCooper>
hmm yeah, good point
<bentiss>
eric_engestrom: Good point. Maybe I am over-engineering it, but I think it would be nice to have each farm in a child pipeline. If the separate repo says that the farm is offline -> the pipeline is not generated for this farm. If it's online but the last main pipeline was allowed-to-fail and failed -> mark it as allowed-to-fail, and if the last one was a success -> mark it as required
<bentiss>
this way, you can disable the farms in a separate repo, each farm gets its own pipeline (so it's clearer for users), and you can re-enable them without making them block if the farm has been down for some time
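(A minimal sketch of what such a generator job could look like. Everything here is an assumption for illustration: the farms.yml file, its URL, the job layout, and run-farm-tests.sh are all made up, not an actual freedesktop.org setup.)

```python
# Hypothetical generator job: emits one child pipeline per farm based on
# an external status file kept in a separate repo.
import urllib.request

import yaml  # PyYAML


# Hypothetical raw URL of the farm status file in the separate repo.
STATUS_URL = "https://gitlab.freedesktop.org/example/farm-status/-/raw/main/farms.yml"


def fetch_farm_status():
    # e.g. {"lima": {"online": True, "last_main_ok": False}, ...}
    with urllib.request.urlopen(STATUS_URL) as resp:
        return yaml.safe_load(resp)


def child_pipeline_for(farm, state):
    """Build the YAML for one farm's child pipeline.

    offline                          -> no pipeline at all (handled by caller)
    online, last main pipeline failed -> jobs marked allow_failure
    online, last main pipeline passed -> jobs required
    """
    return {
        f"test-{farm}": {
            "stage": "test",
            "tags": [farm],
            "script": [f"./run-farm-tests.sh {farm}"],  # hypothetical script
            "allow_failure": not state.get("last_main_ok", True),
        }
    }


def main():
    farms = fetch_farm_status()
    for farm, state in farms.items():
        if not state.get("online"):
            continue  # farm disabled in the separate repo: no pipeline generated
        with open(f"child-{farm}.yml", "w") as f:
            yaml.safe_dump(child_pipeline_for(farm, state), f)


if __name__ == "__main__":
    main()
```

(The parent pipeline would then pick up each generated child-{farm}.yml as an artifact of this job and run it via a `trigger:` job, which is GitLab's dynamic child pipeline mechanism.)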
woodwose has quit [Ping timeout: 480 seconds]
<eric_engestrom>
bentiss: I've never used a child pipeline so I have no idea how it works and what its constraints are, but your solution sounds reasonable
<bentiss>
eric_engestrom: we use them in ci-templates if you want to have a look at how they look from the UI point of view
<eric_engestrom>
also, I just caught up with #dri-devel and I see that alyssa already said this there, so s/someone/alyssa/ in my earlier message :P
<bentiss>
eric_engestrom: but the interesting bit is that you can also generate the child pipeline in a job, so it gets dynamic
<eric_engestrom>
which is where "pull this external file and decide whether these jobs exist or not" comes in
<eric_engestrom>
interesting yeah
<bentiss>
exactly :)
thaller has quit [Ping timeout: 480 seconds]
AbleBacon has quit [Read error: Connection reset by peer]
<daniels>
hakzsam: the machine died during the first run, so it started a new one
<hakzsam>
ok
Leopold_ has quit [Remote host closed the connection]
Leopold_ has joined #freedesktop
<eric_engestrom>
btw, I rewrote the pipeline trace jq script in python so that it fits in ci-fairy instead of being a job script that each repo would have to copy (and made a few improvements, the main ones being making the format compatible with modern perfetto and adding queue time): https://gitlab.freedesktop.org/eric/ci-templates/-/commits/ci-fairy-pipeline-trace
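(The core of such a script is small. This is a rough reconstruction of the idea, not the actual ci-fairy code: it uses the real GitLab jobs endpoint, but the project/pipeline IDs and token below are made up, and real code would paginate past 100 jobs.)

```python
# Fetch the jobs of a pipeline from the GitLab REST API and emit them in
# the Trace Event Format that Perfetto / chrome://tracing can load.
import json
import urllib.request
from datetime import datetime

API = "https://gitlab.freedesktop.org/api/v4"


def iso_to_us(ts):
    # GitLab timestamps look like "2023-04-20T12:34:56.789Z"
    return int(datetime.fromisoformat(ts.replace("Z", "+00:00")).timestamp() * 1_000_000)


def pipeline_trace(project_id, pipeline_id, token):
    req = urllib.request.Request(
        f"{API}/projects/{project_id}/pipelines/{pipeline_id}/jobs?per_page=100",
        headers={"PRIVATE-TOKEN": token},
    )
    jobs = json.load(urllib.request.urlopen(req))

    events, stage_ids = [], {}
    for job in jobs:
        if not job.get("started_at") or not job.get("finished_at"):
            continue  # jobs that never ran have no timing to plot
        # Group jobs by stage: one trace "thread" per stage.
        tid = stage_ids.setdefault(job["stage"], len(stage_ids))
        start = iso_to_us(job["started_at"])
        if job.get("queued_duration"):
            # One slice for the time spent waiting for a runner...
            queued = int(job["queued_duration"] * 1_000_000)
            events.append({"name": f"{job['name']} (queued)", "ph": "X",
                           "ts": start - queued, "dur": queued,
                           "pid": 1, "tid": tid})
        # ...and one for the actual run time.
        events.append({"name": job["name"], "ph": "X", "ts": start,
                       "dur": iso_to_us(job["finished_at"]) - start,
                       "pid": 1, "tid": tid})
    return {"traceEvents": events}


if __name__ == "__main__":
    # Made-up project/pipeline IDs and token, for illustration only.
    print(json.dumps(pipeline_trace(176, 12345, "glpat-...")))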
<eric_engestrom>
but I'm hitting an issue where `when: always` or `rules: [when: always]` doesn't work and the job gets skipped if any previous job is cancelled (I'm cancelling most jobs in my test pipelines to avoid burning resources for nothing): https://gitlab.freedesktop.org/eric/mesa/-/commits/ci-export-pipeline-trace
<eric_engestrom>
bentiss, daniels: ^ since you were interested :)
woodwose has joined #freedesktop
<bentiss>
eric_engestrom: do we care about canceled pipelines though?
<bentiss>
actually.... maybe we can use hookiedookie for that: we register a pipeline event, and when the pipeline is terminated, hookiedookie calls ci-fairy pipeline_trace which then adds a comment on the MR
<eric_engestrom>
beyond me testing my code, I was thinking about an MR where "things" take too long and Marge times out, and the user cancels it to avoid burning resources, but we still want to generate the trace to see why "things" took too long
<bentiss>
yeah, so it should be out of the pipeline. A webhook with hookiedookie might do the job :)
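(A minimal sketch of what that webhook consumer could look like, using Flask for brevity; hookiedookie's actual configuration and ci-fairy's exact flags are assumptions here, not its real interface.)

```python
# On a GitLab "Pipeline Hook" event for a finished pipeline, run ci-fairy
# to generate and post the trace.
import subprocess

from flask import Flask, request

app = Flask(__name__)
FINISHED = {"success", "failed", "canceled"}


@app.route("/gitlab/pipeline", methods=["POST"])
def pipeline_event():
    if request.headers.get("X-Gitlab-Event") != "Pipeline Hook":
        return "ignored", 200
    payload = request.get_json()
    attrs = payload["object_attributes"]
    if attrs["status"] not in FINISHED:
        return "not finished yet", 200
    # Hypothetical invocation: the exact ci-fairy subcommand flags may differ.
    subprocess.run(
        ["ci-fairy", "pipeline-trace",
         "--project", str(payload["project"]["id"]),
         "--pipeline", str(attrs["id"])],
        check=False,
    )
    return "ok", 200
```

(Since the hook fires on any terminal status, including "canceled", this also covers the cancelled-pipeline case that `when: always` couldn't.)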
<eric_engestrom>
sounds good :)
<eric_engestrom>
I'll MR my ci-fairy change then, and let you convert my mesa ci commit into a hookiedookie thing :)
<bentiss>
heh. no guarantees I can do that today or before next week
<bentiss>
once we've got the json output, what tool do you call to get a proper png?
<bentiss>
actually... can't we use gitlab's graph facility instead of a plain json?
<eric_engestrom>
sure why not, but I worry that it might be huge and/or unreadable
<eric_engestrom>
(there's a *lot* of jobs)
<bentiss>
heh, maybe :)
<eric_engestrom>
bentiss: "heh. no guarantees I can do that today or before next week" we've lived without this for a long time, there's no rush :P
<eric_engestrom>
in the meantime any of us can run `ci-fairy pipeline-trace` locally
<daniels>
eric_engestrom: oh nice, thanks!
<daniels>
bentiss: and awesome work finding the s3 issue - the OSDs were really just sniping each other? ha
vbenes has joined #freedesktop
<bentiss>
daniels: I guess it's a combination of a lot of things. But as soon as they start flapping, the cluster enters a state where it detects slow ops everywhere.
<bentiss>
I haven't seen a slow op since this morning
<daniels>
\o/
<bentiss>
yep :)
<bentiss>
I'll remove the noup/nodown flags once the backups have finished
<bentiss>
(and once I've removed the last 1TB backup from last week)
thaller has joined #freedesktop
vbenes has quit [Ping timeout: 480 seconds]
lack has quit [Ping timeout: 480 seconds]
vbenes has joined #freedesktop
vbenes has quit [Remote host closed the connection]
vkareh has joined #freedesktop
karolherbst_ is now known as karolherbst
fr0hike has joined #freedesktop
nehsou^ has quit [Remote host closed the connection]
<bentiss>
sigh... I removed the noup/nodown flags half an hour ago, and slow ops are popping up :/
woodwose has quit [Ping timeout: 480 seconds]
<bentiss>
HEALTH_OK -> upgrading rook
woodwose has joined #freedesktop
woodwose has quit [Remote host closed the connection]
woodwose has joined #freedesktop
lack has joined #freedesktop
jramsay has joined #freedesktop
lack has quit [Ping timeout: 480 seconds]
karolherbst_ has joined #freedesktop
karolherbst has quit [Ping timeout: 480 seconds]
MrCooper has quit [Remote host closed the connection]
MrCooper has joined #freedesktop
MrCooper has quit [Remote host closed the connection]