ChanServ changed the topic of #freedesktop to: infrastructure and online services || for questions about projects, please see each project's contact || for discussions about specifications, please use or
co1umbarius has joined #freedesktop
columbarius has quit [Ping timeout: 480 seconds]
<daniels> Venemo: …
agd5f has joined #freedesktop
agd5f_ has quit [Ping timeout: 480 seconds]
jarthur has quit [Quit: Textual IRC Client:]
johnandmegh has joined #freedesktop
jarthur has joined #freedesktop
krushia has joined #freedesktop
ximion has quit [Quit: Detached from the Matrix]
craftyguy has quit [Ping timeout: 480 seconds]
damian has quit [Read error: Connection reset by peer]
damian has joined #freedesktop
johnandmegh has quit []
craftyguy has joined #freedesktop
<dj-death> anybody knows what's up with navi21 builders?
feto_bastardo has quit [Quit: Ping timeout (120 seconds)]
feto_bastardo has joined #freedesktop
<MrCooper> Venemo: Marge isn't actively involved with the CI in any way, she's just the messenger
<daniels> dj-death: that’s one for mupuf
<mupuf> dj-death: hmm, let me check
<mupuf> I see, 100% failure rate in the past night!
<mupuf> let me reboot the gateway!
tzimmermann has joined #freedesktop
AbleBacon has quit [Read error: Connection reset by peer]
<mupuf> it's back up
<mupuf> and seems to be working
<mupuf> sorry about this
slomo has quit [Quit: The Lounge -]
slomo has joined #freedesktop
<mupuf> For the record, this failure mode was never seen before... hence why my monitoring did not alert me
<mupuf> this should be fixed now
<mupuf> \o/ * keywords-gfx10-navi21-3: EXECUTOR_SETUP_FAIL, EXECUTOR_DOWN -
<mupuf> caught in two ways now :D
<bentiss> daniels: ping... not sure if you had time to read my long writeup from yesterday, but still asking: anything against using Fedora CoreOS as the base distro for runners and k3s?
<emersion> a bit worried about stability and ability for others to jump in
<emersion> the current infra is a lot to learn, if new admins also need to learn coreos along the way…
<bentiss> well... The more I think of it, the more I believe we should be running gitlab-runner in a container, so there is not much to learn IMO, it'll be a podman container controlled by a systemd unit, "almost" like today :)
<bentiss> emersion: and basically coreos is "do not install random packages"
<emersion> is it really better than, say, just regular fedora?
<emersion> k3s will not run in a container, right?
<mupuf> emersion: from a security PoV, it means that if you get pwned, you can reboot and go back to the state you had
<bentiss> emersion: yes: the set of deployed packages is tested beforehand, you can roll back to the previous installation in case anything goes wrong, and jumping between major versions is a regular update (that can then be rolled back)
<mupuf> of course, you'll still need to fix the entry point
<bentiss> emersion: I'm running silverblue (fedora coreos for desktop) for quite a few years now, and I wouldn't go back.
<emersion> i've heard… not good things about silverblue
<emersion> coreos is the hot new thing, fedora is boring -- i've learned that boring stuff is reliable stuff
<bentiss> simply being able to test the new major, and roll back to the old one when it breaks one component, is something I couldn't live without now
<emersion> anyways, just playing the devil's advocate here
<bentiss> emersion: coreos was there before silverblue FWIW
<bentiss> emersion: one other way to see that is how many people are actually doing stuff on the runners? I know you guys do a service restart from time to time, or a reboot, but that's mainly it, no? (I don't do much more TBH)
<emersion> i had to debug the runners in the past, and having them on coreos would be an additional barrier for me
<bentiss> the runners are basically fire and forget, and when they are pwned we kill them and respin, so having an automatic update in the background is a nice thing too IMO
<emersion> sorry not the runner, the bare-matal host
<emersion> metal*
<bentiss> emersion: which bare-metal are you talking about?
<bentiss> gabe?
<emersion> no
<emersion> the machines running k3s
<emersion> the quinix debian boxes
<emersion> equinix
<bentiss> damn, I can't remember why you had to do this...
<emersion> i must admit, it was a long time ago, i'm not doing much stuff now
<emersion> k3s was super borked at some point
<bentiss> well, TBBH, as much as I like Fedora, I wouldn't put a production server on it. Because things can go wrong quite easily. But Fedora CoreOS is different in that there is actual QA happening before releasing the set of packages, meaning that the chances of a failing system are much lower
<bentiss> (and you can rollback)
<bentiss> And debian is nice, but way too old for the packages that we need: up to date podman. I think that's the bit that is biting us on the current arm runners, and why we need to reboot them every once in a while
blatant has joined #freedesktop
<svuorela> you get a new debian next month
<mupuf> svuorela: good, but it will be out of date within 6 months ;)
<bentiss> and we also need to wait for equinix to make the images available on their bare metal hosts
<bentiss> which will take one year roughly
<mupuf> and honestly, when Debian releases a new version that has unmaintained packages, it really drives home the fact that its policy of stability basically boils down to "let's not update"
<mupuf> unmaintained packages upstream*
<emersion> if we want to stick to debian, the answer would be something like
<bentiss> emersion: last time it was not available for aarch64
<mupuf> yeah, PPAs are always the answer for debian-based systems... or enabling parts of testing/sid in stable... which no one actually really tested together
<bentiss> emersion: I stand corrected: too old version
<bentiss> emersion: that's the one I use on the aarch64 runners, and it's breaking badly every month or 2
<bentiss> and FTR, podman allows us to use harbor to reduce the registry egress costs by *a lot* (but we are still paying $1400 a month)
<mupuf> <-- how can they still ship influxdb 1.6.7 (which BTW, never existed upstream), and never bothered to update past that?
<mupuf> there's been over 100 releases since that point :D
<mupuf> IMO, if they cared about providing a stable and secure system, they would not include outdated packages in a new release. Just drop it in unstable, then testing, then stable. If no one updated the package by then, then you truly don't have the bandwidth to deal with patching the package when security issues are raised
<emersion> i'd be the first to say that i don't like debian, if it was up to me, i'd use alpine x)
<bentiss> emersion: -> not fit for kubernetes
<emersion> i use alpine in production with kubernetes
<emersion> the DNS-over-TCP issue is addressed btw
<bentiss> oh, good to know then :)
<bentiss> still, using alpine also has the problem of "let's pull random packages, and hope for the best"
<emersion> you shouldn't need to pull random packages
<bentiss> well, you can't control the versions of the packages and see for the interactions between each other
<bentiss> which is what coreos, silverblue, chromeos is doing: you get a set of packages that are tested, and you can trust the deployment
<emersion> it would be like debian, except not outdated
<bentiss> I'm not talking about grabbing random packages, but random packages version
<emersion> but i wasn't really recommending alpine here, just mentioning what i'm familiar with
<bentiss> like debian or fedora
<bentiss> emersion: and I appreciate you are giving your opinion
Leopold_ has joined #freedesktop
pkira has joined #freedesktop
<DragoonAethis> bentiss: on the other hand, I played with the idea of immutable OSes and liked it a lot
dcunit3d has joined #freedesktop
<DragoonAethis> As long as all workloads are containerized properly, everything just ends up working most of the time without much manual poking
<bentiss> DragoonAethis: yes, that's what I believe too
<DragoonAethis> Even better, it forces people not to quickly hack up something on the host because it's easier and surely will be cleaned up later (which we're struggling with in our team, a lot)
<Venemo> MrCooper, daniels I know marge is just the messenger. what I mean is, could we get it to show a different message when the problem was in the CI system and with the MR
<Venemo> MrCooper, daniels: I know marge is just the messenger. what I mean is, could we get it to show a different message when the problem was in the CI system and NOT with the MR
<eric_engestrom> fwiw on my dev machines I prefer the arch way of always using the latest version of everything and updating piecemeal all the time, but for servers I think the image model (eg. coreos) where you get one big thing where everything is tied together and you swap that for the next one is really good (and rolling back is the exact same operation, just picking the other image, which is really valuable to be safe to try
<eric_engestrom> updating regularly and not have to wait for others to have hit issues and fixed them)
<eric_engestrom> Venemo: what do you mean? what would be the MR problems if not CI (except "can't rebase because there's a conflict")?
<Venemo> eric_engestrom: I mean for example when the CI system fails without actually running any tests
<eric_engestrom> if you mean the marge bug where she looks at the wrong pipeline, if we could detect that to print a different message we could also make it look at the other pipeline :]
<Venemo> I don't mean a marge bug
<Venemo> I mean the type of bug that was discussed above
<eric_engestrom> hold on sorry, I guess I skimmed too fast; reading back
<Venemo> sometimes a CI job just doesn't even start the CTS and simply returns a failure
<Venemo> it has happened many times by now
<Venemo> and it's annoying because every time I have to dig into the logs and then I realize it's not my mistake
<eric_engestrom> ah, you mean "test didn't run" (eg. runner unresponsive, etc.) vs "test ran but failed"?
<eric_engestrom> gitlab reports these as failures either way, and I'm not sure it has any additional field to differentiate them that marge could read
<MrCooper> yeah, all Marge knows is "CI pipeline green/red"
<eric_engestrom> yeah is "test ran and failed" as far as gitlab knows
<eric_engestrom> there's a feature request somewhere in gitlab's issues to add support for differentiating failures (using different exit codes), but iirc gitlab hasn't started considering adding this yet
<eric_engestrom> I don't know if other CI systems (eg. GitHub Actions) support multiple kinds of failures
<eric_engestrom> but I agree that it would be nice if it was possible to report more than "pass/fail" (or "pass/warn" when configured to ignore failures)
<eric_engestrom> a workaround that should be possible is to add a list of regex -> message pairs in marge where it reads the log of every job that failed and if a regex matches it prints the extra message
<eric_engestrom> (I don't know marge's codebase at all so I don't know how easy/hard it would be to implement)
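The regex -> message workaround described above could look roughly like the following. This is only a sketch: it does not use Marge's actual codebase, `HINTS`, `extra_messages` and the job names are hypothetical, and the patterns are illustrative rather than taken from real job logs.

```python
import re

# Hypothetical regex -> message pairs. The patterns are examples only; real
# ones would have to be collected from actual failed job logs.
HINTS = [
    (re.compile(r"Job failed \(system failure\)"),
     "The runner reported a system failure; the tests may never have run."),
    (re.compile(r"No space left on device"),
     "The runner ran out of disk space; this is likely an infra problem."),
]


def extra_messages(failed_job_logs):
    """Return the hint messages whose regex matches at least one failed job log.

    failed_job_logs maps a job name to the text of that job's log.
    """
    messages = []
    for pattern, message in HINTS:
        if any(pattern.search(log) for log in failed_job_logs.values()):
            messages.append(message)
    return messages


if __name__ == "__main__":
    logs = {
        "some-test-job": "ERROR: Job failed (system failure): timeout",
        "another-test-job": "dEQP-VK.example.test: Fail",
    }
    for line in extra_messages(logs):
        print(line)
```

A job that fails for a genuine test failure matches no pattern and gets no extra message, so the existing "pipeline failed" report stays unchanged in that case.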
vivia has quit [Quit: leaving]
slomo has quit [Quit: The Lounge -]
slomo has joined #freedesktop
<Venemo> MrCooper, eric_engestrom: yes, that is exactly what I'm suggesting. if gitlab doesn't support that, that's unfortunate
<mupuf> It wouldn't be a bad idea for Marge to detect some transient errors and auto-retry some jobs though
<mupuf> but it can be hard to distinguish genuine infra flakiness from a regression, especially automatically
<mupuf> and infra instabilities are best handled by... fixing the issues
<mupuf> although the gitlab ones are a bit hard to address :D
<mupuf> but anyway, something along those lines could be investigated
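The auto-retry idea above might be sketched as a bounded retry decision. Everything here is an assumption for illustration: `TRANSIENT`, `MAX_RETRIES` and `should_retry` are hypothetical names, and the patterns are guesses at what a transient infra error could look like. As noted above, a pattern match cannot really prove that a failure is not a regression, hence the small retry budget.

```python
import re

# Hypothetical signatures of transient infra errors (illustrative only).
TRANSIENT = re.compile(
    r"connection reset by peer|connection timed out|no space left on device",
    re.IGNORECASE,
)

# Keep the budget small so that a real regression is not retried forever.
MAX_RETRIES = 2


def should_retry(job_log, attempts_so_far):
    """Retry only if the log looks transient and the retry budget is not spent."""
    if attempts_so_far >= MAX_RETRIES:
        return False
    return TRANSIENT.search(job_log) is not None
```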
<daniels> Venemo: unless the MR is at fault because it makes jobs run for way too long …
ximion has joined #freedesktop
<daniels> bentiss: yep I’ll reply shortly, thanks :)
<bentiss> daniels: thx!
<Venemo> daniels: that would be yet another different case
<daniels> mupuf: might be worth replicating the structured data artifact stuff inside b2c; we put that there so we could do automated analysis of what happened inside the job
<daniels> including surfacing through Marge ‘this test failed’, ‘100 tests failed and I gave up at this point’, ‘the device kept hanging’, etc
<mupuf> daniels: oh, do you have a link?
dcunit3d has quit [Remote host closed the connection]
<daniels> fwiw most of Collabora is currently here this week, so it’s firefighting only until we get back
<daniels> mupuf: errr there was an MR I thought you were CCed on
<mupuf> oh... sorry about that if this was the case
<mupuf> no worries, I'll look it up!
<Venemo> daniels: nice, enjoy the beach :) may I ask where it is?
dcunit3d has joined #freedesktop
<daniels> oh you weren’t CCed, that’s probably why you didn’t comment …
<daniels> next step is getting JSON out of deqp-runner so we can include that per-run
<daniels> Venemo: Faro, southern Portugal - and we have been, thanks :)
<Venemo> cool
<daniels> slightly better than British beaches …
<Venemo> Portuguese beaches are not the hottest yeah
<bentiss> daniels: nice! enjoy!
<daniels> thanks!
<mupuf> daniels: enjoy your time!
<mupuf> ok, so what you want is an execution report in JSON format
<mupuf> would be good to document the format and what we want there, if we want machines to start parsing it
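As a starting point for documenting such a format, a machine-readable execution report could be sketched like this. The schema is entirely hypothetical: it is not the existing structured data artifact nor anything deqp-runner emits, and the field names and outcome values are assumptions loosely modelled on the cases mentioned above (a test failed, the run gave up after many failures, the device kept hanging).

```python
import json
from dataclasses import asdict, dataclass, field

# Hypothetical set of outcomes a report is allowed to carry.
OUTCOMES = {"passed", "tests_failed", "gave_up", "device_hung", "infra_error"}


@dataclass
class ExecutionReport:
    outcome: str
    failed_tests: list[str] = field(default_factory=list)
    detail: str = ""

    def __post_init__(self):
        # Reject values outside the documented set so that parsers can rely on it.
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome}")


def dump_report(report):
    """Serialize a report to a JSON string with stable key order."""
    return json.dumps(asdict(report), sort_keys=True)


def load_report(raw):
    """Parse a JSON string produced by dump_report back into a report."""
    return ExecutionReport(**json.loads(raw))


def one_line_summary(report):
    """Short human-readable summary, e.g. for a bot comment on an MR."""
    if report.outcome == "passed":
        return "all tests passed"
    if report.outcome == "tests_failed":
        return f"{len(report.failed_tests)} test(s) failed"
    return f"{report.outcome}: {report.detail}"
```

Restricting `outcome` to a fixed set is what would let a consumer tell "test ran but failed" apart from "test didn't run" without reading the job log.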
ximion has quit [Quit: Detached from the Matrix]
vkareh has joined #freedesktop
Leopold__ has joined #freedesktop
Leopold_ has quit [Ping timeout: 480 seconds]
AbleBacon has joined #freedesktop
agd5f_ has joined #freedesktop
agd5f has quit [Ping timeout: 480 seconds]
Haaninjo has joined #freedesktop
blatant has quit [Ping timeout: 480 seconds]
blatant has joined #freedesktop
blatant has quit [Quit: WeeChat 3.8]
___nick___ has joined #freedesktop
___nick___ has quit []
___nick___ has joined #freedesktop
todi has joined #freedesktop
Gsac2909 has joined #freedesktop
Gsac2909 has quit [Remote host closed the connection]
Gsac has joined #freedesktop
Gsac has quit [Remote host closed the connection]
tzimmermann has quit [Quit: Leaving]
todi has quit [Remote host closed the connection]
rappet has quit [Quit: - Chat comfortably. Anywhere.]
pkira has quit [Ping timeout: 480 seconds]
rappet has joined #freedesktop
rappet has quit [Quit: - Chat comfortably. Anywhere.]
vkareh has quit [Quit: WeeChat 3.6]
rappet has joined #freedesktop
___nick___ has quit [Ping timeout: 480 seconds]
karolherbst is now known as karolherbst_
karolherbst_ is now known as karolherbst
rappet has quit [Quit: - Chat comfortably. Anywhere.]
rappet has joined #freedesktop
rappet has quit [Quit: - Chat comfortably. Anywhere.]
rappet has joined #freedesktop
ninja21859 has quit [Remote host closed the connection]
ninja21859 has joined #freedesktop
Haaninjo has quit [Quit: Ex-Chat]
Kayden has quit [Quit: -> insomnia]
sima has quit [Ping timeout: 480 seconds]
ximion has joined #freedesktop
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
Kayden has joined #freedesktop