ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
<bentiss>
daniels: ping... not sure if you had time to read my long writeup from yesterday, but still asking: anything against using Fedora CoreOS as the base distro for runners and k3s?
<emersion>
a bit worried about stability and ability for others to jump in
<emersion>
the current infra is a lot to learn, if new admins also need to learn coreos along the way…
<bentiss>
well... The more I think of it, the more I believe we should be running gitlab-runner in a container, so there is not much to learn IMO, it'll be a podman container controlled by a systemd unit, "almost" like today :)
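For illustration, a minimal sketch of what that could look like: a systemd unit that keeps gitlab-runner running as a podman container. The image reference, mount path, and unit layout are assumptions for the example, not the actual freedesktop.org configuration.

    [Unit]
    Description=gitlab-runner in a podman container (illustrative sketch)
    Wants=network-online.target
    After=network-online.target

    [Service]
    # Clean up any stale container left behind by an unclean shutdown.
    ExecStartPre=-/usr/bin/podman rm -f gitlab-runner
    # Image tag and config mount are assumptions, not the real setup.
    ExecStart=/usr/bin/podman run --name gitlab-runner --rm \
        -v /etc/gitlab-runner:/etc/gitlab-runner:Z \
        docker.io/gitlab/gitlab-runner:latest
    ExecStop=/usr/bin/podman stop -t 30 gitlab-runner
    Restart=always

    [Install]
    WantedBy=multi-user.target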
<bentiss>
emersion: and basically coreos is "do not install random packages"
<emersion>
is it really better than, say, just regular fedora?
<emersion>
k3s will not run in a container, right?
<mupuf>
emersion: from a security PoV, it means that if you get pwned, you can reboot and go back to the state you had
<bentiss>
emersion: yes: the set of deployed packages is tested beforehand, you can roll back to the previous installation in case anything goes wrong, and jumping between major versions is a regular update (that can be rolled back too)
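As a rough sketch of the workflow being described here (plain rpm-ostree usage, nothing freedesktop.org-specific):

    # Show the deployments on disk; the previous one stays available.
    rpm-ostree status

    # Fedora CoreOS normally updates itself in the background, but an
    # update can also be pulled by hand:
    rpm-ostree upgrade

    # If the new deployment breaks something, boot back into the old one:
    rpm-ostree rollback
    systemctl reboot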
<mupuf>
of course, you'll still need to fix the entry point
<bentiss>
emersion: I've been running silverblue (fedora coreos for desktop) for quite a few years now, and I wouldn't go back.
<emersion>
i've heard… not good things about silverblue
<emersion>
coreos is the hot new thing, fedora is boring -- i've learned that boring stuff is reliable stuff
<bentiss>
simply being able to test the new major, and roll back to the old one when it breaks one component, is something I couldn't live without now
<emersion>
anyways, just playing the devil's advocate here
<bentiss>
emersion: coreos was there before silverblue FWIW
<bentiss>
emersion: one other way to see it is: how many people are actually doing stuff on the runners? I know you guys do a service restart from time to time, or a reboot, but that's mainly it, no? (I don't do much more TBH)
<emersion>
i had to debug the runners in the past, and having them on coreos would be an additional barrier for me
<bentiss>
the runners are basically fire and forget, and when they are pwned we kill them and respin, so having an automatic update in the background is a nice thing too IMO
<emersion>
sorry, not the runner, the bare-metal host
<bentiss>
emersion: which bare-metal are you talking about?
<bentiss>
gabe?
<emersion>
no
<emersion>
the machines running k3s
<emersion>
the equinix debian boxes
<bentiss>
damn, I can't remember why you had to do this...
<emersion>
i must admit, it was a long time ago, i'm not doing much stuff now
<emersion>
k3s was super borked at some point
<bentiss>
well, TBH, as much as I like Fedora, I wouldn't put a production server on it, because things can go wrong quite easily. But Fedora CoreOS is different in that there is actual QA happening before the set of packages is released, meaning that the chances of a failing system are much lower
<bentiss>
(and you can rollback)
<bentiss>
And debian is nice, but way too old for the packages that we need: an up to date podman. I think that's the bit that is biting us on the current arm runners, and why we need to reboot them every once in a while
blatant has joined #freedesktop
<svuorela>
you get a new debian next month
<mupuf>
svuorela: good, but it will be out of date within 6 months ;)
<bentiss>
and we also need to wait for equinix to make the images available on their bare metal hosts
<bentiss>
which will take one year roughly
<mupuf>
and honestly, when Debian releases a new version that has unmaintained packages, it really drives home the fact that its policy of stability basically boils down to "let's not update"
<bentiss>
emersion: last time it was not available for aarch64
<mupuf>
yeah, PPAs are always the answer for debian-based systems... or enabling parts of testing/sid in stable... which no one actually really tested together
<bentiss>
emersion: I stand corrected: too old version
<bentiss>
emersion: that's the one I use on the aarch64 runners, and it's breaking badly every month or 2
<mupuf>
there's been over 100 releases since that point :D
<mupuf>
IMO, if they cared about providing a stable and secure system, they would not include outdated packages in a new release. Just drop it in unstable, then testing, then stable. If no one has updated the package by then, then you truly don't have the bandwidth to deal with patching it when security issues are raised
<emersion>
i'd be the first to say that i don't like debian, if it was up to me, i'd use alpine x)
<emersion>
i use alpine in production with kubernetes
<emersion>
the DNS-over-TCP issue is addressed btw
<bentiss>
oh, good to know then :)
<bentiss>
still, using alpine also has the problem of "let's pull random packages, and hope for the best"
<emersion>
you shouldn't need to pull random packages
<bentiss>
well, you can't control the versions of the packages or check how they interact with each other
<bentiss>
which is what coreos, silverblue, chromeos are doing: you get a set of packages that are tested, and you can trust the deployment
<emersion>
it would be like debian, except not outdated
<bentiss>
I'm not talking about grabbing random packages, but random package versions
<emersion>
but i wasn't really recommending alpine here, just mentioning what i'm familiar with
<bentiss>
like debian or fedora
<bentiss>
emersion: and I appreciate you are giving your opinion
Leopold_ has joined #freedesktop
pkira has joined #freedesktop
<DragoonAethis>
bentiss: on the other hand, I played with the idea of immutable OSes and liked it a lot
dcunit3d has joined #freedesktop
<DragoonAethis>
As long as all workloads are containerized properly, everything just ends up working most of the time without much manual poking
<bentiss>
DragoonAethis: yes, that's what I believe too
<DragoonAethis>
Even better, it forces people not to quickly hack up something on the host because it's easier and surely will be cleaned up later (which we're struggling with in our team, a lot)
<Venemo>
MrCooper, daniels: I know marge is just the messenger. what I mean is, could we get it to show a different message when the problem was in the CI system and NOT with the MR
<eric_engestrom>
fwiw on my dev machines I prefer the arch way of always using the latest version of everything and updating piecemeal all the time, but for servers I think the image model (eg. coreos) where you get one big thing where everything is tied together and you swap that for the next one is really good (and rolling back is the exact same operation, just picking the other image, which is really valuable to be safe to try updating regularly and not have to wait for others to have hit issues and fixed them)
<eric_engestrom>
Venemo: what do you mean? what would be the MR problems if not CI (except "can't rebase because there's a conflict")?
<Venemo>
eric_engestrom: I mean for example when the CI system fails without actually running any tests
<eric_engestrom>
if you mean the marge bug where she looks at the wrong pipeline, if we could detect that to print a different message we could also make it look at the other pipeline :]
<Venemo>
I don't mean a marge bug
<Venemo>
I mean the type of bug that was discussed above
<eric_engestrom>
hold on sorry, I guess I skimmed too fast; reading back
<Venemo>
sometimes a CI job just doesn't even start the CTS and simply returns a failure
<Venemo>
it has happened many times by now
<Venemo>
and it's annoying because every time I have to dig into the logs and then I realize it's not my mistake
<eric_engestrom>
ah, you mean "test didn't run" (eg. runner unresponsive, etc.) vs "test ran but failed"?
<eric_engestrom>
there's a feature request somewhere in gitlab's issues to add support for differentiating failures (using different exit codes), but iirc gitlab hasn't started considering adding this yet
<eric_engestrom>
I don't know if other CI systems (eg. GitHub Actions) support multiple kinds of failures
<eric_engestrom>
but I agree that it would be nice if it was possible to report more than "pass/fail" (or "pass/warn" when configured to ignore failures)
<eric_engestrom>
a workaround that should be possible is to add a list of regex -> message pairs in marge, where it reads the log of every job that failed and, if a regex matches, prints the extra message
<eric_engestrom>
(I don't know marge's codebase at all so I don't know how easy/hard it would be to implement)
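A sketch of that workaround, with made-up patterns and function names (marge-bot's actual code is not structured like this): walk the logs of the failed jobs and attach a canned explanation when a known pattern matches. The logs themselves are available through GitLab's job trace API endpoint.

    import re

    # Hypothetical pattern -> message table; the regexes here are invented.
    KNOWN_INFRA_FAILURES = [
        (re.compile(r"Job failed \(system failure\)"),
         "the job died before running anything; this looks like a CI infrastructure problem, not the MR"),
        (re.compile(r"No route to host|connection timed out", re.I),
         "the runner could not reach the network/device; retrying is probably enough"),
    ]

    def extra_messages(failed_job_logs):
        """failed_job_logs: iterable of (job_name, log_text) for each failed job."""
        messages = []
        for job_name, log_text in failed_job_logs:
            for pattern, message in KNOWN_INFRA_FAILURES:
                if pattern.search(log_text):
                    messages.append(f"{job_name}: {message}")
                    break
        return messages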
<Venemo>
daniels: that would be yet another different case
<daniels>
mupuf: might be worth replicating the structured data artifact stuff inside b2c; we put that there so we could do automated analysis of what happened inside the job
<daniels>
including surfacing through Marge ‘this test failed’, ‘100 tests failed and I gave up at this point’, ‘the device kept hanging’, etc
<mupuf>
daniels: oh, do you have a link?
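A sketch of the idea with an invented schema (the real structured-data artifact format may look nothing like this): the job writes a small machine-readable summary as an artifact, and tooling such as Marge can turn it into a one-line verdict without grepping raw logs.

    # Summary as the job might write it, e.g. to results/summary.json
    # (path and field names are assumptions).
    summary = {
        "reason": "device-hang",   # or "tests-failed", "infra-failure", ...
        "tests_failed": 100,
        "gave_up_early": True,
    }

    def describe(summary: dict) -> str:
        """Turn the structured summary into a short human-readable message."""
        if summary["reason"] == "device-hang":
            return "the device kept hanging"
        if summary["reason"] == "tests-failed":
            n = summary.get("tests_failed", 0)
            if summary.get("gave_up_early"):
                return f"{n} tests failed and the job gave up at that point"
            return f"{n} tests failed"
        return "the job failed before running any tests"

    print(describe(summary))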
dcunit3d has quit [Remote host closed the connection]