#freedesktop on 2023-05-10 — irc logs at oftc.irclog.whitequark.org

2022-12-21 00:45 ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org

00:48 co1umbarius has joined #freedesktop

00:49 columbarius has quit [Ping timeout: 480 seconds]

01:24 <daniels> Venemo: …

01:50 agd5f has joined #freedesktop

01:55 agd5f_ has quit [Ping timeout: 480 seconds]

02:33 jarthur has quit [Quit: Textual IRC Client: www.textualapp.com]

02:36 johnandmegh has joined #freedesktop

02:41 jarthur has joined #freedesktop

03:37 krushia has joined #freedesktop

03:40 ximion has quit [Quit: Detached from the Matrix]

04:23 craftyguy has quit [Ping timeout: 480 seconds]

04:32 damian has quit [Read error: Connection reset by peer]

04:33 damian has joined #freedesktop

04:57 johnandmegh has quit []

06:20 craftyguy has joined #freedesktop

06:59 <dj-death> anybody knows what's up with navi21 builders?

07:02 feto_bastardo has quit [Quit: Ping timeout (120 seconds)]

07:03 feto_bastardo has joined #freedesktop

07:03 <MrCooper> Venemo: Marge isn't actively involved with the CI in any way, she's just the messenger

07:08 <daniels> dj-death: that’s one for mupuf

07:08 <mupuf> dj-death: hmm, let me check

07:11 <mupuf> I see, 100% failure rate in the past night!

07:11 <mupuf> let me reboot the gateway!

07:11 tzimmermann has joined #freedesktop

07:23 AbleBacon has quit [Read error: Connection reset by peer]

07:23 <mupuf> it's back up

07:23 <mupuf> and seems to be working

07:23 <mupuf> sorry about this

07:36 slomo has quit [Quit: The Lounge - https://thelounge.chat]

07:36 slomo has joined #freedesktop

07:37 <mupuf> For the record, this failure mode had never been seen before... which is why my monitoring did not alert me

07:37 <mupuf> this should be fixed now

07:40 <mupuf> \o/ * keywords-gfx10-navi21-3: EXECUTOR_SETUP_FAIL, EXECUTOR_DOWN - https://gitlab.freedesktop.org/mesa/mesa/-/jobs/41442807

07:40 <mupuf> caught in two ways now :D

07:51 <bentiss> daniels: ping... not sure if you had time to read my long writeup from yesterday, but still asking: anything against using Fedora CoreOS as the base distro for runners and k3s?

07:52 <emersion> a bit worried about stability and ability for others to jump in

07:53 <emersion> the current infra is a lot to learn, if new admins also need to learn coreos along the way…

07:54 <bentiss> well... The more I think of it, the more I believe we should be running gitlab-runner in a container, so there is not much to learn IMO, it'll be a podman container controlled by a systemd unit, "almost" like today :)

07:55 <bentiss> emersion: and basically coreos is "do not install random packages"

07:56 <emersion> is it really better than, say, just regular fedora?

07:57 <emersion> k3s will not run in a container, right?

07:57 <mupuf> emersion: from a security PoV, it means that if you get pwned, you can reboot and go back to the state you had

07:57 <bentiss> emersion: yes: the set of deployed packages is tested beforehand, you can roll back to the previous installation in case anything goes wrong, and jumping between major versions is a regular update (that can then be rolled back)

07:57 <mupuf> of course, you'll still need to fix the entry point

07:58 <bentiss> emersion: I've been running silverblue (fedora coreos for desktop) for quite a few years now, and I wouldn't go back.

07:58 <emersion> i've heard… not good things about silverblue

07:58 <emersion> coreos is the hot new thing, fedora is boring -- i've learned that boring stuff is reliable stuff

07:59 <bentiss> simply being able to test the new major, and roll back to the old one when it breaks one component, is something I couldn't live without now

07:59 <emersion> anyways, just playing the devil's advocate here

07:59 <bentiss> emersion: coreos was there before silverblue FWIW

08:00 <bentiss> emersion: one other way to look at it is: how many people are actually doing stuff on the runners? I know you guys do a service restart from time to time, or a reboot, but that's mainly it, no? (I don't do much more TBH)

08:01 <emersion> i had to debug the runners in the past, and having them on coreos would be an additional barrier for me

08:01 <bentiss> the runners are basically fire and forget, and when they are pwned we kill them and respin, so having an automatic update in the background is a nice thing too IMO

08:02 <emersion> sorry not the runner, the bare-matal host

08:02 <emersion> metal*

08:02 <bentiss> emersion: which bare-metal are you talking about?

08:02 <bentiss> gabe?

08:02 <emersion> no

08:02 <emersion> the machines running k3s

08:02 <emersion> the quinix debian boxes

08:03 <emersion> equinix

08:03 <bentiss> damn, I can't remember why you had to do this...

08:03 <emersion> i must admit, it was a long time ago, i'm not doing much stuff now

08:03 <emersion> k3s was super borked at some point

08:06 <bentiss> well, TBH, as much as I like Fedora, I wouldn't put a production server on it. Because things can go wrong quite easily. But Fedora CoreOS is different in that there is actual QA happening before the set of packages is released, meaning that the chances of a failing system are much lower

08:06 <bentiss> (and you can roll back)

08:07 <bentiss> And debian is nice, but way too old for the packages that we need: up to date podman. I think that's the bit that is biting us on the current arm runners, and why we need to reboot them every once in a while

08:09 blatant has joined #freedesktop

08:10 <svuorela> you get a new debian next month

08:10 <mupuf> svuorela: good, but it will be out of date within 6 months ;)

08:11 <bentiss> and we also need to wait for equinix to make the images available on their bare metal hosts

08:11 <bentiss> which will take one year roughly

08:12 <mupuf> and honestly, when Debian releases a new version that has unmaintained packages, it really drives home the fact that its policy of stability basically boils down to "let's not update"

08:12 <mupuf> unmaintained packages upstream*

08:13 <emersion> if we want to stick to debian, the answer would be something like https://software.opensuse.org/download/package?package=podman&project=devel%3Akubic%3Alibcontainers%3Astable

08:15 <bentiss> emersion: last time it was not available for aarch64

08:15 <mupuf> yeah, PPAs are always the answer for debian-based systems... or enabling parts of testing/sid in stable... which no one has actually tested together

08:15 <bentiss> emersion: I stand corrected: too old version

08:16 <bentiss> emersion: that's the one I use on the aarch64 runners, and it's breaking badly every month or 2

08:17 <bentiss> https://gitlab.freedesktop.org/freedesktop/helm-gitlab-infra/-/blob/main/gitlab-runner-provision/generate-cloud-init.py#L202

08:18 <bentiss> and FTR, podman allows us to use harbor to reduce the registry egress costs by *a lot* (but we are still paying $1400 a month)

08:20 <mupuf> https://packages.debian.org/unstable/influxdb <-- how can they still ship influxdb 1.6.7 (which BTW, never existed upstream), and never bothered to update past that?

08:21 <mupuf> there's been over 100 releases since that point :D

08:26 <mupuf> IMO, if they cared about providing a stable and secure system, they would not include outdated packages in a new release. Just drop it in unstable, then testing, then stable. If no one has updated the package by then, then you truly don't have the bandwidth to deal with patching the package when security issues are raised

08:27 <emersion> i'd be the first to say that i don't like debian, if it was up to me, i'd use alpine x)

08:29 <bentiss> emersion: https://martinheinz.dev/blog/92 -> not fit for kubernetes

08:29 <emersion> i use alpine in production with kubernetes

08:30 <emersion> the DNS-over-TCP issue is addressed btw

08:30 <bentiss> oh, good to know then :)

08:31 <bentiss> still, using alpine also has the problem of "let's pull random packages, and hope for the best"

08:31 <emersion> you shouldn't need to pull random packages

08:32 <bentiss> well, you can't control the versions of the packages and check how they interact with each other

08:32 <bentiss> which is what coreos, silverblue, chromeos are doing: you get a set of packages that are tested, and you can trust the deployment

08:32 <emersion> it would be like debian, except not outdated

08:33 <bentiss> I'm not talking about grabbing random packages, but random package versions

08:33 <emersion> but i wasn't really recommending alpine here, just mentioning what i'm familiar with

08:33 <bentiss> like debian or fedora

08:33 <bentiss> emersion: and I appreciate you are giving your opinion

08:41 Leopold_ has joined #freedesktop

09:09 pkira has joined #freedesktop

09:46 <DragoonAethis> bentiss: on the other hand, I played with the idea of immutable OSes and liked it a lot

09:47 dcunit3d has joined #freedesktop

09:47 <DragoonAethis> As long as all workloads are containerized properly, everything just ends up working most of the time without much manual poking

09:48 <bentiss> DragoonAethis: yes, that's what I believe too

09:48 <DragoonAethis> Even better, it forces people not to quickly hack up something on the host because it's easier and surely will be cleaned up later (which we're struggling with in our team, a lot)

09:49 <Venemo> MrCooper, daniels I know marge is just the messenger. what I mean is, could we get it to show a different message when the problem was in the CI system and with the MR

09:50 <Venemo> MrCooper, daniels: I know marge is just the messenger. what I mean is, could we get it to show a different message when the problem was in the CI system and NOT with the MR

09:50 <eric_engestrom> fwiw on my dev machines I prefer the arch way of always using the latest version of everything and updating piecemeal all the time, but for servers I think the image model (eg. coreos) where you get one big thing where everything is tied together and you swap that for the next one is really good (and rolling back is the exact same operation, just picking the other image, which is really valuable to be safe to try

09:50 <eric_engestrom> updating regularly and not have to wait for others to have hit issues and fixed them)

09:51 <eric_engestrom> Venemo: what do you mean? what would be the MR problems if not CI (except "can't rebase because there's a conflict")?

09:53 <Venemo> eric_engestrom: I mean for example when the CI system fails without actually running any tests

09:53 <eric_engestrom> if you mean the marge bug where she looks at the wrong pipeline, if we could detect that to print a different message we could also make it look at the other pipeline :]

09:54 <Venemo> I don't mean a marge bug

09:54 <Venemo> I mean the type of bug that was discussed above

09:54 <eric_engestrom> hold on sorry, I guess I skimmed too fast; reading back

09:54 <Venemo> sometimes a CI job just doesn't even start the CTS and simply returns a failure

09:55 <Venemo> it has happened many times by now

09:55 <Venemo> and it's annoying because every time I have to dig into the logs and then I realize it's not my mistake

09:56 <eric_engestrom> ah, you mean "test didn't run" (eg. runner unresponsive, etc.) vs "test ran but failed"?

09:57 <Venemo> I mean for example these: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22885

09:57 <eric_engestrom> gitlab reports these as failures either way, and I'm not sure it has any additional field to differentiate them that marge could read

09:58 <MrCooper> yeah, all Marge knows is "CI pipeline green/red"

09:58 <eric_engestrom> yeah https://gitlab.freedesktop.org/mesa/mesa/-/jobs/41426757 is "test ran and failed" as far as gitlab knows

09:59 <eric_engestrom> there's a feature request somewhere in gitlab's issues to add support for differentiating failures (using different exit codes), but iirc gitlab hasn't started considering adding this yet

10:00 <eric_engestrom> I don't know if other CI systems (eg. GitHub Actions) support multiple kinds of failures

10:01 <eric_engestrom> but I agree that it would be nice if it was possible to report more than "pass/fail" (or "pass/warn" when configured to ignore failures)

10:07 <eric_engestrom> a workaround that should be possible is to add a list of regex -> message pairs in marge where it reads the log of every job that failed and if a regex matches it prints the extra message

10:07 <eric_engestrom> (I don't know marge's codebase at all so I don't know how easy/hard it would be to implement)
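
The regex -> message workaround eric_engestrom describes could look roughly like the sketch below. It is only an illustration and is not taken from Marge's codebase: the names `INFRA_PATTERNS`, `extra_messages` and `annotate_failures` are hypothetical, and the two patterns reuse the EXECUTOR_SETUP_FAIL and EXECUTOR_DOWN keywords from mupuf's alert earlier in this log on the assumption that such strings would also show up in the job logs.

```python
import re

# Hypothetical regex -> message pairs. The two keywords are borrowed from the
# navi21 alert quoted earlier in this log, not from any real Marge config.
INFRA_PATTERNS = [
    (re.compile(r"EXECUTOR_SETUP_FAIL"), "executor setup failed before tests ran"),
    (re.compile(r"EXECUTOR_DOWN"), "executor was down"),
]


def extra_messages(job_log):
    """Return the message of every pattern that matches the given job log."""
    return [msg for pattern, msg in INFRA_PATTERNS if pattern.search(job_log)]


def annotate_failures(failed_job_logs):
    """Map job name -> list of extra messages, skipping jobs with no match."""
    notes = {}
    for name, log in failed_job_logs.items():
        messages = extra_messages(log)
        if messages:
            notes[name] = messages
    return notes
```

Fetching the logs of the failed jobs is left out on purpose; a caller would pass in a dict of job name to log text and append any returned messages to the usual "CI failed" comment.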

10:16 vivia has quit [Quit: leaving]

10:16 slomo has quit [Quit: The Lounge - https://thelounge.chat]

10:18 slomo has joined #freedesktop

10:22 <Venemo> MrCooper, eric_engestrom: yes, that is exactly what I'm suggesting. if gitlab doesn't support that, that's unfortunate

11:05 <mupuf> It wouldn't be a bad idea for Marge to detect some transient errors and auto-retry some jobs though

11:07 <mupuf> but it can be hard to distinguish genuine infra flakiness from a regression, especially automatically

11:08 <mupuf> and infra instabilities are best handled by... fixing the issues

11:08 <mupuf> although the gitlab ones are a bit hard to address :D

11:08 <mupuf> but anyway, something along those lines could be investigated

11:14 <daniels> Venemo: unless the MR is at fault because it makes jobs run for way too long …

11:14 ximion has joined #freedesktop

11:14 <daniels> bentiss: yep I’ll reply shortly, thanks :)

11:15 <bentiss> daniels: thx!

11:16 <Venemo> daniels: that would be yet another different case

11:17 <daniels> mupuf: might be worth replicating the structured data artifact stuff inside b2c; we put that there so we could do automated analysis of what happened inside the job

11:18 <daniels> including surfacing through Marge ‘this test failed’, ‘100 tests failed and I gave up at this point’, ‘the device kept hanging’, etc

11:18 <mupuf> daniels: oh, do you have a link?

11:19 dcunit3d has quit [Remote host closed the connection]

11:19 <daniels> fwiw most of Collabora is currently here this week, so it’s firefighting only until we get back https://usercontent.irccloud-cdn.com/file/VgukGTE3/IMG_1850.JPG

11:19 <daniels> mupuf: errr there was an MR I thought you were CCed on

11:19 <mupuf> oh... sorry about that if this was the case

11:20 <mupuf> no worries, I'll look it up!

11:20 <Venemo> daniels: nice, enjoy the beach :) may I ask where it is?

11:20 dcunit3d has joined #freedesktop

11:20 <daniels> oh you weren’t CCed, that’s probably why you didn’t comment … https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22500

11:20 <daniels> next step is getting JSON out of deqp-runner so we can include that per-run

11:21 <daniels> Venemo: Faro, southern Portugal - and we have been, thanks :)

11:21 <Venemo> cool

11:21 <daniels> slightly better than British beaches …

11:22 <Venemo> Portuguese beaches are not the hottest yeah

11:23 <bentiss> daniels: nice! enjoy!

11:24 <daniels> thanks!

11:27 <mupuf> daniels: enjoy your time!

11:29 <mupuf> ok, so what you want is an execution report in JSON format

11:31 <mupuf> would be good to document the format and what we want there, if we want machines to start parsing it

11:36 ximion has quit [Quit: Detached from the Matrix]

11:47 vkareh has joined #freedesktop

11:58 Leopold__ has joined #freedesktop

12:04 Leopold_ has quit [Ping timeout: 480 seconds]

13:12 AbleBacon has joined #freedesktop

13:40 agd5f_ has joined #freedesktop

13:46 agd5f has quit [Ping timeout: 480 seconds]

14:20 Haaninjo has joined #freedesktop

14:33 blatant has quit [Ping timeout: 480 seconds]

14:40 blatant has joined #freedesktop

14:48 blatant has quit [Quit: WeeChat 3.8]

15:47 ___nick___ has joined #freedesktop

15:50 ___nick___ has quit []

15:52 ___nick___ has joined #freedesktop

16:04 todi has joined #freedesktop

16:12 Gsac2909 has joined #freedesktop

16:12 Gsac2909 has quit [Remote host closed the connection]

16:12 Gsac has joined #freedesktop

16:12 Gsac has quit [Remote host closed the connection]

16:25 tzimmermann has quit [Quit: Leaving]

16:49 todi has quit [Remote host closed the connection]

18:40 rappet has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]

18:47 pkira has quit [Ping timeout: 480 seconds]

19:10 rappet has joined #freedesktop

19:16 rappet has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]

19:32 vkareh has quit [Quit: WeeChat 3.6]

19:52 rappet has joined #freedesktop

20:07 ___nick___ has quit [Ping timeout: 480 seconds]

20:27 karolherbst is now known as karolherbst_

20:27 karolherbst_ is now known as karolherbst

20:27 rappet has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]

20:29 rappet has joined #freedesktop

21:06 rappet has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]

21:08 rappet has joined #freedesktop

21:15 ninja21859 has quit [Remote host closed the connection]

21:16 ninja21859 has joined #freedesktop

21:40 Haaninjo has quit [Quit: Ex-Chat]

21:44 Kayden has quit [Quit: -> insomnia]

22:28 sima has quit [Ping timeout: 480 seconds]

22:34 ximion has joined #freedesktop

23:38 alanc has quit [Remote host closed the connection]

23:39 alanc has joined #freedesktop

23:47 Kayden has joined #freedesktop