ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
Haaninjo has quit [Quit: Ex-Chat]
<DavidHeidelberg[m]> We need s3cp to gain some resilience against DNS and HTTP errors (codes like 50x). 3 retries could be nice. (just looking at a job which failed with 503 on s3cp, and it looks like the first try)
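A minimal sketch of the kind of retry loop being asked for here, assuming the s3cp upload can be wrapped as a callable; the names and retry/backoff numbers are illustrative, not s3cp's actual interface:

    import time

    def upload_with_retries(upload, attempts=3, delay=5):
        # Retry a flaky upload a few times before giving up. `upload` is any
        # callable that raises on DNS failures or 50x responses.
        for attempt in range(1, attempts + 1):
            try:
                return upload()
            except Exception as exc:
                if attempt == attempts:
                    raise
                print(f"upload failed ({exc}), retrying in {delay}s ({attempt}/{attempts})")
                time.sleep(delay)
                delay *= 2  # simple exponential backoff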
miracolix has joined #freedesktop
Leopold has quit [Remote host closed the connection]
Leopold has joined #freedesktop
ybogdano has quit [Ping timeout: 480 seconds]
<airlied> aarrggh a bunch of 500 killed two jobs now
ppascher has joined #freedesktop
genpaku has quit [Remote host closed the connection]
genpaku has joined #freedesktop
Leopold has quit []
Leopold_ has joined #freedesktop
agd5f_ has joined #freedesktop
agd5f has quit [Ping timeout: 480 seconds]
jarthur has quit [Ping timeout: 480 seconds]
GNUmoon has quit [Remote host closed the connection]
GNUmoon has joined #freedesktop
agd5f has joined #freedesktop
itoral has joined #freedesktop
agd5f_ has quit [Ping timeout: 480 seconds]
agd5f_ has joined #freedesktop
AbleBacon has quit [Read error: Connection reset by peer]
i-garrison has quit [Ping timeout: 480 seconds]
agd5f has quit [Ping timeout: 480 seconds]
ximion has quit []
agd5f has joined #freedesktop
agd5f_ has quit [Ping timeout: 480 seconds]
i-garrison has joined #freedesktop
<bentiss> daniels: https://gitlab.freedesktop.org/bolt/bolt/-/jobs/34918634 -> bolt is also using docker:dind
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
miracolix has quit [Remote host closed the connection]
agd5f_ has joined #freedesktop
scrumplex_ has quit []
scrumplex has joined #freedesktop
agd5f has quit [Ping timeout: 480 seconds]
danvet has joined #freedesktop
<bentiss> FWIW, to solve those 500 I probably need to restart the machines
<bentiss> and we have a security update pending, so I might as well do that now
<mupuf> bentiss: yeah, sounds like a good idea
<bentiss> also doing the k3s upgrade first, it's always good to have this too
Leopold_ has quit []
Leopold_ has joined #freedesktop
<bentiss> k3s upgraded to 1.24.9, with a bunch of CVE fixes, and now starting the rolling reboot of all machines; gitlab errors are expected
<hakzsam> CI seems to have a problem here https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20729 ?
<bentiss> hakzsam: see above, rebooting all the kubernetes nodes ATM
<hakzsam> ok
<bentiss> (and I also messed up and rebooted again one node I just rebooted, so that's 2 nodes down right now)
<hakzsam> np, I will wait
<bentiss> FWIW, it takes roughly an hour, and there is also a gitlab security update pending
<Venemo> is gitlab down right now, or is my internet connection bad?
<Venemo> oh I see, it's being restarted
<bentiss> Venemo: yeah, normally it should have been better, but I mistakenly rebooted 2 control plane servers instead of 1, and that's too much for kubernetes to be happy
<Venemo> okay, no worries
<daniels> bentiss: sounds like a rough morning, sorry to hear, but thank you for all the updates
<daniels> DavidHeidelberg[m]: yeah ci-fairy definitely needs resilience; can you please do the same to that as you did to curl?
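One common way to get that resilience in a Python tool like ci-fairy is to mount an urllib3 Retry policy on the requests session; this is only a sketch of the pattern (assuming requests and urllib3 >= 1.26), not ci-fairy's actual code:

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    def make_resilient_session(retries=3, backoff=2):
        # Retry transient server errors (500/502/503/504) with exponential backoff.
        retry = Retry(
            total=retries,
            backoff_factor=backoff,
            status_forcelist=(500, 502, 503, 504),
            allowed_methods=None,  # None = retry all methods, including uploads
        )
        session = requests.Session()
        session.mount("https://", HTTPAdapter(max_retries=retry))
        session.mount("http://", HTTPAdapter(max_retries=retry))
        return session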
<bentiss> daniels: not so rough, while I do the update/reboot, I tend to not have a need to use gitlab, so it doesn't change anything for me :)
<daniels> heh
<bentiss> interesting... it seems that as long as the node is cordoned, ceph does not try to recover the data that disappeared, meaning that when it comes back, we have less rebalancing to do
<bentiss> I still have 2 machines to reboot, but I'd prefer having ceph in a clean state before doing those reboots. Maybe it's time for the gitlab update then :)
mvlad has joined #freedesktop
<bentiss> that upgrade was easy. it's already done
<mupuf> wonderful, thanks a lot!
<bentiss> it still puzzles me how you can reboot/upgrade/break machines and people can still continue to do their work :)
<bentiss> (with some errors, for sure, but still)
<mupuf> The magic of load balancing, I guess
<bentiss> yeah, and kubernetes resilience
<mupuf> and I am sure the bigger orgs would be able to tell the load balancer to stop using a node first, then wait for all active connections to be over before rebooting the node and adding it back
<mupuf> to make it fully transparent
<bentiss> well, they probably also have more redundancy than we have
<mupuf> as for rebooting the load balancer... I guess they do their fuckery with having two machines with the same IP?
<bentiss> because even if ceph is nice, I couldn't configure it to have multiple pods accessing the same disk at the same time
<bentiss> the load balancer issue is solved in 2 ways: add more control plane (3 is the minimum), and BGP to assign the IPs.
<bentiss> we cannot realistically add more machines because they are provided to us for free, and I don't have enough control over BGP to quickly reassign the IP to another machine
<bentiss> and HEALTH_OK as we discuss, time for another reboot
<kusma> Marge bot says she's broken on the inside...
<bentiss> kusma: see above, kind of a bad timing
<kusma> bentiss: But she said it 45 minutes ago and an hour ago as well...
<bentiss> kusma: yes, which is approximately the time I started doing the reboot/upgrades
<kusma> Aha, makes sense then :)
MajorBiscuit has joined #freedesktop
<kusma> How long until things are expected to come up again? Marge seems to still be out...
<bentiss> and starting the reboot process for the last machine
<bentiss> kusma: hopefully 10-15 min
<kusma> OK, cool. Thanks :)
<bentiss> kusma: what actually happened was that as I was rebooting the nodes, marge was getting pushed to a node that was not rebooted yet, so it kept getting kicked out
___nick___ has joined #freedesktop
kusma has quit [Quit: Reconnecting]
kusma has joined #freedesktop
<bentiss> alright, all machines have been rebooted, all disks are back and ceph is HEALTH_OK
kusma has quit []
kusma has joined #freedesktop
<__tim> bah, seconds before my job was about to finish :)
MajorBiscuit has quit [Ping timeout: 480 seconds]
<bentiss> upgrading harbor to 2.7.0 right now
<gkiagia> is it normal that I can't use git over ssh right now? is some service down or did DNS change or something like that?
<eric_engestrom> gkiagia: there was a reboot earlier; retry now (works for me right now)
<gkiagia> I get ssh_dispatch_run_fatal: Connection to 147.75.198.156 port 22: incorrect signature
<bentiss> gkiagia: assuming you are talking about gitlab project, see above
<bentiss> gkiagia: which project?
<gkiagia> pipewire
<gkiagia> pipewire/pipewire and pipewire/wireplumber
<bentiss> gkiagia: both work here
<gkiagia> hm, ok, I'll troubleshoot on my end
<bentiss> gkiagia: it could be related to the gitlab 15.7.5 upgrade from this morning
<bentiss> (we were in 15.7.2 previously)
<bentiss> maybe they made changes in the ssh key they accept, I haven't checked the full logs
Leopold_ has quit [Remote host closed the connection]
Leopold has joined #freedesktop
MajorBiscuit has joined #freedesktop
Leopold has quit [Remote host closed the connection]
Leopold has joined #freedesktop
ximion has joined #freedesktop
agd5f has joined #freedesktop
agd5f_ has quit [Ping timeout: 480 seconds]
Leopold has quit [Remote host closed the connection]
Leopold has joined #freedesktop
itoral has quit [Remote host closed the connection]
Leopold has quit [Remote host closed the connection]
Leopold_ has joined #freedesktop
<bentiss> daniels: so I switched all the x86 runners to podman now. And while trying the arm ones... the script is not in good shape, no?
<daniels> bentiss: it should be fine? I redid them all not too long ago
<bentiss> well, AFAICT they must have updated the image, so on first boot it doesn't update the kernel or regenerate the initramfs, and the /boot/initrd symlink is not present
<daniels> 2022-10-28T15:12:43Z
<daniels> damn ...
<daniels> but yeah, I recreated all the Arm ones <3mo ago and updated the script to fit anything that was required then
<bentiss> yeah, and thanks for that.
vsyrjala_ is now known as vsyrjala
<daniels> I can take a look at what's going wrong later this afternoon
<bentiss> daniels: I'm at the point where I can also spend some time on that too
agd5f_ has joined #freedesktop
ndufresne has quit [Excess Flood]
ndufresne has joined #freedesktop
agd5f has quit [Ping timeout: 480 seconds]
agd5f_ has quit [Ping timeout: 480 seconds]
agd5f has joined #freedesktop
ximion has quit []
agd5f_ has joined #freedesktop
agd5f has quit [Ping timeout: 480 seconds]
agd5f has joined #freedesktop
agd5f_ has quit [Ping timeout: 480 seconds]
<bentiss> daniels: alright, all done. All of the runners are now deploying fine with the scripts, and are using podman (as root). I'll need to work on docker-free-space to ensure basic cleanup and rootless containers, but maybe not today
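For reference, a rough sketch of what a docker-free-space style cleanup could look like on a podman runner; the threshold and storage path here are assumptions, not what the real script uses:

    import shutil
    import subprocess

    MIN_FREE_BYTES = 20 * 1024**3  # assumed threshold: 20 GiB

    def prune_if_low(storage="/var/lib/containers"):
        # Prune unused images when the container storage volume runs low on space.
        if shutil.disk_usage(storage).free < MIN_FREE_BYTES:
            # "podman image prune -a -f" removes all images not used by a container
            subprocess.run(["podman", "image", "prune", "-a", "-f"], check=True)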
Leopold_ has quit [Ping timeout: 480 seconds]
MajorBiscuit has quit [Quit: WeeChat 3.6]
<daniels> bentiss: oh wow, nice, thank you!
___nick___ has quit []
Leopold_ has joined #freedesktop
<bentiss> hopefully we'll see some improvements on the registry pulls now that all of our runners are using harbor as a cache
ybogdano has joined #freedesktop
<daniels> \o/
___nick___ has joined #freedesktop
<bentiss> we got 84 GB of images since yesterday, and I wonder how much we will have in a week. I set the "keep the images since the last pull" retention to 14 days, and we'll see how it goes after a week
<DavidHeidelberg[m]> btw, I hope you guys are happy when I clean up my registry images once a month, haha.
___nick___ has joined #freedesktop
<bentiss> DavidHeidelberg[m]: as long as it pulls the images only once, it should be fine
<DavidHeidelberg[m]> bentiss: but it has to be stored somewhere anyway, right?
<bentiss> DavidHeidelberg[m]: but you could also point it at harbor.freedesktop.org/cache, this way it won't add to the GCS bill
<bentiss> DavidHeidelberg[m]: the problem we have is that the runners are pulling the images twice per job, directly from GCS, which tends to get quite costly
<bentiss> so if you pull it once a month, that's fine
<DavidHeidelberg[m]> s3 is not on GCS, right?
<DavidHeidelberg[m]> bentiss: do you have some picture of what is where?
<bentiss> s3 is not, no
<bentiss> heh, I was seeing another pull from registry-mirror... and started to investigate just to find out that it was my local CI pulling images from there :)
<bentiss> well, vm2c
bl4ckb0ne has joined #freedesktop
<bl4ckb0ne> hi there
<bl4ckb0ne> hitting python issues on piglit CI
<bl4ckb0ne> > ERROR: Job failed: failed to pull image "python:3.10" with specified policies [always]: initializing source docker://python:3.10: reading manifest 3.10 in quay.io/libpod/python: manifest unknown: manifest unknown (manager.go:237:0s)
<bl4ckb0ne> same happens from python 3.7 to 3.10
<bentiss> bl4ckb0ne: that one is on me I guess
miracolix has joined #freedesktop
<bl4ckb0ne> thanks
<bentiss> let me disable the docker mirrors, might just be easiest
<bentiss> bl4ckb0ne: should be good now. I have restarted the failed jobs manually
<bl4ckb0ne> thanks
<bl4ckb0ne> pipeline fixed o/
agd5f_ has joined #freedesktop
agd5f has quit [Ping timeout: 480 seconds]
<__tim> no worries, thanks for all your work on the infrastructure :)
agd5f has joined #freedesktop
agd5f_ has quit [Ping timeout: 480 seconds]
agd5f_ has joined #freedesktop
<daniels> ^^^^^
agd5f has quit [Ping timeout: 480 seconds]
<DavidHeidelberg[m]> while s3cp retry support could help, I'm seeing 503 failures even with 3 retries (over a proxy though): https://gitlab.freedesktop.org/mesa/mesa/-/jobs/34956436
agd5f_ has quit [Ping timeout: 480 seconds]
<daniels> DavidHeidelberg[m]: tbf they're only 7 seconds apart ...
agd5f has joined #freedesktop
<DavidHeidelberg[m]> wait_retry = 32 :/ by the last try it should wait
<DavidHeidelberg[m]> daniels: should I write MR for the wget -> curl switch?
<daniels> hopefully that's not in ms :P
<daniels> DavidHeidelberg[m]: yeah please
<DavidHeidelberg[m]> daniels: --waitretry=seconds but who knows.....
<daniels> though given that involves a full rebuild (i.e. pain), I suspect it would be better to do the ci-fairy side first, then aggregate the curl switch into one MR with a ci-templates uprev, so we only have to go through the full thing once
<DavidHeidelberg[m]> makes sense
<anholt> failure I'm suspicious of since container storage has been played with: https://gitlab.freedesktop.org/anholt/mesa/-/jobs/34960919
<daniels> anholt: you're right to be suspicious, because that's the exact same thing we saw on the x86-64 machines when the podman move started ...
<daniels> bentiss: ^ is arm still weird?
<bentiss> argh... podman version -> 3.0.1...
<daniels> too old?
<bentiss> yes, that version was producing those exact errors
<bentiss> damn, I thought it was fine, but the repo with the latest podman only builds x86 dpkgs
<daniels> ugh ...
<bentiss> I guess if I enable a Testing repo on stable it won't be pretty
<daniels> shouldn't the Go stuff all be statically linked?
<bentiss> we'll find out :)
<daniels> heh
<bentiss> ok, just managed to upgrade to podman 3.4
* daniels crosses fingers
<daniels> retry succeeded \o/
<bentiss> we'll have to see if it fails in a couple of hours
<bentiss> alright both have been updated
alyssa has joined #freedesktop
<alyssa> uh oh
<daniels> alyssa: yeah, that's the exact thing that just got fixed above
<daniels> retries have now succeeded
<alyssa> +1 thanks
<alyssa> will reassign
alyssa has left #freedesktop [#freedesktop]
<bentiss> so... 3.4.2 is still better than 3.0, but we still don't have the latest CVE fixes :(
<bentiss> I'll go afk a bit. Please ping me if there are more arm64 issues. As a last resort we will have to recompile podman from scratch, but then we'll also need to recompile golang and quite a few dependencies
jarthur has joined #freedesktop
snuckls_ has joined #freedesktop
miracolix has quit [Ping timeout: 480 seconds]
kn has joined #freedesktop
AbleBacon has joined #freedesktop
Leopold_ has quit []
Leopold has joined #freedesktop
Haaninjo has joined #freedesktop
___nick___ has quit [Ping timeout: 480 seconds]
mvlad has quit [Remote host closed the connection]
pixelcluster has quit [Ping timeout: 480 seconds]
snuckls_ has quit [Remote host closed the connection]
damian has quit [Read error: Connection reset by peer]
damian has joined #freedesktop
danvet has quit [Ping timeout: 480 seconds]
Harzilein has joined #freedesktop
<Harzilein> hi
pixelcluster has joined #freedesktop
<Harzilein> is there a more appropriate channel to discuss d-bus?
<Harzilein> otherwise if i don't get any objections, i'll shoot away here
<Harzilein> is there a way (short of deleting it) to make a d-bus service file inactive instead?
<Harzilein> i.e. i want clients to fail when the daemon providing the well-known name isn't already running.
Haaninjo has quit [Quit: Ex-Chat]