ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
Haaninjo has quit [Quit: Ex-Chat]
<DavidHeidelberg[m]> We need s3cp to gain some resilience against DNS and HTTP errors (codes like 50x). 3 retries could be nice. (just looking at a job which failed with 503 on s3cp, and it looks like the first try)
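A minimal sketch of the kind of retry loop being asked for here, assuming the s3cp upload can be wrapped as a callable; the names and retry/backoff numbers are illustrative, not s3cp's actual interface:

    import time

    def upload_with_retries(upload, attempts=3, delay=5):
        # Retry a flaky upload a few times before giving up. `upload` is any
        # callable that raises on DNS failures or 50x responses.
        for attempt in range(1, attempts + 1):
            try:
                return upload()
            except Exception as exc:
                if attempt == attempts:
                    raise
                print(f"upload failed ({exc}), retrying in {delay}s ({attempt}/{attempts})")
                time.sleep(delay)
                delay *= 2  # simple exponential backoff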
miracolix has joined #freedesktop
Leopold has quit [Remote host closed the connection]
Leopold has joined #freedesktop
ybogdano has quit [Ping timeout: 480 seconds]
<airlied> aarrggh a bunch of 500 killed two jobs now
ppascher has joined #freedesktop
genpaku has quit [Remote host closed the connection]
genpaku has joined #freedesktop
Leopold has quit []
Leopold_ has joined #freedesktop
agd5f_ has joined #freedesktop
agd5f has quit [Ping timeout: 480 seconds]
jarthur has quit [Ping timeout: 480 seconds]
GNUmoon has quit [Remote host closed the connection]
GNUmoon has joined #freedesktop
agd5f has joined #freedesktop
itoral has joined #freedesktop
agd5f_ has quit [Ping timeout: 480 seconds]
agd5f_ has joined #freedesktop
AbleBacon has quit [Read error: Connection reset by peer]
i-garrison has quit [Ping timeout: 480 seconds]
agd5f has quit [Ping timeout: 480 seconds]
ximion has quit []
agd5f has joined #freedesktop
agd5f_ has quit [Ping timeout: 480 seconds]
i-garrison has joined #freedesktop
<bentiss> daniels: https://gitlab.freedesktop.org/bolt/bolt/-/jobs/34918634 -> bolt is also using docker:dind
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
miracolix has quit [Remote host closed the connection]
agd5f_ has joined #freedesktop
scrumplex_ has quit []
scrumplex has joined #freedesktop
agd5f has quit [Ping timeout: 480 seconds]
danvet has joined #freedesktop
<bentiss> FWIW, to solve those 500 I probably need to restart the machines
<bentiss> and we have a security update pending, so I might as well do that now
<mupuf> bentiss: yeah, sounds like a good idea
<bentiss> also doing the k3s upgrade first, it's always good to have this too
Leopold_ has quit []
Leopold_ has joined #freedesktop
<bentiss> k3s upgraded to 1.24.9, with a bunch of CVE fixes, and now starting the rolling reboot of all machines; gitlab errors are expected
<hakzsam> CI seems to have a problem here https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20729 ?
<bentiss> hakzsam: see above, rebooting all the kubernetes nodes ATM
<hakzsam> ok
<bentiss> (and I also messed up and rebooted again one node I just rebooted, so that's 2 nodes down right now)
<hakzsam> np, I will wait
<bentiss> FWIW, it takes roughly an hour, and there is also a gitlab security update pending
<Venemo> is gitlab down right now, or is my internet connection bad?
<Venemo> oh I see, it's being restarted
<bentiss> Venemo: yeah, normally it should have been better, but I mistakenly rebooted 2 control plane servers instead of 1, and that's too much for kubernetes to be happy
<Venemo> okay, no worries
<daniels> bentiss: sounds like a rough morning, sorry to hear, but thank you for all the updates
<daniels> DavidHeidelberg[m]: yeah ci-fairy definitely needs resilience; can you please do the same to that as you did to curl?
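One common way to get that resilience in a Python tool like ci-fairy is to mount an urllib3 Retry policy on the requests session; this is only a sketch of the pattern (assuming requests and urllib3 >= 1.26), not ci-fairy's actual code:

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    def make_resilient_session(retries=3, backoff=2):
        # Retry transient server errors (500/502/503/504) with exponential backoff.
        retry = Retry(
            total=retries,
            backoff_factor=backoff,
            status_forcelist=(500, 502, 503, 504),
            allowed_methods=None,  # None = retry all methods, including uploads
        )
        session = requests.Session()
        session.mount("https://", HTTPAdapter(max_retries=retry))
        session.mount("http://", HTTPAdapter(max_retries=retry))
        return session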
<bentiss> daniels: not so rough, while I do the update/reboot, I tend to not have a need to use gitlab, so it doesn't change anything for me :)
<daniels> heh
<bentiss> interesting... it seems that as long as the node is cordoned, ceph does not try to recover the data that disappeared, meaning that when it comes back, we have less rebalancing to do
<bentiss> I still have 2 machines to reboot, but I'd prefer having ceph in a clean state before doing those reboots. Maybe it's time for the gitlab update then :)
mvlad has joined #freedesktop
<bentiss> that upgrade was easy. it's already done
<mupuf> wonderful, thanks a lot!
<bentiss> it still puzzles me how you can reboot/upgrade/break machines and people can still continue to do their work :)
<bentiss> (with some errors, for sure, but still)
<mupuf> The magic of load balancing, I guess
<bentiss> yeah, and kubernetes resilience
<mupuf> and I am sure the bigger orgs would be able to tell the load balancer to stop using a node first, then wait for all active connections to be over before rebooting the node and adding it back
<mupuf> to make it fully transparent
<bentiss> well, they probably also have more redundancy than we have
<mupuf> as for rebooting the load balancer... I guess they do their fuckery with having two machines with the same IP?
<bentiss> because even if ceph is nice, I couldn't configure it to have multiple pods accessing the same disk at the same time
<bentiss> the load balancer issue is solved in 2 ways: add more control plane (3 is the minimum), and BGP to assign the IPs.
<bentiss> we cannot realistically add more machines because they are provided to us for free, and I don't have enough control over BGP to quickly reassign the IP to another machine
<bentiss> and HEALTH_OK as we discuss, time for another reboot
<kusma> Marge bot says she's broken on the inside...
<bentiss> kusma: see above, kind of a bad timing
<kusma> bentiss: But she said it 45 minutes ago and an hour ago as well...
<bentiss> kusma: yes, which is approximately the time I started doing the reboot/upgrades
<kusma> Aha, makes sense then :)
MajorBiscuit has joined #freedesktop
<kusma> How long until things are expected to come up again? Marge seems to still be out...
<bentiss> and starting the reboot process for the last machine
<bentiss> kusma: hopefully 10-15 min
<kusma> OK, cool. Thanks :)
<bentiss> kusma: what actually happened was that as I was rebooting the nodes, marge was getting pushed to a node that was not rebooted yet, so it kept getting kicked out
___nick___ has joined #freedesktop
kusma has quit [Quit: Reconnecting]
kusma has joined #freedesktop
<bentiss> alright, all machines have been rebooted, all disks are back and ceph is HEALTH_OK
kusma has quit []
kusma has joined #freedesktop
<__tim> bah, seconds before my job was about to finish :)
MajorBiscuit has quit [Ping timeout: 480 seconds]
<bentiss> upgrading harbor to 2.7.0 right now
<gkiagia> is it normal that I can't use git over ssh right now? is some service down or did DNS change or something like that?
<eric_engestrom> gkiagia: there was a reboot earlier; retry now (works for me right now)
<gkiagia> I get ssh_dispatch_run_fatal: Connection to 147.75.198.156 port 22: incorrect signature
<bentiss> gkiagia: assuming you are talking about gitlab project, see above
<bentiss> gkiagia: which project?
<gkiagia> pipewire
<gkiagia> pipewire/pipewire and pipewire/wireplumber
<bentiss> gkiagia: both work here
<gkiagia> hm, ok, I'll troubleshoot on my end
<bentiss> gkiagia: it could be related to the gitlab 15.7.5 upgrade from this morning
<bentiss> (we were in 15.7.2 previously)
<bentiss> maybe they made changes in the ssh key they accept, I haven't checked the full logs
Leopold_ has quit [Remote host closed the connection]
Leopold has joined #freedesktop
MajorBiscuit has joined #freedesktop
Leopold has quit [Remote host closed the connection]
Leopold has joined #freedesktop
ximion has joined #freedesktop
agd5f has joined #freedesktop
agd5f_ has quit [Ping timeout: 480 seconds]
Leopold has quit [Remote host closed the connection]
Leopold has joined #freedesktop
itoral has quit [Remote host closed the connection]
Leopold has quit [Remote host closed the connection]
Leopold_ has joined #freedesktop
<bentiss> daniels: so I switched all the x86 runners to podman now. And while trying the arm ones... the script is not in good shape, no?
<daniels> bentiss: it should be fine? I redid them all not too long ago
<bentiss> well, AFAICT they must have updated the image, so on first boot it doesn't update the kernel or regenerate the initramfs, and the /boot/initrd symlink is not present
<daniels> 2022-10-28T15:12:43Z
<daniels> damn ...
<daniels> but yeah, I recreated all the Arm ones <3mo ago and updated the script to fit anything that was required then
<bentiss> yeah, and thanks for that.
vsyrjala_ is now known as vsyrjala
<daniels> I can take a look at what's going wrong later this afternoon
<bentiss> daniels: I'm at the point where I can also spend some time on that too
agd5f_ has joined #freedesktop
ndufresne has quit [Excess Flood]
ndufresne has joined #freedesktop
agd5f has quit [Ping timeout: 480 seconds]
agd5f_ has quit [Ping timeout: 480 seconds]
agd5f has joined #freedesktop
ximion has quit []
agd5f_ has joined #freedesktop
agd5f has quit [Ping timeout: 480 seconds]
agd5f has joined #freedesktop
agd5f_ has quit [Ping timeout: 480 seconds]
<bentiss> daniels: alright, all done. All of the runners are now deploying fine with the scripts, and are using podman (as root). I'll need to work on docker-free-space to ensure basic cleanup and rootless containers, but maybe not today
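For reference, a rough sketch of what a docker-free-space style cleanup could look like on a podman runner; the threshold and storage path here are assumptions, not what the real script uses:

    import shutil
    import subprocess

    MIN_FREE_BYTES = 20 * 1024**3  # assumed threshold: 20 GiB

    def prune_if_low(storage="/var/lib/containers"):
        # Prune unused images when the container storage volume runs low on space.
        if shutil.disk_usage(storage).free < MIN_FREE_BYTES:
            # "podman image prune -a -f" removes all images not used by a container
            subprocess.run(["podman", "image", "prune", "-a", "-f"], check=True)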
Leopold_ has quit [Ping timeout: 480 seconds]
MajorBiscuit has quit [Quit: WeeChat 3.6]
<daniels> bentiss: oh wow, nice, thank you!
___nick___ has quit []
Leopold_ has joined #freedesktop
<bentiss> hopefully we'll see some improvements on the registry pulls now that all of our runners are using harbor as a cache
ybogdano has joined #freedesktop
<daniels> \o/
___nick___ has joined #freedesktop
<bentiss> we got 84 GB of images since yesterday, and I wonder how much we will have in a week. I set the "keep the images since the last pull" retention to 14 days, and we'll see how it goes after a week
<DavidHeidelberg[m]> btw, I hope you guys are happy when I clean up my registry images once a month, haha.
___nick___ has joined #freedesktop
<bentiss> DavidHeidelberg[m]: as long as it pulls the images only once, it should be fine
<DavidHeidelberg[m]> bentiss: but it has to be stored somewhere anyway, right?
<bentiss> DavidHeidelberg[m]: but you could also point it at harbor.freedesktop.org/cache, this way it won't add to the GCS bill
<bentiss> DavidHeidelberg[m]: the problem we have is that the runners are pulling the images twice per job, directly from GCS, which tends to get quite costly
<bentiss> so if you pull it once a month, that's fine
<DavidHeidelberg[m]> s3 is not on GCS, right?
<DavidHeidelberg[m]> bentiss: do you have some picture of what is where?
<bentiss> s3 is not, no
<bentiss> heh, I was seeing another pull from registry-mirror... and started to investigate just to find out that it was my local CI pulling images from there :)
<bentiss> well, vm2c
bl4ckb0ne has joined #freedesktop
<bl4ckb0ne> hi there
<bl4ckb0ne> hitting python issues on piglit CI
<bl4ckb0ne> > ERROR: Job failed: failed to pull image "python:3.10" with specified policies [always]: initializing source docker://python:3.10: reading manifest 3.10 in quay.io/libpod/python: manifest unknown: manifest unknown (manager.go:237:0s)
<bl4ckb0ne> same happens from python 3.7 to 3.10
<bentiss> bl4ckb0ne: that one is on me I guess
miracolix has joined #freedesktop
<bl4ckb0ne> thanks
<bentiss> let me disable the docker mirrors, might just be easiest
<bentiss> bl4ckb0ne: should be good now. I have restarted the failed jobs manually
<bl4ckb0ne> thanks
<bl4ckb0ne> pipeline fixed o/
agd5f_ has joined #freedesktop
agd5f has quit [Ping timeout: 480 seconds]
<__tim> no worries, thanks for all your work on the infrastructure :)
agd5f has joined #freedesktop
agd5f_ has quit [Ping timeout: 480 seconds]
agd5f_ has joined #freedesktop
<daniels> ^^^^^
agd5f has quit [Ping timeout: 480 seconds]
<DavidHeidelberg[m]> while s3cp retry support could help, I'm seeing 503 failures even with 3 retries (over a proxy though): https://gitlab.freedesktop.org/mesa/mesa/-/jobs/34956436
agd5f_ has quit [Ping timeout: 480 seconds]
<daniels> DavidHeidelberg[m]: tbf they're only 7 seconds apart ...
agd5f has joined #freedesktop
<DavidHeidelberg[m]> wait_retry = 32 :/ by the last try it should wait
<DavidHeidelberg[m]> daniels: should I write MR for the wget -> curl switch?
<daniels> hopefully that's not in ms :P
<daniels> DavidHeidelberg[m]: yeah please
<DavidHeidelberg[m]> daniels: --waitretry=seconds but who knows.....
<daniels> though given that involves a full rebuild (i.e. pain), I suspect it would be better to do the ci-fairy side first, then aggregate the curl switch into one MR with a ci-templates uprev, so we only have to go through the full thing once
<DavidHeidelberg[m]> makes sense
<anholt> failure I'm suspicious of since container storage has been played with: https://gitlab.freedesktop.org/anholt/mesa/-/jobs/34960919
<daniels> anholt: you're right to be suspicious, because that's the exact same thing we saw on the x86-64 machines when the podman move started ...
<daniels> bentiss: ^ is arm still weird?
<bentiss> argh... podman version -> 3.0.1...
<daniels> too old?
<bentiss> yes, that version was producing those exact errors
<bentiss> damn, I thought it was fine, but the repo with the latest podman only builds x86 dpkgs
<daniels> ugh ...
<bentiss> I guess if I enable a Testing repo on stable it won't be pretty
<daniels> shouldn't the Go stuff all be statically linked?
<bentiss> we'll find out :)
<daniels> heh
<bentiss> ok, just managed to upgrade to podman 3.4
* daniels crosses fingers
<daniels> retry succeeded \o/
<bentiss> we'll have to see if it fails in a couple of hours
<bentiss> alright both have been updated
alyssa has joined #freedesktop
<alyssa> uh oh
<daniels> alyssa: yeah, that's the exact thing that just got fixed above
<daniels> retries have now succeeded
<alyssa> +1 thanks
<alyssa> will reassign
alyssa has left #freedesktop [#freedesktop]
<bentiss> so... 3.4.2 is still better than 3.0, but we still don't have the latest CVE fixes :(
<bentiss> I'll go afk a bit. Please ping me if there are more arm64 issues. As a last resort we will have to recompile podman from scratch, but then we'll also need to recompile golang and quite a few dependencies
jarthur has joined #freedesktop
snuckls_ has joined #freedesktop
miracolix has quit [Ping timeout: 480 seconds]
kn has joined #freedesktop
AbleBacon has joined #freedesktop
Leopold_ has quit []
Leopold has joined #freedesktop
Haaninjo has joined #freedesktop
___nick___ has quit [Ping timeout: 480 seconds]
mvlad has quit [Remote host closed the connection]
pixelcluster has quit [Ping timeout: 480 seconds]
snuckls_ has quit [Remote host closed the connection]
damian has quit [Read error: Connection reset by peer]
damian has joined #freedesktop
danvet has quit [Ping timeout: 480 seconds]
Harzilein has joined #freedesktop
<Harzilein> hi
pixelcluster has joined #freedesktop
<Harzilein> is there a more appropriate channel to discuss d-bus?
<Harzilein> otherwise if i don't get any objections, i'll shoot away here
<Harzilein> is there a way (short of deleting it) to make a d-bus service file inactive instead?
<Harzilein> i.e. i want clients to fail when the daemon providing the well-known name isn't already running.
Haaninjo has quit [Quit: Ex-Chat]