#freedesktop on 2022-09-27 — irc logs at oftc.irclog.whitequark.org

2022-08-14 19:45 ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org

00:07 <karolherbst> something is up with runner "#2605 (Jda81xmt) fdo-equinix-m3l-9", a lot of virgl jobs time out on that

00:08 <karolherbst> daniels: is that something you can take care of? or is that somebody else's responsibility?

00:15 <karolherbst> though not all jobs seem to fail on that one..

00:22 <karolherbst> anyway.. that's blocking people's MRs so would be cool if one could deal with it :)

00:22 <karolherbst> and for anyone else seeing this: restarting the hung jobs helps, just need to be quick enough

00:52 ngcortes has quit [Remote host closed the connection]

01:08 chip_x has quit [Remote host closed the connection]

02:26 Leopold_ has quit [Remote host closed the connection]

02:46 jrayhawk has quit [Quit: Lost terminal]

02:48 jrayhawk has joined #freedesktop

03:05 ximion has quit []

03:29 dakr has quit [Ping timeout: 480 seconds]

06:10 ofourdan has joined #freedesktop

06:17 danvet has joined #freedesktop

07:05 vbenes has joined #freedesktop

07:07 <daniels> karolherbst: thanks for narrowing that down, gave it a kick now

07:10 <bentiss> daniels: Hey, not sure if you followed, but I have started the migration out from the current data center. Everything is correct right now, but I am focusing on the HDDs now, I'll take care of the SSDs once that part is over

07:10 <bentiss> (and nobody complained in the past 12 hours, so I guess it's transparent :-P )

07:10 <daniels> bentiss: I did see that, thanks so much! :)

07:11 <bentiss> daniels: I also had a look at whether we could upgrade the cluster a bit, and I went to the "let's keep things as they are" path, because I really do not want to migrate to a new cluster :/

07:12 <bentiss> but we should be able to upgrade to k3s 1.24 AFAICT, there are a few charts to change to 1.25 unfortunately

07:13 <bentiss> daniels: and going from wireguard to vxlan as the flannel backend is possible to do with a live cluster, but it will introduce some downtime (because pods on vxlan nodes won't be able to contact pods on wireguard)

07:14 * daniels nods

07:16 <daniels> I did read the mails last night indeed, that was quick

07:16 <daniels> a full rebuild might be good at some point for dual-stack, but eh ...

07:17 <bentiss> yeah, but I also hope that the kubelet limitation will be lifted at some point :)

07:26 vbenes has quit []

07:26 mvlad has joined #freedesktop

07:26 vbenes has joined #freedesktop

07:45 vbenes has quit []

07:45 vbenes has joined #freedesktop

08:02 ximion has joined #freedesktop

08:15 fahien has joined #freedesktop

08:21 Leopold_ has joined #freedesktop

08:24 Leopold_ has quit [Remote host closed the connection]

08:28 Leopold_ has joined #freedesktop

08:40 MajorBiscuit has joined #freedesktop

08:51 ximion has quit []

08:52 ximion has joined #freedesktop

09:56 i-garrison has quit [Ping timeout: 480 seconds]

09:58 <bentiss> daniels: just in case, I have now upgraded to k3s 1.21.14, you might want to upgrade your local kubectl (but I'll try to upgrade up to 1.24 today)

09:58 <daniels> bentiss: ahhh thankyou, will do when I get back home - I should upgrade helm(file) as well probably? or are you still back on an older version?

09:59 chipxxx has joined #freedesktop

10:00 chipxxx has quit [Remote host closed the connection]

10:00 chipxxx has joined #freedesktop

10:00 <bentiss> daniels: haven't touched helm(file) now, I needed to upgrade k3s first so we can use the new cronjob API definitions. The beta one is scheduled to be removed in 1.25

10:01 * daniels nods

10:01 <bentiss> anyway, I'll continue working on that this afternoon, so there is no rush in upgrading on your side

10:02 <daniels> bentiss: realistically I'm not going to have time to work on infra stuff until after XDC (been buried in project stuff in between the two confs), but I'll keep an eye on what's going on and I can ping you in a couple of weeks to find out what's still left to be done :)

10:02 <daniels> thankyou!

10:03 <bentiss> daniels: sounds good

10:12 AbleBacon has quit [Read error: Connection reset by peer]

10:54 Leopold_ has quit [Remote host closed the connection]

10:54 Leopold_ has joined #freedesktop

11:07 i-garrison has joined #freedesktop

11:09 fahien has quit []

11:15 chaim has joined #freedesktop

12:43 alatiera has quit [Ping timeout: 480 seconds]

14:11 <zmike> something going on with git today I guess?

14:11 <zmike> getting pubkey denied

14:12 <Prf_Jakob> Same

14:12 <bentiss> I am doing kubernetes upgrades in the background

14:13 <bentiss> zmike: would you mind retrying?

14:13 <Prf_Jakob> Or when I try to push I get "pre-receive hook declined", and when I push again I get ssh connection refused.

14:13 <Prf_Jakob> Now I'm only getting connection refused.

14:16 <bentiss> one server is still waiting to upgrade... seems that it introduces some connectivity issues

14:18 <bentiss> Prf_Jakob: mind testing again?

14:19 <zmike> bentiss: it's trying really hard

14:19 <Prf_Jakob> ! [remote rejected] main -> main (pre-receive hook declined)

14:19 <Prf_Jakob> error: failed to push some refs to 'gitlab.freedesktop.org:monado/utilities/metrics.git'

14:19 <zmike> same

14:19 <bentiss> monado too?

14:20 <zmike> mesa

14:21 JoshuaAshton has joined #freedesktop

14:21 <JoshuaAshton> Hey, is anyone else not able to push currently?

14:21 <JoshuaAshton> remote: GitLab: Internal API unreachable

14:21 <zmike> yep

14:21 <emersion> JoshuaAshton has the same issue with a wayland-protocols fork

14:21 <bentiss> OK, looks like if gitaly is on the new datacenter, things are not happy

14:22 Haaninjo has joined #freedesktop

14:27 <bentiss> zmike, emersion, JoshuaAshton, Prf_Jakob: when did that start? ~10 min ago or 12h ago?

14:27 <zmike> 10min

14:27 <emersion> yea

14:27 <bentiss> ok, so that might be the k8s 1.22 upgrade

14:27 <Prf_Jakob> Yeah 10min ish

14:27 <bentiss> thanks

14:29 <JoshuaAshton> bentiss: Like 10 mins ago

14:29 <JoshuaAshton> It started before the site went down

14:29 <JoshuaAshton> then the site went down

14:29 <JoshuaAshton> then it came back

14:29 <JoshuaAshton> and it was still frogged

14:32 dakr has joined #freedesktop

14:34 <emersion> 🐸

14:38 <jekstrand> I'm getting weird errors trying to push

14:38 <emersion> yup, known issue

14:38 <karolherbst> :( just wanted to say the same

14:39 <jekstrand> Ok, cool.

14:54 <daniels> should be working again now? all the gitaly servers are showing as up

14:55 <zmike> initial testing says no

14:55 <bentiss> daniels: it's an issue with the pods not able to contact the control plane

14:55 <bentiss> so the ones that are running are OK, but it is still failing

14:55 <daniels> oh huh, ok

14:56 chipxxx has quit [Read error: Connection reset by peer]

14:56 <bentiss> yeah, the thing is we used to be able to override the IPs in the TLS certs for the control plane, and it seems 1.22 ignores that

14:56 <daniels> ouch ...

14:57 <bentiss> and you can not downgrade a HA cluster :(

15:05 <bentiss> that's weird, I can pull fine on mesa

15:06 <daniels> yeah, so the pull in that case goes fairly directly from Workhorse -> Gitaly, but the pushes are more heavily mediated by Rails

15:07 <bentiss> right, push on gitaly-0 fails

15:08 <bentiss> let me try to see if moving gitaly-0 to the old datacenter fixes that

15:10 <bentiss> daniels: so moving gitlab-shell back to the old datacenter seemed to help

15:14 alanc has quit [Remote host closed the connection]

15:14 <bentiss> emersion, jekstrand, karolherbst, zmike, JoshuaAshton: is it better now?

15:15 alanc has joined #freedesktop

15:15 <zmike> yes

15:15 <bentiss> ok, thanks!

15:15 <zmike> thanks!

15:16 <karolherbst> \o/

15:17 <emersion> daniels: i've put up a proposal at https://gitlab.freedesktop.org/freedesktop/freedesktop/-/issues/459

15:23 <jekstrand> bentiss: Working great!

15:23 <Prf_Jakob> Works!

15:23 craftyguy has quit [Ping timeout: 480 seconds]

15:24 <bentiss> I am not guaranteeing I won't break it in the next hour or so, the cluster is not in a very good shape

15:29 kem has quit [Ping timeout: 480 seconds]

15:34 craftyguy has joined #freedesktop

15:37 <__tim> having problems git pushing via ssh, just seems to time out after a while (connection closed by $ip), that's because you're reconfiguring things?

15:38 kem has joined #freedesktop

15:38 <bentiss> __tim: yeah, trying to fix the issue we were having in the past hour

15:39 <__tim> ah, cool, thanks

15:41 <Venemo> is gitlab down for everyone or is it just me?

15:42 <daniels> Venemo: ^ see all previous discussion

15:42 <Venemo> ouch, sorry, I see

15:48 <bentiss> daniels: I am tempted to spin up new servers on the new data center and then migrate the workload there, and keep the old ones up only for the data

15:51 <daniels> bentiss: hmmm, doesn't that require IP migration to global? or is that already done?

15:52 <bentiss> daniels: nope, we are still in the NY, so the IP is still valid

15:52 <daniels> (assuming it's ing + webservice + sidekiq + etc in DC and only ceph/gitaly in NY)

15:52 <daniels> ohhh right, I see, the c2.mediums

15:52 <daniels> right

15:52 <daniels> (sorry, I've been on a billion hours of calls and my brain is melted)

15:52 <bentiss> the question is do I try to enable dual-stack in the new servers

15:53 <daniels> honestly, given that you're already here and burying time (I can offer some to assist late tonight but not for a little while), I think it's probably better to have a slightly longer downtime and try it, if it saves disruption later?

15:53 <bentiss> because we can not change the node-ip of a running server, (just tried that with server-2)

15:53 <daniels> ugh

15:54 <bentiss> k, let me try to see if that works when I try to enable dual-stack on the new server

15:54 <daniels> gtg sorry - I'll be back later, will keep an eye on IRC at least

15:54 <bentiss> no worries. I'll probably have to go soon too, but now, some pods are not scheduling properly, so it should be fine, but we might simply have a meltdown during the night

16:07 MajorBiscuit has quit [Ping timeout: 480 seconds]

16:39 ngcortes has joined #freedesktop

17:03 <bentiss> daniels: giving up for today. Adding a new server almost works, but "The Equinix Metal cloud provider does not support InstancesV2" which means that it never gets the elastic IP bound

17:03 <bentiss> which is not so much of an issue right now, because everything seems to still stay up

17:04 <bentiss> and nginx on that node is then not used :)

17:04 AbleBacon has joined #freedesktop

17:16 kem has quit [Ping timeout: 480 seconds]

17:24 kem has joined #freedesktop

17:35 <karolherbst> slowly it becomes unbearable without giphy on IRC, can we move to something else? :D

17:39 <daniels> bentiss: ok, let me see if I can upgrade ccp later tonight

18:06 ybogdano has joined #freedesktop

18:23 Leopold_ has quit []

18:32 ybogdano has quit [Quit: The Lounge - https://thelounge.chat]

18:34 jarthur has quit [Quit: Textual IRC Client: www.textualapp.com]

18:40 jarthur has joined #freedesktop

19:01 dakr has quit [Read error: Connection reset by peer]

19:05 dakr has joined #freedesktop

19:08 Leopold has joined #freedesktop

19:51 alatiera has joined #freedesktop

20:05 thaller is now known as Guest1571

20:05 thaller has joined #freedesktop

20:12 Guest1571 has quit [Ping timeout: 480 seconds]

20:21 mvlad has quit [Remote host closed the connection]

20:25 ngcortes has quit [Ping timeout: 480 seconds]

20:32 Haaninjo has quit [Quit: Ex-Chat]

20:35 thaller has quit [Remote host closed the connection]

20:35 thaller has joined #freedesktop

20:38 thaller has quit [Remote host closed the connection]

20:43 alpernebbi has quit [Ping timeout: 480 seconds]

20:44 alpernebbi has joined #freedesktop

20:47 ngcortes has joined #freedesktop

21:16 ybogdano has joined #freedesktop

21:25 danvet has quit [Ping timeout: 480 seconds]

21:54 <karolherbst> daniels: .... let me guess, you are upgrading something?

21:57 <karolherbst> ahh.. seems to work now after a looong delay

21:58 <karolherbst> ehh wait.. no, it's still waiting

22:02 <jekstrand> Yeah, it's hung up for me too

22:13 <bentiss> yay, gitlab shell still migrated to the new datacenter, kicking them back home

22:14 <agd5f> same here

22:15 <bentiss> should be good now

22:15 <karolherbst> still throwing 502

22:17 <bentiss> damn, now webservice pods are not happy :(

22:33 <bentiss> the database is not accepting new connections, even for webservice, so moving it to the new datacenter too

22:37 danilo has joined #freedesktop

22:37 <zmike> oh no not again

22:39 dakr has quit [Ping timeout: 480 seconds]

22:39 danilo has quit []

22:39 dakr has joined #freedesktop

22:46 Leopold___ has joined #freedesktop

22:49 Leopold has quit [Ping timeout: 480 seconds]

22:50 <bentiss> restarting the failing server, looks like it has too many defunct tasks :(

22:58 GNUmoon has quit [Quit: Leaving]

22:59 GNUmoon has joined #freedesktop

23:02 GNUmoon has quit [Remote host closed the connection]

23:02 GNUmoon has joined #freedesktop

23:05 GNUmoon has quit [Remote host closed the connection]

23:05 GNUmoon has joined #freedesktop

23:12 <daniels> bentiss: do I need new creds for the new DC, or is it same cert?

23:12 <bentiss> daniels: hopefully the same certs, but I had to upgrade kilo

23:12 <bentiss> it works for me

23:13 <bentiss> (i.e. no changes in certs or wg config)

23:13 GNUmoon has quit [Remote host closed the connection]

23:13 GNUmoon has joined #freedesktop

23:15 GNUtoo has joined #freedesktop

23:16 <GNUtoo> Hi, maybe it is already known but I have 502/504 error when trying to access gitlab

23:17 <GNUtoo> And I've confirmed that the issue is not only on my side by trying to make a new capture through web.archive.org

23:17 <GNUtoo> Though maybe it's temporary or something like that

23:17 <bentiss> GNUtoo: yep, known, actively trying to solve it

23:17 <GNUtoo> thanks

23:25 Trevinho has joined #freedesktop

23:27 <bentiss> daniels: I think I'll reboot all nodes one by one, there is something wrong :(

23:27 jean22 has joined #freedesktop

23:28 jean22 has joined #freedesktop

23:35 <bentiss> ok, rebooting large-12 was enough to make it rejoin the cluster

23:42 GNUmoon has quit [Remote host closed the connection]

23:43 GNUmoon has joined #freedesktop

23:47 Leopold_ has joined #freedesktop

23:48 Leopold___ has quit [Ping timeout: 480 seconds]

23:48 <daniels> bentiss: heh yep, and it definitely seems happier for me now

23:48 <daniels> thankyou!

23:48 GNUmoon has quit [Remote host closed the connection]

23:48 <bentiss> daniels: oh, it is back online :)

23:49 GNUmoon has joined #freedesktop

23:49 jean22 has quit []

23:50 GNUmoon has quit [Remote host closed the connection]

23:50 GNUmoon has joined #freedesktop

23:56 <bentiss> alright, it seems to hold, so I think I'm going to go to bed

23:57 <bentiss> daniels: summary of the evening: I manually upgraded all remaining nodes to 1.22, and then had to reboot large-12 and server-5

23:57 <bentiss> and the fact that everything was down was probably because of ceph being not happy

23:58 <daniels> and on Deb11 as well ... !

23:58 <daniels> btw how did you narrow it down to large-12/server-5? just unhealthy pods, or?

23:59 <bentiss> on the ceph monitor, the osd were either completely down for large-12, or up and down alternatively for server-5

23:59 <daniels> ahh ok, nice

23:59 <daniels> thanks

23:59 <bentiss> also, for large-12, the osd pods (rook-ceph namespace) were Pending

23:59 * daniels nods