ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
<karolherbst> something is up with runner "#2605 (Jda81xmt) fdo-equinix-m3l-9", a lot of virgl jobs time out on that
<karolherbst> daniels: is that something you can take care of? or is that somebody else's responsibility?
<karolherbst> though not all jobs seem to fail on that one..
<karolherbst> anyway.. that's blocking people's MRs so would be cool if one could deal with it :)
<karolherbst> and for anyone else seeing this: restarting the hung jobs helps, just need to be quick enough
ngcortes has quit [Remote host closed the connection]
chip_x has quit [Remote host closed the connection]
Leopold_ has quit [Remote host closed the connection]
jrayhawk has quit [Quit: Lost terminal]
jrayhawk has joined #freedesktop
ximion has quit []
dakr has quit [Ping timeout: 480 seconds]
ofourdan has joined #freedesktop
danvet has joined #freedesktop
vbenes has joined #freedesktop
<daniels> karolherbst: thanks for narrowing that down, gave it a kick now
<bentiss> daniels: Hey, not sure if you followed, but I have started the migration out from the current data center. Everything is correct right now, but I am focusing on the HDDs now, I'll take care of the SSDs once that part is over
<bentiss> (and nobody complained in the past 12 hours, so I guess it's transparent :-P )
<daniels> bentiss: I did see that, thanks so much! :)
<bentiss> daniels: I also had a look at whether we could upgrade the cluster a bit, and I went to the "let's keep things as they are" path, because I really do not want to migrate to a new cluster :/
<bentiss> but we should be able to upgrade to k3s 1.24 AFAICT, there are a few charts to change to 1.25 unfortunately
<bentiss> daniels: and going from wireguard to vxlan as the flannel backend is possible to do with a live cluster, but it will introduce some downtime (because pods on vxlan nodes won't be able to contact pods on wireguard nodes)
* daniels nods
<daniels> I did read the mails last night indeed, that was quick
<daniels> a full rebuild might be good at some point for dual-stack, but eh ...
<bentiss> yeah, but I also hope that the kubelet limitation will be lifted at some point :)
vbenes has quit []
mvlad has joined #freedesktop
vbenes has joined #freedesktop
vbenes has quit []
vbenes has joined #freedesktop
ximion has joined #freedesktop
fahien has joined #freedesktop
Leopold_ has joined #freedesktop
Leopold_ has quit [Remote host closed the connection]
Leopold_ has joined #freedesktop
MajorBiscuit has joined #freedesktop
ximion has quit []
ximion has joined #freedesktop
i-garrison has quit [Ping timeout: 480 seconds]
<bentiss> daniels: just in case, I have now upgraded to k3s 1.21.14, you might want to upgrade your local kubectl (but I'll try to upgrade up to 1.24 today)
<daniels> bentiss: ahhh thankyou, will do when I get back home - I should upgrade helm(file) as well probably? or are you still back on an older version?
chipxxx has joined #freedesktop
chipxxx has quit [Remote host closed the connection]
chipxxx has joined #freedesktop
<bentiss> daniels: haven't touched helm(file) now, I needed to upgrade k3s first so we can use the new cronjob API definitions. The beta one is scheduled to be removed in 1.25
* daniels nods
<bentiss> anyway, I'll continue working on that this afternoon, so there is no rush in upgrading on your side
<daniels> bentiss: realistically I'm not going to have time to work on infra stuff until after XDC (been buried in project stuff in between the two confs), but I'll keep an eye on what's going on and I can ping you in a couple of weeks to find out what's still left to be done :)
<daniels> thankyou!
<bentiss> daniels: sounds good
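The CronJob API change bentiss refers to is the move from batch/v1beta1 to batch/v1: the stable batch/v1 CronJob is available from Kubernetes 1.21, and batch/v1beta1 is removed in 1.25. A minimal sketch of what updating a manifest involves, assuming a manifest already loaded as a Python dict. The helper name `migrate_cronjob` and the example manifest are hypothetical and are not taken from the freedesktop.org charts.

```python
# Hypothetical helper for illustration; not part of the freedesktop.org tooling.
OLD_API = "batch/v1beta1"  # removed in Kubernetes 1.25
NEW_API = "batch/v1"       # stable CronJob API, available from Kubernetes 1.21


def migrate_cronjob(manifest: dict) -> dict:
    """Return a copy of a CronJob manifest that uses the stable batch/v1 API.

    Manifests of any other kind or apiVersion are returned unchanged.
    """
    if manifest.get("kind") != "CronJob" or manifest.get("apiVersion") != OLD_API:
        return manifest
    updated = dict(manifest)  # shallow copy so the caller's dict is not modified
    updated["apiVersion"] = NEW_API
    return updated


# Made-up manifest, only here to exercise the helper.
example = {
    "apiVersion": "batch/v1beta1",
    "kind": "CronJob",
    "metadata": {"name": "example-cleanup"},
    "spec": {"schedule": "0 3 * * *"},
}

migrated = migrate_cronjob(example)
print(migrated["apiVersion"])
```

In a Helm chart the same change is normally made in the template itself; the sketch only shows which field differs between the two versions.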
AbleBacon has quit [Read error: Connection reset by peer]
Leopold_ has quit [Remote host closed the connection]
Leopold_ has joined #freedesktop
i-garrison has joined #freedesktop
fahien has quit []
chaim has joined #freedesktop
alatiera has quit [Ping timeout: 480 seconds]
<zmike> something going on with git today I guess?
<zmike> getting pubkey denied
<Prf_Jakob> Same
<bentiss> I am doing kubernetes upgrades in the background
<bentiss> zmike: would you mind retrying?
<Prf_Jakob> Or when I try to push I get "pre-receive hook declined", and when I push again I get ssh connection refused.
<Prf_Jakob> Now I'm only getting connection refused.
<bentiss> one server is still waiting to upgrade... seems that it introduces some connectivity issues
<bentiss> Prf_Jakob: mind testing again?
<zmike> bentiss: it's trying really hard
<Prf_Jakob> ! [remote rejected] main -> main (pre-receive hook declined)
<Prf_Jakob> error: failed to push some refs to 'gitlab.freedesktop.org:monado/utilities/metrics.git'
<zmike> same
<bentiss> monado too?
<zmike> mesa
JoshuaAshton has joined #freedesktop
<JoshuaAshton> Hey, is anyone else not able to push currently?
<JoshuaAshton> remote: GitLab: Internal API unreachable
<zmike> yep
<emersion> JoshuaAshton has the same issue with a wayland-protocols fork
<bentiss> OK, looks like if gitaly is on the new datacenter, things are not happy
Haaninjo has joined #freedesktop
<bentiss> zmike, emersion, JoshuaAshton, Prf_Jakob: when did that start? ~10 min ago or 12h ago?
<zmike> 10min
<emersion> yea
<bentiss> ok, so that might be the k8s 1.22 upgrade
<Prf_Jakob> Yeah 10min ish
<bentiss> thanks
<JoshuaAshton> bentiss: Like 10 mins ago
<JoshuaAshton> It started before the site went down
<JoshuaAshton> then the site went down
<JoshuaAshton> then it came back
<JoshuaAshton> and it was still frogged
dakr has joined #freedesktop
<emersion> 🐸
<jekstrand> I'm getting weird errors trying to push
<emersion> yup, known issue
<karolherbst> :( just wanted to say the same
<jekstrand> Ok, cool.
<daniels> should be working again now? all the gitaly servers are showing as up
<zmike> initial testing says no
<bentiss> daniels: it's an issue with the pods not able to contact the control plane
<bentiss> so the ones that are running are OK, but it is still failing
<daniels> oh huh, ok
chipxxx has quit [Read error: Connection reset by peer]
<bentiss> yeah, the thing is we used to be able to override the IPs in the TLS certs for the control plane, and it seems 1.22 ignores that
<daniels> ouch ...
<bentiss> and you can not downgrade a HA cluster :(
<bentiss> that's weird, I can pull fine on mesa
<daniels> yeah, so the pull in that case goes fairly directly from Workhorse -> Gitaly, but the pushes are more heavily mediated by Rails
<bentiss> right, push on gitaly-0 fails
<bentiss> let me try to see if moving gitaly-0 to the old datacenter fixes that
<bentiss> daniels: so moving gitlab-shell back to the old datacenter seemed to help
alanc has quit [Remote host closed the connection]
<bentiss> emersion, jekstrand, karolherbst, zmike, JoshuaAshton: is it better now?
alanc has joined #freedesktop
<zmike> yes
<bentiss> ok, thanks!
<zmike> thanks!
<karolherbst> \o/
<emersion> daniels: i've put up a proposal at https://gitlab.freedesktop.org/freedesktop/freedesktop/-/issues/459
<jekstrand> bentiss: Working great!
<Prf_Jakob> Works!
craftyguy has quit [Ping timeout: 480 seconds]
<bentiss> I am not guaranteeing I won't break it in the next hour or so, the cluster is not in a very good shape
kem has quit [Ping timeout: 480 seconds]
craftyguy has joined #freedesktop
<__tim> having problems git pushing via ssh, just seems to time out after a while (connection closed by $ip), that's because you're reconfiguring things?
kem has joined #freedesktop
<bentiss> __tim: yeah, trying to fix the issue we were having in the past hour
<__tim> ah, cool, thanks
<Venemo> is gitlab down for everyone or is it just me?
<daniels> Venemo: ^ see all previous discussion
<Venemo> ouch, sorry, I see
<bentiss> daniels: I am tempted to spin up new servers on the new data center and then migrate the workload there, and keep the old ones up only for the data
<daniels> bentiss: hmmm, doesn't that require IP migration to global? or is that already done?
<bentiss> daniels: nope, we are still in the NY, so the IP is still valid
<daniels> (assuming it's ing + webservice + sidekiq + etc in DC and only ceph/gitaly in NY)
<daniels> ohhh right, I see, the c2.mediums
<daniels> right
<daniels> (sorry, I've been on a billion hours of calls and my brain is melted)
<bentiss> the question is do I try to enable dual-stack in the new servers
<daniels> honestly, given that you're already here and burning time (I can offer some to assist late tonight but not for a little while), I think it's probably better to have a slightly longer downtime and try it, if it saves disruption later?
<bentiss> because we can not change the node-ip of a running server, (just tried that with server-2)
<daniels> ugh
<bentiss> k, let me try to see if that works when I try to enable dual-stack on the new server
<daniels> gtg sorry - I'll be back later, will keep an eye on IRC at least
<bentiss> no worries. I'll probably have to go soon too, but now, some pods are not scheduling properly, so it should be fine, but we might simply have a meltdown during the night
MajorBiscuit has quit [Ping timeout: 480 seconds]
ngcortes has joined #freedesktop
<bentiss> daniels: giving up for today. Adding a new server almost works, but "The Equinix Metal cloud provider does not support InstancesV2" which means that it never gets the elastic IP bound
<bentiss> which is not so much of an issue right now, because everything seems to still stay up
<bentiss> and nginx on that node is then not used :)
AbleBacon has joined #freedesktop
kem has quit [Ping timeout: 480 seconds]
kem has joined #freedesktop
<karolherbst> slowly it becomes unbearable without giphy on IRC, can we move to something else? :D
<daniels> bentiss: ok, let me see if I can upgrade ccp later tonight
ybogdano has joined #freedesktop
Leopold_ has quit []
ybogdano has quit [Quit: The Lounge - https://thelounge.chat]
jarthur has quit [Quit: Textual IRC Client: www.textualapp.com]
jarthur has joined #freedesktop
dakr has quit [Read error: Connection reset by peer]
dakr has joined #freedesktop
Leopold has joined #freedesktop
alatiera has joined #freedesktop
thaller is now known as Guest1571
thaller has joined #freedesktop
Guest1571 has quit [Ping timeout: 480 seconds]
mvlad has quit [Remote host closed the connection]
ngcortes has quit [Ping timeout: 480 seconds]
Haaninjo has quit [Quit: Ex-Chat]
thaller has quit [Remote host closed the connection]
thaller has joined #freedesktop
thaller has quit [Remote host closed the connection]
alpernebbi has quit [Ping timeout: 480 seconds]
alpernebbi has joined #freedesktop
ngcortes has joined #freedesktop
ybogdano has joined #freedesktop
danvet has quit [Ping timeout: 480 seconds]
<karolherbst> daniels: .... let me guess, you are upgrading something?
<karolherbst> ahh.. seems to work now after a looong delay
<karolherbst> ehh wait.. no, it's still waiting
<jekstrand> Yeah, it's hung up for me too
<bentiss> yay, gitlab shell still migrated to the new datacenter, kicking them back home
<agd5f> same here
<bentiss> should be good now
<karolherbst> still throwing 502
<bentiss> damn, now webservice pods are not happy :(
<bentiss> the database is not accepting new connections, even for webservice, so moving it to the new datacenter too
danilo has joined #freedesktop
<zmike> oh no not again
dakr has quit [Ping timeout: 480 seconds]
danilo has quit []
dakr has joined #freedesktop
Leopold___ has joined #freedesktop
Leopold has quit [Ping timeout: 480 seconds]
<bentiss> restarting the failing server, looks like it has too many defunct tasks :(
GNUmoon has quit [Quit: Leaving]
GNUmoon has joined #freedesktop
GNUmoon has quit [Remote host closed the connection]
GNUmoon has joined #freedesktop
GNUmoon has quit [Remote host closed the connection]
GNUmoon has joined #freedesktop
<daniels> bentiss: do I need new creds for the new DC, or is it same cert?
<bentiss> daniels: hopefully the same certs, but I had to upgrade kilo
<bentiss> it works for me
<bentiss> (i.e. no changes in certs or wg config)
GNUmoon has quit [Remote host closed the connection]
GNUmoon has joined #freedesktop
GNUtoo has joined #freedesktop
<GNUtoo> Hi, maybe it is already known but I have 502/504 error when trying to access gitlab
<GNUtoo> And I've confirmed that the issue is not only on my side by trying to make a new capture through web.archive.org
<GNUtoo> Though maybe it's temporary or something like that
<bentiss> GNUtoo: yep, known, actively trying to solve it
<GNUtoo> thanks
Trevinho has joined #freedesktop
<bentiss> daniels: I think I'll reboot all nodes one by one, there is something wrong :(
jean22 has joined #freedesktop
<bentiss> ok, rebooting large-12 was enough to make it rejoin the cluster
GNUmoon has quit [Remote host closed the connection]
GNUmoon has joined #freedesktop
Leopold_ has joined #freedesktop
Leopold___ has quit [Ping timeout: 480 seconds]
<daniels> bentiss: heh yep, and it definitely seems happier for me now
<daniels> thankyou!
GNUmoon has quit [Remote host closed the connection]
<bentiss> daniels: oh, it is back online :)
GNUmoon has joined #freedesktop
jean22 has quit []
GNUmoon has quit [Remote host closed the connection]
GNUmoon has joined #freedesktop
<bentiss> alright, it seems to hold, so I think I'm going to go to bed
<bentiss> daniels: summary of the evening: I manually upgraded all remaining nodes to 1.22, and then had to reboot large-12 and server-5
<bentiss> and the fact that everything was down was probably because of ceph being not happy
<daniels> and on Deb11 as well ... !
<daniels> btw how did you narrow it down to large-12/server-5? just unhealthy pods, or?
<bentiss> on the ceph monitor, the osd were either completely down for large-12, or up and down alternatively for server-5
<daniels> ahh ok, nice
<daniels> thanks
<bentiss> also, for large-12, the osd pods (rook-ceph namespace) were Pending
* daniels nods
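The way bentiss narrowed the problem down (OSDs completely down on one host, alternating between up and down on another) can be sketched as a small classification over repeated status samples. This is an illustration only: the function name `classify_osd_hosts`, the input format, and the sample values are assumptions, and the real diagnosis was read from the Ceph monitor rather than computed like this.

```python
from typing import Dict, List


def classify_osd_hosts(samples: Dict[str, List[bool]]) -> Dict[str, str]:
    """Classify each host from a series of OSD 'up' observations.

    samples maps a host name to a list of booleans, one per observation,
    where True means the host's OSDs were reported up at that moment.
    """
    result: Dict[str, str] = {}
    for host, states in samples.items():
        if not states:
            result[host] = "unknown"   # nothing observed for this host
        elif all(states):
            result[host] = "up"        # up in every observation
        elif not any(states):
            result[host] = "down"      # down in every observation
        else:
            result[host] = "flapping"  # mixed: sometimes up, sometimes down
    return result


# Made-up observations that mirror the description in the log;
# "healthy-node" is a hypothetical host added for contrast.
samples = {
    "large-12": [False, False, False, False],
    "server-5": [True, False, True, False],
    "healthy-node": [True, True, True, True],
}

for host, state in classify_osd_hosts(samples).items():
    print(f"{host}: {state}")
```

Hosts that come out as "down" or "flapping" are the reboot candidates, which matches the two nodes named in the summary.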