ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
<karolherbst>
something is up with runner "#2605 (Jda81xmt) fdo-equinix-m3l-9", a lot of virgl jobs time out on that
<karolherbst>
daniels: is that something you can take care of? or is that somebody else's responsibility?
<karolherbst>
though not all jobs seem to fail on that one..
<karolherbst>
anyway.. that's blocking people's MRs, so it would be cool if someone could deal with it :)
<karolherbst>
and for anyone else seeing this: restarting the hung jobs helps, you just need to be quick enough
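(For anyone scripting that workaround: a hung job can also be retried through the GitLab REST API; a minimal sketch, where the project ID, job ID, and token are placeholders:)

    # retry a single CI job via the GitLab API; IDs and token are placeholders
    curl --request POST \
         --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
         "https://gitlab.freedesktop.org/api/v4/projects/<project-id>/jobs/<job-id>/retry"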
ngcortes has quit [Remote host closed the connection]
chip_x has quit [Remote host closed the connection]
Leopold_ has quit [Remote host closed the connection]
jrayhawk has quit [Quit: Lost terminal]
jrayhawk has joined #freedesktop
ximion has quit []
dakr has quit [Ping timeout: 480 seconds]
ofourdan has joined #freedesktop
danvet has joined #freedesktop
vbenes has joined #freedesktop
<daniels>
karolherbst: thanks for narrowing that down, gave it a kick now
<bentiss>
daniels: Hey, not sure if you followed, but I have started the migration out of the current data center. Everything is running correctly right now; I am focusing on the HDDs first, and I'll take care of the SSDs once that part is over
<bentiss>
(and nobody complained in the past 12 hours, so I guess it's transparent :-P )
<daniels>
bentiss: I did see that, thanks so much! :)
<bentiss>
daniels: I also had a look at whether we could upgrade the cluster a bit, and I went down the "let's keep things as they are" path, because I really do not want to migrate to a new cluster :/
<bentiss>
but we should be able to upgrade to k3s 1.24 AFAICT; there are a few charts to change for 1.25, unfortunately
<bentiss>
daniels: and going from wireguard to vxlan as the flannel backend is possible on a live cluster, but it will introduce some downtime (because pods on vxlan nodes won't be able to contact pods on wireguard nodes)
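(Context: k3s selects the flannel backend at server start-up, so the switch described here amounts to restarting each server with a different flag; a sketch, not the exact rollout procedure:)

    # per-server flag at start-up...
    k3s server --flannel-backend=vxlan

    # ...or the equivalent entry in /etc/rancher/k3s/config.yaml
    flannel-backend: vxlan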
* daniels
nods
<daniels>
I did read the mails last night indeed, that was quick
<daniels>
a full rebuild might be good at some point for dual-stack, but eh ...
<bentiss>
yeah, but I also hope that the kubelet limitation will be lifted at some point :)
vbenes has quit []
mvlad has joined #freedesktop
vbenes has joined #freedesktop
vbenes has quit []
vbenes has joined #freedesktop
ximion has joined #freedesktop
fahien has joined #freedesktop
Leopold_ has joined #freedesktop
Leopold_ has quit [Remote host closed the connection]
Leopold_ has joined #freedesktop
MajorBiscuit has joined #freedesktop
ximion has quit []
ximion has joined #freedesktop
i-garrison has quit [Ping timeout: 480 seconds]
<bentiss>
daniels: just in case, I have now upgraded to k3s 1.21.14, you might want to upgrade your local kubectl (but I'll try to upgrade up to 1.24 today)
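(Checking the client/server skew after such an upgrade is a one-liner; kubectl officially supports being one minor version ahead of or behind the API server:)

    # print client and server versions side by side
    kubectl version --short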
<daniels>
bentiss: ahhh, thank you, will do when I get back home - I should probably upgrade helm(file) as well? or are you still on an older version?
chipxxx has joined #freedesktop
chipxxx has quit [Remote host closed the connection]
chipxxx has joined #freedesktop
<bentiss>
daniels: haven't touched helm(file) yet; I needed to upgrade k3s first so we can use the new CronJob API definitions. The beta one is scheduled to be removed in 1.25
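(The chart changes in question are essentially an apiVersion bump, since batch/v1 CronJob is available from Kubernetes 1.21 and batch/v1beta1 is removed in 1.25; a minimal sketch:)

    # before: removed in Kubernetes 1.25
    apiVersion: batch/v1beta1
    kind: CronJob

    # after: GA since Kubernetes 1.21
    apiVersion: batch/v1
    kind: CronJob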
* daniels
nods
<bentiss>
anyway, I'll continue working on that this afternoon, so there is no rush in upgrading on your side
<daniels>
bentiss: realistically I'm not going to have time to work on infra stuff until after XDC (I've been buried in project stuff in between the two confs), but I'll keep an eye on what's going on, and I can ping you in a couple of weeks to find out what's still left to be done :)
<daniels>
thankyou!
<bentiss>
daniels: sounds good
AbleBacon has quit [Read error: Connection reset by peer]
Leopold_ has quit [Remote host closed the connection]
Leopold_ has joined #freedesktop
i-garrison has joined #freedesktop
fahien has quit []
chaim has joined #freedesktop
alatiera has quit [Ping timeout: 480 seconds]
<zmike>
something going on with git today I guess?
<zmike>
getting pubkey denied
<Prf_Jakob>
Same
<bentiss>
I am doing kubernetes upgrades in the background
<bentiss>
zmike: would you mind retrying?
<Prf_Jakob>
Or when I try to push I get "pre-receive hook declined", and when I push again I get ssh connection refused.
<Prf_Jakob>
Now I'm only getting connection refused.
<bentiss>
one server is still waiting to upgrade... seems that it introduces some connectivity issues
<bentiss>
Prf_Jakob: mind testing again?
<zmike>
bentiss: it's trying really hard
<Prf_Jakob>
! [remote rejected] main -> main (pre-receive hook declined)
<Prf_Jakob>
error: failed to push some refs to 'gitlab.freedesktop.org:monado/utilities/metrics.git'
<zmike>
same
<bentiss>
monado too?
<zmike>
mesa
JoshuaAshton has joined #freedesktop
<JoshuaAshton>
Hey, is anyone else not able to push currently?
<JoshuaAshton>
remote: GitLab: Internal API unreachable
<zmike>
yep
<emersion>
JoshuaAshton: same issue here with a wayland-protocols fork
<bentiss>
OK, looks like if gitaly is on the new datacenter, things are not happy
Haaninjo has joined #freedesktop
<bentiss>
zmike, emersion, JoshuaAshton, Prf_Jakob: when did that start? ~10 min ago or 12h ago?
<zmike>
10min
<emersion>
yea
<bentiss>
ok, so that might be the k8s 1.22 upgrade
<Prf_Jakob>
Yeah, 10 min ish
<bentiss>
thanks
<JoshuaAshton>
bentiss: Like 10 mins ago
<JoshuaAshton>
It started before the site went down
<JoshuaAshton>
then the site went down
<JoshuaAshton>
then it came back
<JoshuaAshton>
and it was still frogged
dakr has joined #freedesktop
<emersion>
🐸
<jekstrand>
I'm getting weird errors trying to push
<emersion>
yup, known issue
<karolherbst>
:( just wanted to say the same
<jekstrand>
Ok, cool.
<daniels>
should be working again now? all the gitaly servers are showing as up
<zmike>
initial testing says no
<bentiss>
daniels: it's an issue with the pods not able to contact the control plane
<bentiss>
so the ones that are running are OK, but it is still failing
<daniels>
oh huh, ok
chipxxx has quit [Read error: Connection reset by peer]
<bentiss>
yeah, the thing is we used to be able to override the IPs in the TLS certs for the control plane, and it seems 1.22 ignores that
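(The override referred to here is k3s's --tls-san option, which adds extra hostnames/IPs as subject alternative names on the API server certificate; a sketch of the config.yaml form, with a placeholder address:)

    # /etc/rancher/k3s/config.yaml -- extra SANs for the control-plane cert
    tls-san:
      - 203.0.113.10   # placeholder elastic IP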
<daniels>
ouch ...
<bentiss>
and you can't downgrade an HA cluster :(
<bentiss>
that's weird, I can pull fine on mesa
<daniels>
yeah, so the pull in that case goes fairly directly from Workhorse -> Gitaly, but the pushes are more heavily mediated by Rails
<bentiss>
right, push on gitaly-0 fails
<bentiss>
let me try to see if moving gitaly-0 to the old datacenter fixes that
<bentiss>
daniels: so moving gitlab-shell back to the old datacenter seemed to help
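(One plausible way to pin a component to one datacenter, assuming the nodes carry a distinguishing label, is a per-component nodeSelector in the GitLab chart values; the label and value below are placeholders:)

    # helm values sketch: keep gitlab-shell on the old DC's nodes
    gitlab:
      gitlab-shell:
        nodeSelector:
          topology.kubernetes.io/zone: ny   # placeholder zone label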
alanc has quit [Remote host closed the connection]
<bentiss>
emersion, jekstrand, karolherbst, zmike, JoshuaAshton: is it better now?
<bentiss>
I am not guaranteeing I won't break it in the next hour or so; the cluster is not in very good shape
kem has quit [Ping timeout: 480 seconds]
craftyguy has joined #freedesktop
<__tim>
having problems git pushing via ssh, it just seems to time out after a while (connection closed by $ip) - is that because you're reconfiguring things?
kem has joined #freedesktop
<bentiss>
__tim: yeah, trying to fix the issue we were having in the past hour
<__tim>
ah, cool, thanks
<Venemo>
is gitlab down for everyone or is it just me?
<daniels>
Venemo: ^ see all previous discussion
<Venemo>
ouch, sorry, I see
<bentiss>
daniels: I am tempted to spin up new servers on the new data center and then migrate the workload there, and keep the old ones up only for the data
<daniels>
bentiss: hmmm, doesn't that require IP migration to global? or is that already done?
<bentiss>
daniels: nope, we are still in NY, so the IP is still valid
<daniels>
(assuming it's ing + webservice + sidekiq + etc in DC and only ceph/gitaly in NY)
<daniels>
ohhh right, I see, the c2.mediums
<daniels>
right
<daniels>
(sorry, I've been on a billion hours of calls and my brain is melted)
<bentiss>
the question is: do I try to enable dual-stack on the new servers?
<daniels>
honestly, given that you're already in there and burning time on it (I can offer some to assist late tonight, but not for a little while), I think it's probably better to have a slightly longer downtime and try it, if it saves disruption later?
<bentiss>
because we can't change the node-ip of a running server (just tried that with server-2)
<daniels>
ugh
<bentiss>
k, let me see if that works when I enable dual-stack on the new server
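(Dual-stack in k3s boils down to comma-separated IPv4,IPv6 pairs in the server config; a sketch, all addresses placeholders:)

    # /etc/rancher/k3s/config.yaml on a dual-stack server
    node-ip: 203.0.113.10,2001:db8::10
    cluster-cidr: 10.42.0.0/16,2001:db8:42::/56
    service-cidr: 10.43.0.0/16,2001:db8:43::/112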
<daniels>
gtg sorry - I'll be back later, will keep an eye on IRC at least
<bentiss>
no worries, I'll probably have to go soon too. Some pods are not scheduling properly right now, but it should be fine - though we might simply have a meltdown during the night
MajorBiscuit has quit [Ping timeout: 480 seconds]
ngcortes has joined #freedesktop
<bentiss>
daniels: giving up for today. Adding a new server almost works, but "The Equinix Metal cloud provider does not support InstancesV2", which means it never gets the elastic IP bound
<bentiss>
which is not so much of an issue right now, because everything seems to still stay up
<bentiss>
and nginx on that node is then not used :)
AbleBacon has joined #freedesktop
kem has quit [Ping timeout: 480 seconds]
kem has joined #freedesktop
<karolherbst>
it's slowly becoming unbearable without giphy on IRC; can we move to something else? :D
<daniels>
bentiss: ok, let me see if I can upgrade ccp later tonight