<gitlab-bot>
Fdo issue 2 in fdo "Migration of the DB to packet-HA (tentative: June 13, 2021)" [Opened]
<bentiss>
daniels: how is the marge-bot transfer doing?
<daniels>
bentiss: it's already done, I just left the deploy at 0 replicas
<bentiss>
cool
<bentiss>
did you use velero?
<daniels>
I did! :)
<bentiss>
it's nice isn't it?
<daniels>
no PV data to transfer obviously as it's stateless, but yeah, it's really neat
<bentiss>
we should do a regular backup of the whole cluster, without the PVs
<bentiss>
in case I make the same mess I did a few weeks ago
<daniels>
new or old cluster?
<bentiss>
new cluster I was thinking
<bentiss>
I mean, we now have HA, so it's less critical
<bentiss>
but it would be nice to have I think
<daniels>
yeah, that sounds like a good idea & store a copy externally as well?
<bentiss>
yep
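A rough sketch of how such a backup could be scheduled with the velero CLI; the schedule name and cron expression here are illustrative, not the actual fdo config:
  # nightly backup of all API objects, skipping the volume data
  velero schedule create cluster-objects --schedule="0 3 * * *" --snapshot-volumes=false
  # an off-site copy can then be kept by mirroring the velero bucket elsewhere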
<bentiss>
38GB restored out of 55... I doubt we are going to hit the 11:00 UTC deadline
<daniels>
oh well :)
<daniels>
it would be nice if upstream would use pg_restore -j$LOTS to restore instead of feeding the statements in one by one ...
<bentiss>
does -j ensure the ordering?
<bentiss>
maybe I should try velero once again
* daniels
shrugs
<daniels>
tbh it's already got to 2/3rds, I don't think there's much point stopping and pivoting now
<daniels>
we can deal with another hour, it's a nice sunny Sunday
<daniels>
(it's 29C in London so I'm just assuming it has to be nice everywhere else as well, given how rarely this happens)
<daniels>
-j does indeed do correct ordering, but relies on the file being block-accessible
<daniels>
whereas rake does gzip -d | psql
<daniels>
so I think for now probably just leave it be, and next time we need to do a pg major upgrade we can do some experiments first to find the optimal way?
<daniels>
in particular -j offloads index building and does more batching, so that can help a lot with throughput
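For reference, a minimal sketch of the parallel path: -j needs a custom- or directory-format dump rather than plain SQL (file and database names here are assumptions), whereas the rake restore streams plain SQL:
  # dump in custom format so pg_restore can seek into it and parallelise
  pg_dump -Fc -f gitlabhq_production.pgdump gitlabhq_production
  # parallel restore; pg_restore resolves the dependency ordering itself
  pg_restore -j 8 -d gitlabhq_production gitlabhq_production.pgdump
  # versus the current rake path, fed in statement by statement:
  gzip -d -c database.sql.gz | psql gitlabhq_production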
<bentiss>
yeah, I am not planning on stopping the restore just now
<bentiss>
I still wonder why the backup failed with velero. on the old cluster, I just changed the sts so it has the proper size for the disk, maybe that's the reason
<bentiss>
29C here as well, I wonder how it is possible we get the same temperature...
<bentiss>
same result, velero failed :(
<daniels>
ooi where are you running velero from?
<bentiss>
started the command from my own desktop, but the restic/velero tasks are on packet (old-cluster)
<bentiss>
restic is running on the same node as postgresql
<bentiss>
so agent-2 here
<daniels>
ah right, how do you get to the logs on ceph?
<bentiss>
for restic, you can get them through k8s directly
<bentiss>
kubectl -n velero get PodVolumeBackups postgresql-2021-06-13-12-44-ntwnl -o yaml
<bentiss>
for velero normal logs: (on server-2) mc cat velero/velero-backups/backups/postgresql-2021-06-13-12-44/postgresql-2021-06-13-12-44-logs.gz | gunzip
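The velero CLI can pull the same information without reading the bucket directly, e.g.:
  velero backup logs postgresql-2021-06-13-12-44
  velero backup describe postgresql-2021-06-13-12-44 --details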
<daniels>
ah right, that makes sense
<bentiss>
though I can't find more logs than the one last line for restic :(
<bentiss>
the 'error running restic backup, stderr=: signal: killed' is apparently because the pod gets OOM killed, but the node still had free memory when I was checking while the backup was happening
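If it is the restic pod hitting its own memory limit rather than the node running out, bumping the limit on the DaemonSet is the usual workaround; a sketch assuming velero's default "restic" DaemonSet in the velero namespace:
  kubectl -n velero patch daemonset restic --patch \
    '{"spec":{"template":{"spec":{"containers":[{"name":"restic","resources":{"limits":{"memory":"2Gi"}}}]}}}}'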
<bentiss>
that's what I get now, so we must be past the previous step
<daniels>
yep!
<bentiss>
\o/
<bentiss>
FWIW, 1h25min left for sync-ing the uploads bucket
<daniels>
and the ALTER TABLE ci_builds moved about 2min ago
<bentiss>
though I interrupted it previously to prioritise uploads not attached to projects/issues
<bentiss>
so we should be fine switching over to ceph and wait for the sync to finish
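A background sync like that can be left running with mc mirror, roughly (the aliases and bucket names here are assumptions):
  # copy anything still missing and keep watching for new objects
  mc mirror --watch old-minio/uploads ceph/uploads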
<daniels>
right, so it's a post-insert addition of all the public keys
<bentiss>
*in the background
<daniels>
I'd expect ci_build_trace_chunks + ci_build_trace_sections to take the longest; ci_builds itself took around 3min, so probably around the same time for each gives us another ... half an hour, making a wild guess?
* bentiss
hopes no one is taking bets
<daniels>
bentiss: oooh, it's done ...
<daniels>
bentiss: but we need to point webservice at the new db
<bentiss>
daniels: I got everything ready locally
<bentiss>
first, we need to run gitlab-rake db:migrate according to the upgrade script
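With the chart deployment that runs from the task-runner/toolbox pod, roughly (namespace and deployment name assumed):
  kubectl -n gitlab exec -it deploy/gitlab-task-runner -- gitlab-rake db:migrate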
<daniels>
yep
<daniels>
I'm slightly out of date on omnibus btw; I have Grafana back at 7.4.5 instead of 8.0.0 and also I have minio-operator locally which isn't present in HA
<bentiss>
k, no worries
<daniels>
I can do db:migrate if you're still eating
<bentiss>
daniels: my terminal after the db restore
<bentiss>
the clear redis task failed but that is not a problem I think
<daniels>
yeah, that's fine, we should manually clear after bringing up the new one anyway
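The manual clear is just the corresponding rake task, run the same way (deployment name assumed):
  kubectl -n gitlab exec -it deploy/gitlab-task-runner -- gitlab-rake cache:clear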
<bentiss>
gitlab-rake db:migrate done
<daniels>
nice!
<daniels>
I'll scale psql + redis on old cluster down to 0
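A sketch of that scale-down on the old cluster (resource names assumed):
  kubectl -n gitlab scale statefulset/postgresql statefulset/redis --replicas=0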
<bentiss>
k
<daniels>
done
<bentiss>
running diff on the new deployment
<daniels>
btw, just checked and cache:clear is the last task invoked by the backup-utility, so I think we're not missing anything
<bentiss>
\o/
<bentiss>
I deleted the sts postgres (with cascade=orphan to keep the pod running), and running sync as we speak
<bentiss>
(got to remove it or the deployment would fail given that I changed the pvc request)
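For reference, the orphan delete looks roughly like this (StatefulSet name assumed); re-applying the chart then recreates it with the new PVC request and adopts the still-running pod:
  kubectl -n gitlab delete statefulset postgresql --cascade=orphan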
* daniels
knocks incessantly on the wooden table he's sitting at, wooden chair he's sitting in
* bentiss
has an ikea chair not made of wood....
<bentiss>
everything is up according to k3s
<bentiss>
pages is not...
* bentiss
restarts the 3 gitaly pods
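Roughly a rolling restart of the gitaly StatefulSet, or deleting the pods one by one (names assumed):
  kubectl -n gitlab rollout restart statefulset/gitlab-gitaly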
<bentiss>
pages is down because of consolidated storage
<bentiss>
FWIW
<bentiss>
daniels: pushed my changes to omnibus repo, so you can have a look
<bentiss>
apparently, pages not being up was just because it had not synced the replica count for pages (and sidekiq FWIW)
<bentiss>
daniels: are you happy if I re-enable the correct IP so we can get runners working?
<daniels>
awesome, I'm in sync now
<daniels>
pushed marge changes to -config and that worked fine
<daniels>
yeah, I think let's switch the IP and see what happens
<bentiss>
ok IP back up I think
<bentiss>
some jobs passed, so that's good
<daniels>
yep, I can see job logs and artifacts too
<bentiss>
\o/ pages deployment is working :)
<bentiss>
OK, so in 15 minutes the uploads sync will finish and we can call it a day?
<bentiss>
emails are flowing too :)
* bentiss
just started a new backup because I had to interrupt today's
<daniels>
yep, Sidekiq seems very happy, the Rake check tasks are happy too :)
<daniels>
I need to go out and get some groceries etc but I'll be generally around to keep an eye on things
<daniels>
have tried various bits of functionality and so far nothing has fallen over
<daniels>
bentiss: thanks so much for all the migration! hope you enjoy the sunshine
<bentiss>
daniels: thanks, have a nice afternoon too
<daniels>
hrm, one thing when deleting gitaly nodes is that you have to somehow exclude the old one from the weight settings
<daniels>
atm it's complaining that no-replicas is still in there
<daniels>
k, fixed
<bentiss>
cool, thanks
<daniels>
bentiss: weird, large-5 suddenly failing to do anything with 'container runtime is down'
<daniels>
bentiss: ok, something has gone really really strange?
<daniels>
bentiss: oh nm I just can't read, but did you do something with ufw at some point? k3s-large-5 can no longer connect to k3s-server-4 ...
<daniels>
bentiss: I inserted some ufw rules on large-5 and server-4 to allow them to speak to each other (well, just all of 10.99.237.0/24) and all is well with the world
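Such a rule would look roughly like this on each node, opening up the whole private subnet mentioned above:
  sudo ufw insert 1 allow from 10.99.237.0/24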