Dmitry2-t has joined #freedesktop
Dmitry2-t has quit [Remote host closed the connection]
jstein has quit [Ping timeout: 480 seconds]
fireduck has joined #freedesktop
fireduck has quit [Remote host closed the connection]
micurtis has joined #freedesktop
micurtis has quit [Remote host closed the connection]
ximion has quit []
crawshaw has joined #freedesktop
crawshaw has quit [autokilled: Suspected spammer. Mail support@oftc.net with questions (2021-06-13 05:17:36)]
MarkOtaris has joined #freedesktop
MarkOtaris has quit [Read error: Connection reset by peer]
chomwitt has joined #freedesktop
yang22 has joined #freedesktop
yang22 has quit [Remote host closed the connection]
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
<bentiss> 30 min warning before we stop gitlab
<bentiss> daniels: around?
<bentiss> daniels: I was thinking, while we take gitlab down, maybe we should also migrate lfs and uploads to packet
<bentiss> those are not so big, and would save a few $ each week
<bentiss> we could do a daily rsync to keep the 'backup' at gcs
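A daily object-store sync like the one bentiss mentions could be done with rclone (a sketch only; the remote and bucket names are assumptions, not the actual fdo configuration):
    # one-way copy of the lfs bucket from ceph (rgw) to the gcs backup bucket
    rclone sync packet-ceph:gitlab-lfs gcs-backup:gitlab-lfs --fast-list --progress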
<bentiss> taking gitlab down *NOW*
<gitlab-bot> freedesktop.org issue 363 in freedesktop "Migration of the DB to packet-HA (tentative: June 13, 2021)" [Infrastructure, Opened]
<bentiss> and temp issue tracker (while things are down): https://gitlab.com/btissoir/fdo/-/issues/2
<gitlab-bot> Fdo issue 2 in fdo "Migration of the DB to packet-HA (tentative: June 13, 2021)" [Opened]
phuzion16 has joined #freedesktop
phuzion16 has quit [Remote host closed the connection]
jzu_ has joined #freedesktop
jzu_ has quit [Remote host closed the connection]
<bentiss> daniels: so, first few issues: velero backup fails, it gets killed...
<bentiss> and I tried doing a manual backup from packet (old cluster), and the backup repo is not set... :)
<bentiss> will retry on packet-HA
<daniels> bentiss: sorry, TZ fail …
<bentiss> daniels: no worries, haven't done much
<bentiss> daniels: I have now synced lfs from gcs to ceph
* daniels nods
<daniels> so Velero is still crashing out?
<bentiss> yep, can't make the postgres backup through velero
<bentiss> doing the redis restore now
lyf has joined #freedesktop
lyf has quit [autokilled: Suspected spammer. Mail support@oftc.net with questions (2021-06-13 08:19:28)]
<bentiss> redis restored properly :)
<bentiss> and db backup properly done on task-runner on packet-HA
* bentiss <- 20 min break for a quick breakfast
<daniels> bentiss: how'd you do the backup?
<daniels> last time I just used pgdump + restore
<bentiss> daniels: backup-utility -t "${fake_timestamp}" --skip registry,uploads,artifacts,lfs,packages,external_diffs,terraform_state,repositories,pages
<bentiss> the advantage is that this already handles all the s3 upload/downloads
<daniels> yeah, that makes sense
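For reference, backup-utility ships in the GitLab chart's task-runner image, so the full invocation is roughly (namespace and deployment name are assumptions):
    kubectl -n gitlab exec -it deploy/gitlab-task-runner -- \
      backup-utility -t "${fake_timestamp}" --skip registry,uploads,artifacts,lfs,packages,external_diffs,terraform_state,repositories,pages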
<bentiss> I am just using https://gitlab.com/gitlab-org/charts/gitlab/raw/v4.12.3/scripts/database-upgrade though the script fails at launching commands with my kubectl
<bentiss> daniels: before doing the restore, I'll probably scale down postgres on packet
<daniels> heh, because you have to specify context?
<bentiss> just in case :)
<bentiss> well, I edited the configmap for taskrunner
<bentiss> but, still never 100% sure it'll do the right thing
* daniels nods
<bentiss> actually, I can just delete the postgres service on old-cluster, that way I am sure task-runner on packet-HA cannot access the old db
<daniels> just double-checked and it looks like nothing should be running which should be doing any access to Postgres
<bentiss> yeah, I have been on a killing spree :)
<daniels> hehe
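Cutting the old DB off as bentiss describes is just a service deletion on the old cluster, something like (context and service names are guesses):
    kubectl --context old-cluster -n gitlab delete service gitlab-postgresql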
<bentiss> still a little bit more than one hour for sync-ing uploads
<bentiss> lfs was quick
<bentiss> 17.5 GB vs 177GB, that might explain :)
<bentiss> hmm... @hashed/f7/4f/f74fee33…cb4/sample.ffvhuff.mkv: 81% /4.714G, 11.603M/s, 1m16s -> we do have big uploads...
* daniels sighs
<bentiss> daniels: FYI, got the exact same bug as https://gitlab.com/gitlab-org/gitlab/-/issues/266988 while starting the postgres db
<gitlab-bot> GitLab.org issue 266988 in gitlab "GitLab 13.4.0: ERROR: must be owner of extensions" [Category:Omnibus Package, Bug, Devops::Enablement, Group::Distribution, Section::Enablement, Severity::3, Closed]
<bentiss> @hashed/f7/4f/f74fee33…837f58/sample.ffv1.mkv: 69% /539.750M, 9.621M/s, 17s -> same hash it looks like
<daniels> bentiss: aha, looks like they're non-fatal and known at least
<bentiss> yep
<daniels> as for the uploads, those would be GStreamer I suppose
<bentiss> yeah, they seem legit, but still 4GB for a sample...
<daniels> srsly.
<bentiss> BTW, this week I ran fstrim on all the pods, which helped a lot reducing the storage used on ceph
<bentiss> I enabled the discard option in the storageclass, so I need to re-spin the 3 gitaly pods for the option to be taken into account
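The fstrim pass bentiss mentions would have been run per pod, along these lines (pod name and mount path are assumptions, and the container needs the privileges to trim its volume):
    kubectl -n gitlab exec gitlab-gitaly-0 -- fstrim -v /home/git/repositories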
<bentiss> @hashed/e0/f0/e0f05da9…28f/rocketleague.trace: 75% /1.205G, 10.159M/s, 30s
<bentiss> @hashed/cb/a2/cba28b89…cbd9a92/BDES.trace.zst: 62% /877.642M, 9.981M/s, 33s
<bentiss> traces are supposed to be that big?
danvet has joined #freedesktop
<daniels> heh
<daniels> they often _are_ that big, yeah
<bentiss> sigh
<daniels> the ones we use (and store in git + MinIO) are far smaller
<daniels> but cutting them down requires a bunch of work up front
<daniels> so users who submit bug reports will often just include a mega-trace along with it
<bentiss> so that means that people just attach raw traces to bugs?
<daniels> yeah, that'd be the uploads
<daniels> tbf we could just cap the max upload size at ... a lot smaller
* bentiss would think twice before attaching a 1GB file
<daniels> I would too, but we are not everyone apparently ...
<bentiss> heh
<bentiss> hmmm, do we also want to host packages on ceph?
<bentiss> I only see one user for now
<daniels> I don't see why not
<bentiss> ok, I'll sync the buckets then
* bentiss forgot to bump the postgres pvc
<bentiss> and done...
<bentiss> we are roughly at 50% of the db restore (26 GB used on the postgres disk)
<daniels> should we delete Redis from the old cluster?
<bentiss> I'd wait for bringing up the new cluster with the new db and then kill the gitlab ns on the old cluster
<bentiss> but if you want to kill it, go for it
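Killing the gitlab namespace on the old cluster later on is a single (destructive) command, roughly:
    kubectl --context old-cluster delete namespace gitlab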
<bentiss> daniels: I am wondering whether we should disable the loadbalancer IP while we spin things up, and use a temp one for the first tests
<daniels> oh yeah, that's a very good idea actually
<bentiss> that's the benefit of having a reserved IP :)
<bentiss> while prep-ing all the changes, I am also enabling consolidated object storage, now that we host everything but the registry
<daniels> hmm, and we need a global IP for EWR1?
<bentiss> nope, we can let k8s request one temporary
<daniels> oh right, non-elastic, gotcha
<daniels> great plan
flexoboto has joined #freedesktop
flexoboto has quit [Remote host closed the connection]
<bentiss> daniels: temp IP for the tests only: 147.75.79.194
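Swapping between the temp and the reserved address amounts to changing the loadBalancerIP on the ingress controller's Service, e.g. (service name is an assumption; in practice this would go through the chart values):
    kubectl -n gitlab patch svc gitlab-nginx-ingress-controller \
      -p '{"spec":{"loadBalancerIP":"147.75.79.194"}}'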
<daniels> just checked the old-cluster NS on HA and there's no redis there, so yeah, we can just kill the namespace all in one I think
<daniels> was just wondering if there was a risk of accessing the wrong one, but it's all fine
<bentiss> daniels: updated https://gitlab.com/btissoir/fdo/-/issues/2 with the temp IP steps
<gitlab-bot> Fdo issue 2 in fdo "Migration of the DB to packet-HA (tentative: June 13, 2021)" [Opened]
<bentiss> daniels: how is the marge-bot transfer doing?
<daniels> bentiss: it's already done, I just left the deploy at 0 replicas
<bentiss> cool
<bentiss> did you use velero?
<daniels> I did! :)
<bentiss> it's nice isn't it?
<daniels> no PV data to transfer obviously as it's stateless, but yeah, it's really neat
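A stateless namespace move like marge-bot with velero boils down to two commands (backup and namespace names are illustrative):
    velero backup create marge-bot --include-namespaces marge-bot
    velero restore create --from-backup marge-bot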
<bentiss> we should do a regular backup of the whole cluster, without the PVs
<bentiss> in case I do the same mess up I did a few weeks ago
<daniels> new or old cluster?
<bentiss> new cluster I was thinking
<bentiss> I mean, we now have HA, so it's less critical
<bentiss> but it would be nice to have I think
<daniels> yeah, that sounds like a good idea & store a copy externally as well?
<bentiss> yep
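A regular backup of the cluster objects without the volume data could be a velero schedule along these lines (name, cron expression and retention left as a sketch):
    velero schedule create cluster-config --schedule "0 3 * * *" --snapshot-volumes=false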
<bentiss> 38 GB restored out of 55... I doubt we are going to hit the 11:00 UTC deadline
msavoritias-M11 has joined #freedesktop
msavoritias-M11 has quit [Remote host closed the connection]
i-garrison has joined #freedesktop
<daniels> oh well :)
cluesch2 has joined #freedesktop
<daniels> it would be nice if upstream would use pg_restore -j$LOTS to restore instead of feeding the statements in one by one ...
<bentiss> does -j ensure the ordering?
<bentiss> maybe I should try velero once again
burn has joined #freedesktop
* daniels shrugs
<daniels> tbh it's already got to 2/3rds, I don't think there's much point stopping and pivoting now
burn has quit [Remote host closed the connection]
<daniels> we can deal with another hour, it's a nice sunny Sunday
<daniels> (it's 29C in London so I'm just assuming it has to be nice everywhere else as well, given how rarely this happens)
<daniels> -j does indeed do correct ordering, but relies on the file being block-accessible
<daniels> whereas rake does gzip -d | psql
blue__penquin has joined #freedesktop
<daniels> so I think for now probably just leave it be, and for next time we need to do a pg major upgrade we can do some experiments first to find the most optimal way?
<daniels> in particular -j offloads index building and does more batching, so that can help a lot with throughput
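For reference, the parallel restore daniels describes needs a block-accessible custom- or directory-format dump rather than the plain gzipped SQL the rake task produces, e.g. (connection options omitted):
    pg_dump -Fc -f gitlabhq_production.dump gitlabhq_production
    pg_restore -j 8 -d gitlabhq_production gitlabhq_production.dump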
<bentiss> yeah, I am not planning on stopping the restore just now
<bentiss> I still wonder why the backup failed with velero. On the old cluster, I just changed the sts so it has the proper size for the disk, maybe that's the reason
<bentiss> 29C here as well, I wonder how it is possible we get the same temperature...
<bentiss> same result, velero failed :(
<daniels> ooi where are you running velero from?
<bentiss> started the command from my own desktop, but the restic/velero tasks are on packet (old-cluster)
<bentiss> restic is running on the same node as postgresql
<bentiss> so agent-2 here
<daniels> ah right, how do you get to the logs on ceph?
<bentiss> for restic, you can get them through k8s directly
<bentiss> kubectl -n velero get PodVolumeBackups postgresql-2021-06-13-12-44-ntwnl -o yaml
<bentiss> for velero normal logs: (on server-2) mc cat velero/velero-backups/backups/postgresql-2021-06-13-12-44/postgresql-2021-06-13-12-44-logs.gz | gunzip
blue__penquin has quit []
<daniels> ah right, that makes sense
<bentiss> though I can't find any more logs than that one last line for restic :(
<bentiss> the 'error running restic backup, stderr=: signal: killed' is apparently because the pod gets OOM killed, but the node still had free memory when I was checking while the backup was happening
<gitlab-bot> vmware-tanzu issue 2073 in velero "PartiallyFailed Backup because restic issue" [Question, Restic, Closed]
* bentiss is having lunch, bbl
kbabioch has joined #freedesktop
kbabioch has quit [Remote host closed the connection]
s-ol has joined #freedesktop
s-ol has quit [Remote host closed the connection]
<daniels> bentiss: at 44GB now so I guess we still have another hour+?
<daniels> heh, indexing the CI job table
<bentiss> seems like it, sigh
<bentiss> daniels: how did you manage to find the "indexing the CI job table"?
<daniels> bentiss: SELECT * FROM pg_stat_activity WHERE usename='gitlab';
<bentiss> heh, OK
<daniels> but it does seem like the data's all there, e.g. the notes table is fully populated
<daniels> which would be our biggest part after job logs + jobs themselves
<bentiss> ADD CONSTRAINT fk_d3130c9a7f FOREIGN KEY (commit_id) REFERENCES public.ci_pipelines(id) ON DELETE CASCADE;|client backend
<daniels> yeah, it's moved on :)
<bentiss> that's what I get now, so we must be past the previous step
<daniels> yep!
<bentiss> \o/
<bentiss> FWIW, 1h25min left for sync-ing the uploads bucket
<daniels> and the ALTER TABLE ci_builds moved about 2min ago
<bentiss> though I interrupted it previously, to prioritise the non-project/issue-attached uploads
<bentiss> so we should be fine switching over to ceph and waiting for the sync to finish
<daniels> right, so it's a post-insert addition of all the public keys
<bentiss> *in the background
<daniels> I'd expect ci_build_trace_chunks + ci_build_trace_sections to take the longest; ci_builds itself took around 3min, so probably around the same time for each gives us another ... half an hour, making a wild guess?
logb0t has joined #freedesktop
logb0t has quit [Remote host closed the connection]
* bentiss hopes no one is taking bets
qyliss10 has joined #freedesktop
qyliss10 has quit [Remote host closed the connection]
<daniels> bentiss: oooh, it's done ...
<daniels> bentiss: but we need to point webservice at the new db
<bentiss> daniels: I got everything ready locally
<bentiss> first, we need to run gitlab-rake db:migrate according to the upgrade script
<daniels> yep
<daniels> I'm slightly out of date on omnibus btw; I have Grafana back at 7.4.5 instead of 8.0.0 and also I have minio-operator locally which isn't present in HA
<bentiss> k, no worries
<daniels> I can do db:migrate if you're still eating
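In this setup db:migrate is just a rake invocation in the task-runner pod, roughly (namespace and deployment name are assumptions):
    kubectl -n gitlab exec -it deploy/gitlab-task-runner -- gitlab-rake db:migrate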
<bentiss> daniels: my terminal after the db restore
<bentiss> the clear redis task failed but that is not a problem I think
<daniels> yeah, that's fine, we should manually clear after bringing up the new one anyway
<bentiss> gitlab-rake db:migrate done
<daniels> nice!
<daniels> I'll scale psql + redis on old cluster down to 0
<bentiss> k
<daniels> done
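Scaling those down amounts to something like (statefulset names are guesses):
    kubectl --context old-cluster -n gitlab scale statefulset gitlab-postgresql gitlab-redis-master --replicas=0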
<bentiss> running diff on the new deployment
<daniels> btw, just checked and cache:clear is the last task invoked by the backup-utility, so I think we're not missing anything
<bentiss> \o/
<bentiss> I deleted the sts postgres (with cascade=orphan to keep the pod running), and running sync as we speak
i-garrison has quit []
<bentiss> (got to remove it or the deployment would fail given that I changed the pvc request)
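The orphan delete bentiss describes is roughly (statefulset name is a guess; older kubectl spells the flag --cascade=false):
    kubectl -n gitlab delete statefulset gitlab-postgresql --cascade=orphan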
* daniels knocks incessantly on the wooden table he's sitting at, wooden chair he's sitting in
* bentiss has an ikea chair not made of wood....
<bentiss> everything is up according to k3s
<bentiss> pages is not...
* bentiss restarts the 3 gitaly pods
<bentiss> pages is down because of consolidated storage
<bentiss> FWIW
<bentiss> daniels: pushed my changes to omnibus repo, so you can have a look
<bentiss> apparently, pages not being up was just because it did not sync the replica count for pages (and sidekiq FWIW)
<bentiss> daniels: are you happy if I re-enable the correct IP so we can get runners working?
<daniels> awesome, I'm in sync now
<daniels> pushed marge changes to -config and that worked fine
<daniels> yeah, I think let's switch the IP and see what happens
<bentiss> ok IP back up I think
<bentiss> some jobs passed, so that's good
<daniels> yep, I can see job logs and artifacts too
<bentiss> \o/ pages deployment is working :)
<bentiss> OK, so in 15 minutes the uploads sync will finish and we can call it a day?
<bentiss> emails are flowing too :)
* bentiss just started a new backup because I had to interrupt today's
cluesch2 has left #freedesktop [#freedesktop]
<daniels> yep, Sidekiq seems very happy, the Rake check tasks are happy too :)
<daniels> I need to go out and get some groceries etc but I'll be generally around to keep an eye on things
<daniels> have tried various bits of functionality and so far nothing has fallen over
<daniels> bentiss: thanks so much for all the migration! hope you enjoy the sunshine
<bentiss> daniels: thanks, have a nice afternoon too
<daniels> hrm, one thing when deleting gitaly nodes is that you have to somehow exclude the old one from the weight settings
<daniels> atm it's complaining that no-replicas is still in there
theh has joined #freedesktop
<daniels> k, fixed
<bentiss> cool, thanks
emanuele-f has joined #freedesktop
emanuele-f has quit [Remote host closed the connection]
i-garrison has joined #freedesktop
gandhiraj[m|gr] has joined #freedesktop
gandhiraj[m|gr] has quit [Remote host closed the connection]
ximion has joined #freedesktop
avar27 has joined #freedesktop
avar27 has quit [autokilled: Suspected spammer. Mail support@oftc.net with questions (2021-06-13 14:27:36)]
LOA_online has joined #freedesktop
LOA_online has quit [Remote host closed the connection]
LoKoMurdoK[m]1 has joined #freedesktop
LoKoMurdoK[m]1 has quit [Remote host closed the connection]
focus-u-f has joined #freedesktop
focus-u-f has quit [Remote host closed the connection]
pcglue has joined #freedesktop
pcglue has quit [autokilled: spambot. Dont mail support@oftc.net with questions (2021-06-13 18:16:07)]
Celphi has joined #freedesktop
Celphi has quit [Remote host closed the connection]
zooper_ has joined #freedesktop
zooper_ has quit [Remote host closed the connection]
buu____ has joined #freedesktop
buu____ has quit [Remote host closed the connection]
kmark has joined #freedesktop
kmark has quit [Remote host closed the connection]
<daniels> bentiss: weird, large-5 suddenly failing to do anything with 'container runtime is down'
<daniels> bentiss: ok, something has gone really really strange?
<daniels> bentiss: oh nm I just can't read, but did you do something with ufw at some point? k3s-large-5 can no longer connect to k3s-server-4 ...
chomwitt has quit [Ping timeout: 480 seconds]
<daniels> bentiss: I inserted some ufw rules on large-5 and server-4 to allow them to speak to each other (well, just all of 10.99.237.0/24) and all is well with the world
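The ufw fix amounts to allowing the node subnet on both machines, something like (run on both k3s-large-5 and k3s-server-4):
    ufw allow from 10.99.237.0/24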
Ultrasauce is now known as sauce
danvet has quit [Ping timeout: 480 seconds]
i-garrison has quit []