<gitlab-bot>
Fdo issue 2 in fdo "Migration of the DB to packet-HA (tentative: June 13, 2021)" [Opened]
<bentiss>
daniels: how is the marge-bot transfer doing?
<daniels>
bentiss: it's already done, I just left the deploy at 0 replicas
<bentiss>
cool
<bentiss>
did you use velero?
<daniels>
I did! :)
<bentiss>
it's nice isn't it?
<daniels>
no PV data to transfer obviously as it's stateless, but yeah, it's really neat
<bentiss>
we should do a regular backup of the whole cluster, without the PVs
<bentiss>
in case I make the same mess I did a few weeks ago
<daniels>
new or old cluster?
<bentiss>
new cluster I was thinking
<bentiss>
I mean, we now have HA, so it's less critical
<bentiss>
but it would be nice to have I think
<daniels>
yeah, that sounds like a good idea & store a copy externally as well?
<bentiss>
yep
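A rough sketch of how such a backup could be scheduled with the velero CLI; the schedule name and cron expression here are illustrative, not the actual fdo config:
  # nightly backup of all API objects, skipping the volume data
  velero schedule create cluster-objects --schedule="0 3 * * *" --snapshot-volumes=false
  # an off-site copy can then be kept by mirroring the velero bucket elsewhere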
<bentiss>
38GB restored out of 55... I doubt we are going to hit the 11:00 UTC deadline
<daniels>
oh well :)
<daniels>
it would be nice if upstream would use pg_restore -j$LOTS to restore instead of feeding the statements in one by one ...
<bentiss>
does -j ensure the ordering?
<bentiss>
maybe I should try velero once again
* daniels
shrugs
<daniels>
tbh it's already got to 2/3rds, I don't think there's much point stopping and pivoting now
<daniels>
we can deal with another hour, it's a nice sunny Sunday
<daniels>
(it's 29C in London so I'm just assuming it has to be nice everywhere else as well, given how rarely this happens)
<daniels>
-j does indeed do correct ordering, but relies on the file being block-accessible
<daniels>
whereas rake does gzip -d | psql
<daniels>
so I think for now probably just leave it be, and next time we need to do a pg major upgrade we can do some experiments first to find the optimal way?
<daniels>
in particular -j offloads index building and does more batching, so that can help a lot with throughput
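For reference, a minimal sketch of the parallel path: -j needs a custom- or directory-format dump rather than plain SQL (file and database names here are assumptions), whereas the rake restore streams plain SQL:
  # dump in custom format so pg_restore can seek into it and parallelise
  pg_dump -Fc -f gitlabhq_production.pgdump gitlabhq_production
  # parallel restore; pg_restore resolves the dependency ordering itself
  pg_restore -j 8 -d gitlabhq_production gitlabhq_production.pgdump
  # versus the current rake path, fed in statement by statement:
  gzip -d -c database.sql.gz | psql gitlabhq_production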
<bentiss>
yeah, I am not planning on stopping the restore just now
<bentiss>
I still wonder why the backup failed with velero. on the old cluster, I just changed the sts so it has the proper size for the disk, maybe that's the reason
<bentiss>
29C here as well, I wonder how it is possible we get the same temperature...
<bentiss>
same result, velero failed :(
<daniels>
ooi where are you running velero from?
<bentiss>
started the command from my own desktop, but the restic/velero tasks are on packet (old-cluster)
<bentiss>
restic is running on the same node as postgresql
<bentiss>
so agent-2 here
<daniels>
ah right, how do you get to the logs on ceph?
<bentiss>
for restic, you can get them through k8s directly
<bentiss>
kubectl -n velero get PodVolumeBackups postgresql-2021-06-13-12-44-ntwnl -o yaml
<bentiss>
for velero normal logs: (on server-2) mc cat velero/velero-backups/backups/postgresql-2021-06-13-12-44/postgresql-2021-06-13-12-44-logs.gz | gunzip
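The velero CLI can pull the same information without reading the bucket directly, e.g.:
  velero backup logs postgresql-2021-06-13-12-44
  velero backup describe postgresql-2021-06-13-12-44 --details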
<daniels>
ah right, that makes sense
<bentiss>
though I can't find more logs than the one last line for restic :(
<bentiss>
the 'error running restic backup, stderr=: signal: killed' is apparently because the pod gets OOM killed, but the node still had free memory when I was checking while the backup was happening
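If it is the restic pod hitting its own memory limit rather than the node running out, bumping the limit on the DaemonSet is the usual workaround; a sketch assuming velero's default "restic" DaemonSet in the velero namespace:
  kubectl -n velero patch daemonset restic --patch \
    '{"spec":{"template":{"spec":{"containers":[{"name":"restic","resources":{"limits":{"memory":"2Gi"}}}]}}}}'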
<bentiss>
that's what I get now, so we must be past the previous step
<daniels>
yep!
<bentiss>
\o/
<bentiss>
FWIW, 1h25min left for sync-ing the uploads bucket
<daniels>
and the ALTER TABLE ci_builds moved about 2min ago
<bentiss>
though I interrupted it previously to prioritise uploads not attached to projects/issues
<bentiss>
so we should be fine switching over to ceph and wait for the sync to finish
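A background sync like that can be left running with mc mirror, roughly (the aliases and bucket names here are assumptions):
  # copy anything still missing and keep watching for new objects
  mc mirror --watch old-minio/uploads ceph/uploads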
<daniels>
right, so it's a post-insert addition of all the public keys
<bentiss>
*in the background
<daniels>
I'd expect ci_build_trace_chunks + ci_build_trace_sections to take the longest; ci_builds itself took around 3min, so probably around the same time for each gives us another ... half an hour, making a wild guess?
* bentiss
hopes no one is taking bets
<daniels>
bentiss: oooh, it's done ...
<daniels>
bentiss: but we need to point webservice at the new db
<bentiss>
daniels: I got everything ready locally
<bentiss>
first, we need to run gitlab-rake db:migrate according to the upgrade script
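With the chart deployment that runs from the task-runner/toolbox pod, roughly (namespace and deployment name assumed):
  kubectl -n gitlab exec -it deploy/gitlab-task-runner -- gitlab-rake db:migrate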
<daniels>
yep
<daniels>
I'm slightly out of date on omnibus btw; I have Grafana back at 7.4.5 instead of 8.0.0 and also I have minio-operator locally which isn't present in HA
<bentiss>
k, no worries
<daniels>
I can do db:migrate if you're still eating
<bentiss>
daniels: my terminal after the db restore
<bentiss>
the clear redis task failed but that is not a problem I think
<daniels>
yeah, that's fine, we should manually clear after bringing up the new one anyway
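The manual clear is just the corresponding rake task, run the same way (deployment name assumed):
  kubectl -n gitlab exec -it deploy/gitlab-task-runner -- gitlab-rake cache:clear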
<bentiss>
gitlab-rake db:migrate done
<daniels>
nice!
<daniels>
I'll scale psql + redis on old cluster down to 0
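A sketch of that scale-down on the old cluster (resource names assumed):
  kubectl -n gitlab scale statefulset/postgresql statefulset/redis --replicas=0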
<bentiss>
k
<daniels>
done
<bentiss>
running diff on the new deployment
<daniels>
btw, just checked and cache:clear is the last task invoked by the backup-utility, so I think we're not missing anything
<bentiss>
\o/
<bentiss>
I deleted the sts postgres (with cascade=orphan to keep the pod running), and running sync as we speak
<bentiss>
(got to remove it or the deployment would fail given that I changed the pvc request)
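For reference, the orphan delete looks roughly like this (StatefulSet name assumed); re-applying the chart then recreates it with the new PVC request and adopts the still-running pod:
  kubectl -n gitlab delete statefulset postgresql --cascade=orphan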
* daniels
knocks incessantly on the wooden table he's sitting at, wooden chair he's sitting in
* bentiss
has an ikea chair not made of wood....
<bentiss>
everything is up according to k3s
<bentiss>
pages is not...
* bentiss
restarts the 3 gitaly pods
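Roughly a rolling restart of the gitaly StatefulSet, or deleting the pods one by one (names assumed):
  kubectl -n gitlab rollout restart statefulset/gitlab-gitaly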
<bentiss>
pages is down because of consolidated storage
<bentiss>
FWIW
<bentiss>
daniels: pushed my changes to omnibus repo, so you can have a look
<bentiss>
apparently, pages not being up was just because it had not synced the replica count for pages (and sidekiq FWIW)
<bentiss>
daniels: are you happy if I re-enable the correct IP so we can get runners working?
<daniels>
awesome, I'm in sync now
<daniels>
pushed marge changes to -config and that worked fine
<daniels>
yeah, I think let's switch the IP and see what happens
<bentiss>
ok IP back up I think
<bentiss>
some jobs passed, so that's good
<daniels>
yep, I can see job logs and artifacts too
<bentiss>
\o/ pages deployment is working :)
<bentiss>
OK, so in 15 minutes the uploads sync will finish and we can call it a day?
<bentiss>
emails are flowing too :)
* bentiss
just started a new backup because I had to interrupt today's
<daniels>
yep, Sidekiq seems very happy, the Rake check tasks are happy too :)
<daniels>
I need to go out and get some groceries etc but I'll be generally around to keep an eye on things
<daniels>
have tried various bits of functionality and so far nothing has fallen over
<daniels>
bentiss: thanks so much for all the migration! hope you enjoy the sunshine
<bentiss>
daniels: thanks, have a nice afternoon too
<daniels>
hrm, one thing when deleting gitaly nodes is that you have to somehow exclude the old one from the weight settings
<daniels>
atm it's complaining that no-replicas is still in there
<daniels>
k, fixed
<bentiss>
cool, thanks
<daniels>
bentiss: weird, large-5 suddenly failing to do anything with 'container runtime is down'
<daniels>
bentiss: ok, something has gone really really strange?
<daniels>
bentiss: oh nm I just can't read, but did you do something with ufw at some point? k3s-large-5 can no longer connect to k3s-server-4 ...
<daniels>
bentiss: I inserted some ufw rules on large-5 and server-4 to allow them to speak to each other (well, just all of 10.99.237.0/24) and all is well with the world
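Such a rule would look roughly like this on each node, opening up the whole private subnet mentioned above:
  sudo ufw insert 1 allow from 10.99.237.0/24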