FLHerne has quit [Quit: There's a real world out here!]
FLHerne has joined #freedesktop
ngcortes has quit [Remote host closed the connection]
blue__penquin has joined #freedesktop
adjtm has quit [Quit: Leaving]
blue__penquin has quit []
adjtm has joined #freedesktop
blue__penquin has joined #freedesktop
blue__penquin has quit []
aaronp has quit [Ping timeout: 480 seconds]
jarthur has quit [Ping timeout: 480 seconds]
blue__penquin has joined #freedesktop
ximion has quit []
blue__penquin has quit []
blue__penquin has joined #freedesktop
adjtm is now known as Guest257
adjtm has joined #freedesktop
Guest257 has quit [Ping timeout: 480 seconds]
hikiko has joined #freedesktop
alanc has quit [Remote host closed the connection]
danvet has joined #freedesktop
alanc has joined #freedesktop
<daniels>
imirkin: kicking
<bentiss>
daniels: I might have a lead on why storage is so unreliable
<bentiss>
daniels: it seems large-1 is always causing IO/network issues
<daniels>
bentiss: the NFS?
<daniels>
oh right, MinIO
<bentiss>
daniels: minio
<daniels>
huh
* bentiss
just restarted k3s on large-1
<bentiss>
there were a bunch of "Another app is currently holding the xtables lock; still 3s 100000us time ahead to have a chance to grab the lock.."
<bentiss>
I wonder if k3s was not able to set up the iptables rules properly, and that slowed down all the traffic
<daniels>
yeah, it might have been if I remember xtables right
<daniels>
if not, can we easily get rid of it and spin up a new one?
<daniels>
I might be able to do that today depending on how I feel
<daniels>
(post-vaccine side effects, was definitely not a good idea to log into production machines yesterday)
<bentiss>
daniels: in theory, yes; in practice, I've never restarted a minio cluster on the fly
<daniels>
heh
<daniels>
btw, I wonder whether we should merge packet-ha into the main branch?
<bentiss>
hmm... crictl ps gives "connect: connect endpoint 'unix:///run/k3s/containerd/containerd.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded" not good
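A rough way to check whether k3s's embedded containerd is actually up on a node (a sketch; it assumes an agent node, so the systemd unit is k3s-agent, and uses the default k3s socket path quoted in the error above):
    systemctl status k3s-agent
    crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock ps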
<bentiss>
daniels: well, there are still a few bits that are not correct for the old cluster
<daniels>
ack
<bentiss>
but yes, packet-ha should be merged in the end
<bentiss>
gitaly-0 is on large-1, but I can't drain it
<bentiss>
so reboot I guess
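The drain that was attempted would look roughly like this (a sketch; the node name fdo-k3s-large-1 is assumed from the fdo-k3s-large-4 naming used later):
    kubectl drain fdo-k3s-large-1 --ignore-daemonsets
    # if the drain hangs on a pod that won't evict, cordon the node and reboot instead
    kubectl cordon fdo-k3s-large-1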
<daniels>
yeesh, openebs failing to start due to timeout
<bentiss>
daniels: yes, this is bad if we cannot remount the openebs disk :(
<daniels>
I don't know why it timed out though
<bentiss>
too many files
<bentiss>
maybe, but that would be a bummer
<bentiss>
I managed to migrate gitaly-0 to server-2, so at least gitlab should be OK
<bentiss>
not sure why it hangs at "starting default target" after giving Ctrl-D
<daniels>
blurgh
* bentiss
wonders whether we should just spin up a new large server and try to rebuild the minio server
<daniels>
if you want to start doing that, I could try to look at the old one? starting with rebooting into rescue OS to reset the root pw so we can actually get a shell
<bentiss>
hmm, the only problem is that I'll have to delete the PVC
<bentiss>
well, I should be able to mark it as retain, and deploy the new one
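Marking the volume as retained before touching the PVC is roughly (a sketch; the PV name is a placeholder):
    kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
With the policy set to Retain, deleting the PVC leaves the underlying data in place.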
<bentiss>
daniels: OK, go for it
<bentiss>
daniels: I wonder if large-1 is not physically busted. The network boot-up seems a little bit too long compared to the one I just spun up
<daniels>
bentiss: nothing in the logs, just XFS running recovery
<bentiss>
daniels: is containerd working? (crictl ps)
<daniels>
bentiss: containerd isn't working yet, because I'm still in the emergency shell
<daniels>
(recovery is still going ...)
<bentiss>
oh, right
<bentiss>
should we then migrate to large-4 and decommission large-1?
<daniels>
yeah, do you know ... how we do that? :)
<bentiss>
daniels: I would delete the pods and pvc on large-1, then wait for the operator to recreate them on large-4 and wait for minio to reconstruct itself
<bentiss>
I should probably back them up first (the k8s resources I mean)
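That plan could look roughly like this (a sketch; the namespace, pod and PVC names are placeholders):
    # back up the k8s resources first
    kubectl get pod,pvc -n minio-artifacts -o yaml > minio-large-1-backup.yaml
    # then delete the pods and PVCs pinned to large-1 so the operator recreates them on large-4
    kubectl delete pod -n minio-artifacts <minio-pod-on-large-1>
    kubectl delete pvc -n minio-artifacts <pvc-on-large-1>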
<daniels>
yeah, I think this disk is dead
<daniels>
well, one of the many disks
<daniels>
mount is now stuck trying to acquire a rwsem for write
<daniels>
and was only ever reading at 16MB/sec ...
<bentiss>
right, so maybe not a good thing to keep it alive
<bentiss>
I was checking the SMART data yesterday, and it was not showing a lot of failures (i.e. in the 'OK' range)
<daniels>
oh hey, it's back now
<daniels>
just took ~40min to run through recovery
<daniels>
ouch
<daniels>
we might need to tune xfs params :P
<daniels>
anyway, do you want me to bring it back up _without_ k3s, so we can get the raw data over to another machine?
<daniels>
or bring it back with k3s if openebs is disabled?
<daniels>
or ... ?
<bentiss>
daniels: don't know what is best
<bentiss>
right now, the gitlab pages bucket seems almost empty
<bentiss>
so maybe getting back the data would be nice
* daniels
nods
<daniels>
how about I disable k3s-agent, boot to the full system, and then at least we have SSH :P
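Disabling the agent so the node boots to the full system without rejoining the cluster is roughly (a sketch; it assumes large-1 runs the k3s-agent systemd unit):
    systemctl disable k3s-agent
    reboot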
<bentiss>
daniels: sounds good
<bentiss>
I am under the impression that the data is not correctly spread across the cluster, and that all the pages are on large-1...
<daniels>
bentiss: there we go
<bentiss>
daniels: yep, on it
<bentiss>
daniels: the most urgent thing is to restore the pages
<daniels>
to another node?
<daniels>
because MinIO should handle the replication (shard across the larges) itself, and OpenEBS is only backing directly on to per-node storage, right?
<bentiss>
daniels: correct, but I am under the impression that the gitlab-pages data is all on large-1
<bentiss>
on the 2 nodes there
<bentiss>
2 pods, sorry
<daniels>
ack
<bentiss>
hmm, I tried rsync-ing one gitlab-pages item and it doesn't show up in minio :(
<daniels>
should I just bring k3s up so we have the openebs + minio services on large-1, and we can figure out the sharding from there?
Seirdy has quit [Quit: exiting 3.2-dev]
<daniels>
because I would guess that minio has to know about every shard
<daniels>
the other option is that we rsync the underlying data across from large-1 to large-4, then bring large-4 up claiming to be large-1
blue__penquin has quit [Remote host closed the connection]
Seirdy has joined #freedesktop
blue__penquin has joined #freedesktop
<bentiss>
well, don't know which option will work TBH
<bentiss>
I just synced the various pvcs for just the pipewire pages (ensuring each pvc is correctly mapped to its new one), but the repo doesn't show up in minio
<bentiss>
and healing doesn't help
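For reference, triggering a heal by hand looks roughly like this (a sketch; the alias and bucket name are placeholders):
    mc admin heal -r <alias>/<pages-bucket>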
<daniels>
yeah, I guess the internal db will be inconsistent
<daniels>
I'd be inclined to try to bring large-1 back in so it's seen as healthy, then figure out from there how we can rotate it out
<bentiss>
ok
<daniels>
but I also don't know MinIO/OpenEBS nearly as well as you, and only have half my brain at most right now, so ... :P
<bentiss>
well, openebs is simply a dynamic provisioner of host-local volumes
<daniels>
right
<bentiss>
so once the volume is bound, you can forget it
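That is, OpenEBS local PV is just a StorageClass whose provisioner carves volumes out of a host directory, roughly like the stock openebs-hostpath class (a sketch; defaults may differ from what the cluster actually runs):
    kubectl apply -f - <<'EOF'
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: openebs-hostpath
    provisioner: openebs.io/local
    volumeBindingMode: WaitForFirstConsumer
    reclaimPolicy: Delete
    EOF
Once a PV is bound from it, the data simply lives under a directory on that node (by default /var/openebs/local).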
<bentiss>
minio is a different beast
<daniels>
but MinIO indeed needs to know the replication status
<bentiss>
yeah, it should
<daniels>
ok, so going to start k3s-agent on large-1 unless you scream
<bentiss>
ok
* bentiss
wonder if the shard db is not simply in the pvc
<bentiss>
there is a .minio.sys
<bentiss>
daniels: I have: 1. cordoned fdo-k3s-large-4 so the pods don't automatically restart, 2. deleted the 2 minio pods on large-4 -> they are now in pending, 3. rsynced all the .minio.sys folders from large-1 to large-4, 4. am doing the same for the pages bucket (which is smaller than the entire artifacts one)
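In commands, that sequence is roughly (a sketch; the namespace, pod names, PVC directories and the pages bucket name are placeholders, and /var/openebs/local is the OpenEBS hostpath default):
    kubectl cordon fdo-k3s-large-4
    kubectl delete pod -n minio-artifacts <minio-pod-a> <minio-pod-b>   # they stay Pending while large-4 is cordoned
    rsync -aHAX large-1:/var/openebs/local/<pvc-dir>/.minio.sys/ large-4:/var/openebs/local/<pvc-dir>/.minio.sys/
    rsync -aHAX large-1:/var/openebs/local/<pvc-dir>/<pages-bucket>/ large-4:/var/openebs/local/<pvc-dir>/<pages-bucket>/
    # later: kubectl uncordon fdo-k3s-large-4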
<daniels>
yep, and large-1 doesn't appear to be running anything atm
<bentiss>
the idea is to then uncordon large-4, spin it up, and it should believe it is in the same state large-1 was in
<bentiss>
daniels: large-1 appears as ready (though cordoned)
blue__penquin has quit [Remote host closed the connection]
<bentiss>
pipewire.org is back!
blue__penquin has joined #freedesktop
<daniels>
\o/
<bentiss>
that's the only one FWIW
<bentiss>
:)
<daniels>
hehe
<daniels>
ooi is there any reason to have volumesPerServer==4, when each server only has a single OpenEBS backend (which in turn shards out to multiple disks via RAID)?
<bentiss>
daniels: I definitely read that it was required, but can't find the doc right now
<daniels>
oh right, fair enough
<bentiss>
it's a matter of redundancy
<daniels>
The total number of volumes in the Pool must be greater than 4. Specifically:
<daniels>
servers X volumesPerServer > 4
<daniels>
right, the redundancy makes sense in terms of having multiple volumes to shard through and keep redundancy
<daniels>
but if we have 4 PV(C)s all backing on to the same actual OpenEBS storage, it doesn't give us much?
<daniels>
so if we have 4 servers + 1 volume per server, then we still have 4 volumes total
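For context, the knob in question sits on the Tenant pool, roughly like this excerpt (a sketch; field names follow the MinIO operator Tenant CRD, values are illustrative):
    spec:
      pools:
      - servers: 4
        volumesPerServer: 1           # 4 servers x 1 volume = 4 volumes total
        volumeClaimTemplate:
          spec:
            storageClassName: openebs-hostpath
            resources:
              requests:
                storage: 1Ti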
<bentiss>
maybe?
<bentiss>
that would simplify things
<bentiss>
daniels: so it doesn't seem to pick up the other pages
<bentiss>
daniels: I guess pipewire is back because a pipeline was run :(
<daniels>
(from mc admin info on the artifacts svc)
<bentiss>
ouch
<bentiss>
well, the data still seems to be on disk
<bentiss>
daniels: guess what? we don't have pages backups :(
<daniels>
bentiss: c'est la vie
<daniels>
I can just walk through all the repos with pages enabled and trigger pipeline runs for them
<bentiss>
daniels: that would be nice
* daniels
cracks knuckles, opens Rails
<bentiss>
the thing that worries me is that I can no longer log in to the minio console
<bentiss>
maybe we should empty large-4 first
<bentiss>
daniels: before hitting run, I'll try to empty the large-4 disks so we start from blank again
<bentiss>
hmm, that was not a good idea
<bentiss>
daniels: giving up for now and relying on you for the next couple of hours
* bentiss
<- lunch break
<bentiss>
daniels: I wonder if we should not have a separate minio for pages
<bentiss>
it should be much quicker to heal
<daniels>
bentiss: hmm, so we should start from a completely new MinIO cluster, or ... ?
* daniels
confused
<daniels>
(sorry, delivery)
ximion has joined #freedesktop
<bentiss>
daniels: no worries. we can always set up a new minio later
<daniels>
bentiss: right, I just mean that ... should we reuse the current MinIO cluster with new PVCs (but keep the old PVs) now?
<bentiss>
I also wonder if we shouldn't simply use 2 plain minios and use server replication
<bentiss>
daniels: up to you
<bentiss>
also, we should still have the old minio data on the old cluster. server-2 should be configured to access both. a mc mirror should do the trick
<bentiss>
(sorry on the phone)
<daniels>
ok, so we keep the existing minio-artifacts operator, start with new PVCs (keeping the old PVs), then mc mirror from the old cluster -> minio-artifacts?
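The mirror step would be roughly (a sketch; aliases, endpoints, credentials and bucket names are placeholders):
    mc alias set old-cluster https://<old-minio-endpoint> <ACCESS_KEY> <SECRET_KEY>
    mc alias set artifacts https://<minio-artifacts-endpoint> <ACCESS_KEY> <SECRET_KEY>
    mc mirror old-cluster/<pages-bucket> artifacts/<pages-bucket>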
<bentiss>
for pages? or the entire artifacts?
<bentiss>
I don't think we need to clear the pvc btw
<bentiss>
it was working before I left for lunch
<daniels>
sorry, I'm just confused about what we're doing with large-4 disks and the cluster ... if we're just recreating pages then I'll go ahead and do that, but I'd thought from what you said that you wanted to do something to the MinIO cluster before we started repopulating it
blue__penquin has quit []
ximion has quit []
<bentiss>
If you are motivated enough, maybe add a new tenant in the minio-tenant chart, then we mirror from the old cluster to this new one
<bentiss>
sorry should be back in front of my computer in ~45 min