FLHerne has quit [Quit: There's a real world out here!]
FLHerne has joined #freedesktop
ngcortes has quit [Remote host closed the connection]
blue__penquin has joined #freedesktop
adjtm has quit [Quit: Leaving]
blue__penquin has quit []
adjtm has joined #freedesktop
blue__penquin has joined #freedesktop
blue__penquin has quit []
aaronp has quit [Ping timeout: 480 seconds]
jarthur has quit [Ping timeout: 480 seconds]
blue__penquin has joined #freedesktop
ximion has quit []
blue__penquin has quit []
blue__penquin has joined #freedesktop
adjtm is now known as Guest257
adjtm has joined #freedesktop
Guest257 has quit [Ping timeout: 480 seconds]
hikiko has joined #freedesktop
alanc has quit [Remote host closed the connection]
danvet has joined #freedesktop
alanc has joined #freedesktop
<daniels> imirkin: kicking
<bentiss> daniels: I might have a lead on why storage is so unreliable
<bentiss> daniels: it seems large-1 is always the one causing I/O and network issues
<daniels> bentiss: the NFS?
<daniels> oh right, MinIO
<bentiss> daniels: minio
<daniels> huh
* bentiss just restarted k3s on large-1
<bentiss> there were a bunch of "Another app is currently holding the xtables lock; still 3s 100000us time ahead to have a chance to grab the lock.."
<bentiss> I wonder if k3s was not able to set up the iptables rules properly, and that slowed down all the traffic
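A minimal sketch of that check and restart, assuming the node runs the agent as the k3s-agent systemd unit and is reachable over SSH (hostname is a placeholder):

    # -w waits up to 5s for the xtables lock instead of failing immediately;
    # if this stalls, something else is still holding the lock
    ssh root@large-1 'iptables -L -n -w 5 >/dev/null && echo "xtables lock ok"'
    # restart the agent so it can reprogram its iptables rules, then look for
    # further xtables complaints in the agent log
    ssh root@large-1 'systemctl restart k3s-agent'
    ssh root@large-1 'journalctl -u k3s-agent --since "10 min ago" | grep -i xtables'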
<daniels> yeah, it might have been if I remember xtables right
<daniels> if not, can we easily get rid of it and spin up a new one?
<daniels> I might be able to do that today depending on how I feel
<daniels> (post-vaccine side effects, was definitely not a good idea to log into production machines yesterday)
<bentiss> daniels: in theory, yes, in practice, I never restarted a minio cluster on the fly
<daniels> heh
<daniels> btw I wonder should we merge packet-ha into the main branch?
<bentiss> hmm... crictl ps gives "connect: connect endpoint 'unix:///run/k3s/containerd/containerd.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded" not good
<bentiss> daniels: well, there are still a few bits that are not correct for the old cluster
<daniels> ack
<bentiss> but yes, packet-ha should be merged in the end
<bentiss> gitaly-0 is on large-1, but I can't drain it
<bentiss> so reboot I guess
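For reference, a sketch of the drain that was failing, assuming the node object is simply named large-1 (the real names elsewhere in this log look like fdo-k3s-large-4):

    # stop new pods landing on the node, then evict what is already there;
    # pods pinned to local (OpenEBS hostpath) volumes may refuse to move,
    # which is why the drain can fail and a reboot ends up being the fallback
    kubectl cordon large-1
    kubectl drain large-1 --ignore-daemonsets --delete-emptydir-data --timeout=10m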
<daniels> yeesh, openebs failing to start due to timeout
<bentiss> daniels: yes, this is bad if we cannot remount the openebs disk :(
<daniels> I don't know why it timed out though
<bentiss> too many files
<bentiss> maybe, but that would be a bummer
<bentiss> I managed to migrate gitaly-0 on server-2, so at least gitlab should be OK
<bentiss> not sure why it hangs at "starting default target" after giving Ctrl-D
<daniels> blurgh
* bentiss wonders if we'd be better off spinning up a new large server and trying to rebuild the minio server
<daniels> if you want to start doing that, I could try to look at the old one? starting with rebooting into rescue OS to reset the root pw so we can actually get a shell
<bentiss> hmm, the only problem is that I'll have to delete the PVC
<bentiss> well, I should be able to mark it as retain, and deploy the new one
<bentiss> daniels: OK, go for it
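Marking a volume as retain can be done by patching the PV behind the PVC; the namespace and claim name below are placeholders:

    # look up the PV bound to the claim, then flip its reclaim policy so the
    # data survives deleting and recreating the PVC
    pv=$(kubectl get pvc -n minio-artifacts data-0-minio-artifacts-pool-0-0 -o jsonpath='{.spec.volumeName}')
    kubectl patch pv "$pv" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'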
<bentiss> daniels: I wonder if large-1 is not physically busted. The network bootup seems a little bit too long compared to the one I just spun up
<daniels> bentiss: nothing in the logs, just XFS running recovery
<bentiss> daniels: is containerd working? (crictl ps)
<daniels> bentiss: containerd isn't working yet, because I'm still in the emergency shell
<daniels> (recovery is still going ...)
<bentiss> oh, right
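Once the node is past the emergency shell, crictl can be pointed at k3s' containerd socket explicitly (path taken from the error quoted above):

    sudo crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock ps
    # or, equivalently, via the bundled wrapper
    sudo k3s crictl ps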
<bentiss> should we then migrate to large-4 and decommission large-1?
<daniels> yeah, do you know ... how we do that? :)
<bentiss> daniels: I would delete the pods and pvc on large-1, then wait for the operator to recreate them on large-4 and wait for minio to reconstruct itself
<bentiss> I should probably back them up first (the k8s resources I mean)
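A minimal sketch of that backup; the namespace is a placeholder:

    # dump the MinIO pods, PVCs and the cluster-wide PVs to YAML so they can be
    # inspected or restored if the recreation goes wrong
    kubectl get pod,pvc -n minio-artifacts -o yaml > minio-pods-pvcs.yaml
    kubectl get pv -o yaml > pvs.yaml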
<daniels> yeah, I think this disk is dead
<daniels> well, one of the many disks
<daniels> mount is now stuck trying to acquire a rwsem for write
<daniels> and was only ever reading at 16MB/sec ...
<bentiss> right, so maybe not a good thing to keep it alive
<bentiss> I was checking the SMART data yesterday, and it was not showing a lot of failures (i.e. in the 'OK' range)
<daniels> oh hey, it's back now
<daniels> just took ~40min to run through recovery
<daniels> ouch
<daniels> we might need to tune xfs params :P
<daniels> anyway, do you want me to bring it back up _without_ k3s, so we can get the raw data over to another machine?
<daniels> or bring it back with k3s if openebs is disabled?
<daniels> or ... ?
<bentiss> daniels: don't know what is best
<bentiss> right now, the gitlab pages bucket seems almost empty
<bentiss> so maybe getting back the data would be nice
* daniels nods
<daniels> how about I disable k3s-agent, boot to the full system, and then at least we have SSH :P
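A sketch of that, assuming the agent runs as the k3s-agent unit and this is done from the emergency shell before continuing boot:

    systemctl disable --now k3s-agent   # keep the node out of the cluster on boot
    systemctl default                   # continue to the normal multi-user system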
<bentiss> daniels: sounds good
<bentiss> I am under the impression that the data is not correctly spread across the cluster, and that all the pages are on large-1...
<daniels> bentiss: there we go
<bentiss> daniels: yep, on it
<bentiss> daniels: the most urgent thing is to restore the pages
<daniels> to another node?
<daniels> because MinIO should handle the replication (shard across the larges) itself, and OpenEBS is only backing directly on to per-node storage, right?
<bentiss> daniels: correct, but I am under the impression that the gitlab-pages data is all on large-1
<bentiss> on the 2 nodes there
<bentiss> 2 pods, sorry
<daniels> ack
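One way to check where the pages data actually sits, since OpenEBS hostpath PVs are pinned to a single node via node affinity (the namespace is a placeholder):

    kubectl get pvc -n minio-artifacts
    # each local PV records the node it is bound to in its node affinity
    kubectl describe pv | grep -E '^Name:|hostname in'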
<bentiss> hmm, I tried rsync-ing one gitlab-pages item and it doesn't show up in minio :(
<daniels> should I just bring k3s up so we have the openebs + minio services on large-1, and we can figure out the sharding from there?
Seirdy has quit [Quit: exiting 3.2-dev]
<daniels> because I would guess that minio has to know about every shard
<daniels> the other option is that we rsync the underlying data across from large-1 to large-4, then bring large-4 up claiming to be large-1
blue__penquin has quit [Remote host closed the connection]
Seirdy has joined #freedesktop
blue__penquin has joined #freedesktop
<bentiss> well, don't know which option will work TBH
<bentiss> I just synced the various pvcs for just the pipewire pages (ensuring each pvc is correctly mapped to its new one), but the repo doesn't show up in minio
<bentiss> and healing doesn't help
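For reference, a guess at what the manual heal looks like with mc (alias and bucket names are placeholders); heal only reconciles objects the cluster already tracks, which might be why data copied straight onto the backend disks never shows up:

    # scan the bucket recursively and repair any shards MinIO knows about
    mc admin heal -r artifacts/gitlab-pages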
<daniels> yeah, I guess the internal db will be inconsistent
<daniels> I'd be inclined to try to bring large-1 back in so it's seen as healthy, then figure out from there how we can rotate it out
<bentiss> ok
<daniels> but I also don't know MinIO/OpenEBS nearly as well as you, and only have half my brain at most right now, so ... :P
<bentiss> well, openebs is simply a dynamic provisioner of host-local storage
<daniels> right
<bentiss> so once the volume is bound, you can forget it
<bentiss> minio is a different beast
<daniels> but MinIO indeed needs to know the replication status
<bentiss> yeah, it should
<daniels> ok, so going to start k3s-agent on large-1 unless you scream
<bentiss> ok
* bentiss wonders if the shard db is not simply in the pvc
<bentiss> there is a .minio.sys
<bentiss> daniels: I have: 1. cordoned fdo-k3s-large-4 so the pod doesn't automatically start, 2. deleted the 2 minio pods on large-4 -> they are now in pending, 3. rsync all the .minio.sys folders from large-1 to large-4, 4. doing the same for the pages bucket (which is smaller than the entire artifacts one)
<daniels> yep, and large-1 doesn't appear to be running anything atm
<bentiss> the idea is to then uncordon large-4, spin it up and it should believe it is in the same state large-1 was
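The same plan as a shell sketch; the pod names, namespace and the OpenEBS hostpath base directory (/var/openebs/local by default) are assumptions:

    kubectl cordon fdo-k3s-large-4
    kubectl delete pod -n minio-artifacts minio-artifacts-pool-0-2 minio-artifacts-pool-0-3   # left Pending while cordoned
    # copy MinIO's internal metadata and the pages bucket onto large-4's volumes
    rsync -aHAX large-1:/var/openebs/local/<pvc>/.minio.sys/ large-4:/var/openebs/local/<pvc>/.minio.sys/
    rsync -aHAX large-1:/var/openebs/local/<pvc>/gitlab-pages/ large-4:/var/openebs/local/<pvc>/gitlab-pages/
    kubectl uncordon fdo-k3s-large-4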
<bentiss> daniels: large-1 appears as ready (though cordoned)
blue__penquin has quit [Remote host closed the connection]
<bentiss> pipewire.org is back!
blue__penquin has joined #freedesktop
<daniels> \o/
<bentiss> that's the only one FWIW
<bentiss> :)
<daniels> hehe
<daniels> ooi is there any reason to have volumesPerServer==4, when each server only has a single OpenEBS backend (which in turn shards out to multiple disks via RAID)?
<bentiss> daniels: I definitely read that it was required, but can't find the doc right now
<daniels> oh right, fair enough
<bentiss> it's a matter of redundancy
<daniels> The total number of volumes in the Pool must be greater than 4. Specifically:
<daniels> servers X volumesPerServer > 4
<daniels> right, the redundancy makes sense in terms of having multiple volumes to shard through and keep redundancy
<daniels> but if we have 4 PV(C)s all backing on to the same actual OpenEBS storage, it doesn't give us much?
<daniels> so if we have 4 servers + 1 volume per server, then we still have 4 volumes total
<bentiss> maybe?
<bentiss> that would simplify things
<bentiss> daniels: so it doesn't seem to pick up the other pages
<bentiss> daniels: I guess pipewire is back because a pipeline was run :(
<daniels> bentiss: '2.2 MiB Used, 4 Buckets, 2 Objects'
<daniels> (from mc admin info on the artifacts svc)
<bentiss> ouch
<bentiss> well, the data still seems to be on disk
<bentiss> daniels: guess what? we don't have pages backups :(
<daniels> bentiss: c'est la vie
<daniels> I can just walk through all the repos with pages enabled and trigger pipeline runs for them
<bentiss> daniels: that would be nice
* daniels cracks knuckles, opens Rails
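An alternative to the Rails console would be driving it through the REST API; this sketch assumes a token with api scope and simply re-runs the default-branch pipeline for every project (pagination beyond the first page and filtering for pages-enabled projects are left out):

    curl -s -H "PRIVATE-TOKEN: $TOKEN" \
      "https://gitlab.freedesktop.org/api/v4/projects?per_page=100" |
      jq -r '.[] | "\(.id) \(.default_branch)"' |
      while read -r id ref; do
        # POST /projects/:id/pipeline creates and runs a new pipeline on $ref
        curl -s -X POST -H "PRIVATE-TOKEN: $TOKEN" \
          "https://gitlab.freedesktop.org/api/v4/projects/$id/pipeline?ref=$ref" >/dev/null
      done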
<bentiss> the thing that worries me is that I can no longer log in to the minio console
<bentiss> maybe we should empty large-4 first
<bentiss> daniels: before hitting run, I'll try to empty the large-4 disks so we start from blank again
<bentiss> hmm, that was not a good idea
<bentiss> daniels: giving up for now and relying on you for the next couple of hours
* bentiss <- lunch break
<bentiss> daniels: I wonder if we should not have a separate minio for pages
<bentiss> it should be much quicker to heal
<daniels> bentiss: hmm, so we should start from a completely new MinIO cluster, or ... ?
* daniels confused
<daniels> (sorry, delivery)
ximion has joined #freedesktop
<bentiss> daniels: no worries. we can always set up a new minio later
<daniels> bentiss: right, I just mean that ... should we reuse the current MinIO cluster with new PVCs (but keep the old PVs) now?
<bentiss> I also wonder if we should not simply use 2 plain minios with server replication
<bentiss> daniels: up to you
<bentiss> also, we should still have the old minio data on the old cluster. server-2 should be configured to access both. a mc mirror should do the trick
<bentiss> (sorry on the phone)
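A sketch of that mirror, assuming aliases for both clusters are set up on server-2 (endpoints, credentials and the bucket name are placeholders):

    mc alias set old https://old-minio.example "$OLD_ACCESS_KEY" "$OLD_SECRET_KEY"
    mc alias set artifacts https://minio-artifacts.example "$NEW_ACCESS_KEY" "$NEW_SECRET_KEY"
    # copy bucket contents, preserving object attributes where possible
    mc mirror --preserve old/gitlab-pages artifacts/gitlab-pages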
<daniels> ok, so we keep the existing minio-artifacts operator, start with new PVCs (keeping the old PVs), then mc mirror from the old cluster -> minio-artifacts?
<bentiss> for pages? or the entire artifacts?
<bentiss> I don't think we need to clear the pvc btw
<bentiss> it was working before I left for lunch
<daniels> sorry, I'm just confused about what we're doing with large-4 disks and the cluster ... if we're just recreating pages then I'll go ahead and do that, but I'd thought from what you said that you wanted to do something to the MinIO cluster before we started repopulating it
blue__penquin has quit []
ximion has quit []
<bentiss> If you are motivated enough, maybe add a new tenant in the minio-tenant chart, then we mirror from the old cluster to this new one
<bentiss> sorry should be back in front of my computer in ~45 min
thaytan has joined #freedesktop
<daniels> ok, gotcha, let me see if I can do that
<pq> https://wayland.freedesktop.org/ is 404 - expected?
<bentiss> pq: yes, see all the backlog since this morning :(
<pq> unfortunately I can't decipher what all that means, but thanks
<bentiss> pq: we had one server having networking issues
<bentiss> we thought we could just reboot it, but it basically broke the entire artifacts/pages file servers
<bentiss> daniels: any progress on setting up the new minio? (I am now in front of a computer)
<pq> If you want to avoid stupid questions, maybe /topic with known issues would help. :-)
<bentiss> daniels: ^^
<pq> or is there perhaps a public web page showing service status already?
<bentiss> pq: not sure if there is
<bentiss> daniels: deploying a new minio-pages for pages
<daniels> bentiss: sorry, got dragged away by IRL things :(
<bentiss> daniels: no worries
<daniels> so I'll go on to regenerating pages if you're deploying the new host? or hold off on that
<bentiss> daniels: hold off, I am almost ready for pages
<bentiss> daniels: alright, we might need to respin the pages pipelines
<bentiss> I've rsynced back to where we were last week, and wayland.fd.o is still 404
pv has quit [Quit: Quit]
bittin^ has joined #freedesktop
sumits has joined #freedesktop
<bentiss> daniels: it is somehow working. I managed to redeploy wayland.fd.o by running the pipeline
<bentiss> but I really should get rid of the old cluster
<bentiss> daniels: I think I am done for today (I was supposed to be off BTW) -> all custom pages are up except poppler.fd.o and docs.mesa3d.org
sumits has quit [Ping timeout: 480 seconds]
bittin^ has left #freedesktop [#freedesktop]
bittin^_ has joined #freedesktop
aaronp has joined #freedesktop
hir0pro has joined #freedesktop
<bentiss> so far only one 500 error at 15:16 (3 hours ago)
<bentiss> looks like this server was indeed busted
bittin^_ has quit [Remote host closed the connection]
aaronp has quit [Remote host closed the connection]
aaronp has joined #freedesktop
ngcortes has joined #freedesktop
alanc has quit [Quit: Leaving]
alanc has joined #freedesktop
alanc has quit []
alanc has joined #freedesktop
hir0pro has quit [Ping timeout: 480 seconds]
thaytan has quit [Ping timeout: 480 seconds]
thaytan has joined #freedesktop
danvet has quit [Ping timeout: 480 seconds]
V has quit [Remote host closed the connection]
V has joined #freedesktop
ngcortes has quit [Remote host closed the connection]
ximion has joined #freedesktop