ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
immibis has quit [Remote host closed the connection]
Lyude has quit [Read error: Connection reset by peer]
Lyude has joined #freedesktop
ximion has joined #freedesktop
zredshift[m] has quit []
ximion has quit []
Rainer_Bielefeld_away has joined #freedesktop
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
Rainer_Bielefeld_away has quit []
pohly has joined #freedesktop
pohly has quit []
Haaninjo has joined #freedesktop
___nick___ has joined #freedesktop
___nick___ has quit []
___nick___ has joined #freedesktop
Rainer_Bielefeld_away has joined #freedesktop
Rainer_Bielefeld_away has quit []
<austriancoder> https://gitlab.freedesktop.org/ == 502
pixelcluster has joined #freedesktop
jstein has joined #freedesktop
vyivel_ is now known as vyivel
<vyivel> 504 over here
<austriancoder> now it's a 504 :)
Rayyan has joined #freedesktop
<Rayyan> what happened to https://gitlab.freedesktop.org ?
eroux has joined #freedesktop
pobrn has joined #freedesktop
pobrn has quit [Quit: Konversation terminated!]
pobrn has joined #freedesktop
kisak has joined #freedesktop
<kisak> fwiw, gitlab.fd.o appears to be inaccessible here.
<Rayyan> kisak: yep
ximion has joined #freedesktop
Rainer_Bielefeld_away has joined #freedesktop
Sevenhill has joined #freedesktop
<daniels> trying to fix
<Sevenhill> is this going to be a whole-day job, or can we access it within a few hours?
<kisak> troubleshooting servers doesn't work that way. There's never a way to give an ETA up front; the only exception is when something profoundly bad happens and extra people have to get involved.
<daniels> Sevenhill: I hope shortly
<daniels> bentiss: so I'm losing gitaly-2 with 'rbd image replicapool-ssd/csi-vol-fb66e2ed-d5f8-11ec-9c25-266a9a9a89cb is still being used' ... I think that may be a consequence of large-7 having gone down uncleanly
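A minimal sketch, assuming the "is still being used" error comes from a stale RBD watcher left behind by the uncleanly stopped node; run from a Ceph toolbox pod, with the image name taken from the message above and a made-up client address:

    rbd status replicapool-ssd/csi-vol-fb66e2ed-d5f8-11ec-9c25-266a9a9a89cb   # shows any remaining watchers
    ceph osd blocklist add 10.66.151.23:0/1234567890                          # evict the stale watcher (example address)
    # older Ceph releases spell this subcommand "blacklist"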
<Sevenhill> daniels: thank you for working on it
<daniels> Sevenhill: np
___nick___ has quit []
___nick___ has joined #freedesktop
___nick___ has quit []
___nick___ has joined #freedesktop
sylware has joined #freedesktop
<daniels> bentiss: it looks like osd-14 is unhealthy after all this - can you remind me again how to recover from this?
anholt_ has quit [Ping timeout: 480 seconds]
<bentiss> daniels: sorry I can't help you now. I'll have a look tonight when I come back home
<daniels> bentiss: no prob :) thanks
tinybronca[m] has joined #freedesktop
scrumplex_ has quit []
scrumplex has joined #freedesktop
pixelcluster has quit [Ping timeout: 480 seconds]
danvet has joined #freedesktop
sylware has quit [Remote host closed the connection]
greeb- has left #freedesktop [#freedesktop]
sravn has joined #freedesktop
Rainer_Bielefeld_away has quit [Remote host closed the connection]
<daniels> bentiss: btw, where we are now is that osd-14 (one of the replicapool-ssd from large-5) is dead and refusing to come back up; the rest are complaining because they're backfillfull
<daniels> my thinking is to nuke osd-14 and let it rebuild, but I'm not 100% sure how to do that non-destructively atm!
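A rough sketch of the inspection commands typically run from the toolbox pod before deciding to rebuild an OSD, assuming a standard Ceph CLI:

    ceph status          # overall health, shows the backfillfull warnings
    ceph osd tree        # which OSDs are down/out, and on which host
    ceph osd df          # per-OSD utilisation, to see who is close to backfillfull
    ceph health detail   # which PGs are affected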
pixelcluster has joined #freedesktop
talisein has joined #freedesktop
pohly has joined #freedesktop
AbleBacon has joined #freedesktop
mihalycsaba has joined #freedesktop
<bentiss> daniels: back home, starting to look into this
mihalycsaba has quit []
mihalycsaba has joined #freedesktop
<bentiss> daniels: looks like osd 0 is also dead, which explains the backfillfull
<daniels> bentiss: oh right, I thought osd 0 being dead was already known
<bentiss> well, 2 ssds down is too much :(
<daniels> but OSD 0 is on server-3 and has 0 objects in it
<daniels> so I don't think it's any loss to the pool?
<daniels> (btw the toolbox pod wasn't working due to upstream changes, so I spun up rook-ceph-tools-daniels as a new pod)
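For reference, the upstream Rook toolbox is normally deployed from the example manifest rather than hand-built; a sketch assuming a Rook 1.x layout and the rook-ceph namespace (the manifest path and branch vary between releases):

    kubectl apply -f https://raw.githubusercontent.com/rook/rook/release-1.9/deploy/examples/toolbox.yaml
    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash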
<bentiss> we should have 2 ssds in each server-*, and we got only one in server-3, so we are missing one
<bentiss> and regarding loss, we should be OK IMO
<bentiss> but I'd rather first work on osd-0, then osd-14 (should be the same procedure)
* daniels nods
<bentiss> basically the process is: remove the OSD from the cluster, wipe the data on the disk, then remove the deployment on kubectl, then restart rook operator
<bentiss> I just need to find the link where it's all written down :)
<daniels> oh hmm, osd-0 is on server-3 which is throwing OOM every time it tries to spin up
<bentiss> oh. not good
<daniels> blah, that same assert fail: 'bluefs _allocate unable to allocate 0x400000 on bdev 1, allocator name block, allocator type hybrid, capacity 0x6fc8400000, block size 0x1000, free 0xb036b3000, fragmentation 0.429172, allocated 0x0'
<daniels> I wonder if this would be fixed in a newer version of ceph
<bentiss> maybe, but restoring the OSD works
<bentiss> give me a minute to find the proper link
<daniels> yeah, looks like it should be
<daniels> thanks! I was looking at those and they seemed like the right thing to do, but I also didn't really want to find out in prod tbh :P
<bentiss> so it's a purge on the osd
<bentiss> heh
<bentiss> so, the process is: make sure the OSD is out and the cluster is backfilled properly, then purge the OSD, then zap the disk, then remove the OSD deployment and restart the operator
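A sketch of that sequence, using osd-0 on server-3 as the example; the device path and the rook-ceph namespace are assumptions:

    # from the toolbox pod: take the OSD out, wait for backfill, then purge it
    ceph osd out osd.0
    ceph osd purge 0 --yes-i-really-mean-it
    # on server-3 itself (device path is a guess): wipe the old OSD disk
    ceph-volume lvm zap /dev/sdb --destroy
    # back on the cluster: drop the stale deployment and let the operator re-create the OSD
    kubectl -n rook-ceph delete deployment rook-ceph-osd-0
    kubectl -n rook-ceph rollout restart deployment rook-ceph-operator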
<bentiss> why can't the UI purge OSD-0???
<bentiss> purge is not working... maybe we can just zap the disk and destroy the deployment
<bentiss> server-3 is not responding to the various zap commands, rebooting it
marex has joined #freedesktop
pohly has quit []
<daniels> urgh …
___nick___ has quit [Ping timeout: 480 seconds]
Consolatis has joined #freedesktop
Haaninjo has quit [Quit: Ex-Chat]
<bentiss> managed to remove the OSD from ceph
<bentiss> oh boy, zapping the disks while the OSD was still known in the cluster was a bad idea... not bad as in we're doomed, but bad as in now I need to clean up the mess :/
<bentiss> daniels: so maybe we need to spin up a temporary new server-* so ceph goes back to a sane state
<bentiss> and we can then also nuke server-3 while we're at it and keep only the new one
<daniels> bentiss: oh right, so it can populate that with the new content, then we can kill the old one behind the scenes?
<daniels> right
<bentiss> yep
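A sketch of how the old node would then be retired once the replacement is healthy and its OSDs have been purged as above; the node name is from the conversation, and the drain flags assume a recent kubectl:

    kubectl cordon server-3
    kubectl drain server-3 --ignore-daemonsets --delete-emptydir-data
    kubectl delete node server-3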
<daniels> that sounds good to me - I just need to go do some stuff but will be back in an hour if you need me for anything
<bentiss> k, I'll try to make the magic happen in the interim :)
<daniels> great, ty :)
<daniels> sorry, just back from holiday and still haven't unpacked etc!
<bentiss> no worries
<bentiss> damn... "422 ewr1 is not a valid facility"
<daniels> bentiss: yeah, have a look at gitlab-runner-packet.sh - it's all been changed around a fair bit
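The 422 comes from the Packet / Equinix Metal API: the ewr1-style facilities were retired in favour of metros, so newer provisioning calls typically pass a metro instead. A hypothetical device-create request, with the project ID, plan, and hostname as placeholders:

    curl -s -X POST "https://api.equinix.com/metal/v1/projects/$PROJECT_ID/devices" \
         -H "X-Auth-Token: $METAL_TOKEN" -H "Content-Type: application/json" \
         -d '{"metro": "ny", "plan": "c3.medium.x86", "operating_system": "debian_11", "hostname": "server-6"}'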
<bentiss> daniels: so far I was able to bring up a new server in ny with a few changes
<bentiss> hopefully it'll bind to the elastic ip
danvet has quit [Ping timeout: 480 seconds]
<daniels> nice :)
<bentiss> and worst case, we'll migrate our current servers to this new NY
<bentiss> huh, it can not contact the other control planes
<bentiss> seems to be working now
<bentiss> damn, same error: RuntimeError: Unable to create a new OSD id
mihalycsaba has quit []
<bentiss> daniels: sigh, the new server doesn't even survive a reboot, it fails to find the root
<bentiss> I guess my cloud-init script killed the root
<bentiss> FWIW, reinstalling it
<daniels> mmm yeah, you might want to look at the runner generate-cloud-init.py changes too
<daniels> particularly 89661c37ea2f0cef663e762a18f4a3a600f8356f
<bentiss> I think the issue was that jq was missing
<bentiss> re-ouch
<bentiss> daniels: currently upgrading k3s to the latest 1.20, otherwise it downgrades the new server, which seems to make things not OK
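A minimal sketch of pinning the k3s install on the new node so it doesn't pull a mismatched version; the channel and version values are assumptions:

    curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=v1.20 sh -
    # or pin an exact release, e.g.:
    # curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.20.15+k3s1 sh -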
Consolatis_ has joined #freedesktop
Consolatis is now known as Guest1893
Consolatis_ is now known as Consolatis
<daniels> yeah, the changes in the runner script should fix jq too
<bentiss> there is something else too, because I just added a new mount, made sure I formatted the correct disk, and reboot failed
<bentiss> so I am re-installing it
<bentiss> daniels: also, FYI there are 3 OSDs down without info; this is expected. I re-added them because we cannot clean them up properly while we are backfillfull
agd5f has joined #freedesktop
agd5f has quit [Remote host closed the connection]
Guest1893 has quit [Ping timeout: 480 seconds]
agd5f has joined #freedesktop
<bentiss> giving up on debian_10: I double-checked that the server was using the proper uuid and the correct disk, did nothing disk-related in cloud-init, and it still doesn't survive a reboot
mihalycsaba has joined #freedesktop
<bentiss> alright, debian_11 works way better
mihalycsaba has quit []
<daniels> yeah, debian_10 isn’t readily available for the machine types which are …
<bentiss> sigh, the route "10.0.0.0/8 via 10.66.151.2 dev bond0" is messing with our wireguard config
<bentiss> daniels: so I have cordoned server-5 because it clearly can not reliably talk to the control plane
<bentiss> daniels: `curl -v https://10.41.0.1` fails way too often, so I wonder what is the issue
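For reference, 10.41.0.1 here would be the in-cluster API service address; a quick sketch of the checks implied above (the /healthz probe is just one way to test reachability):

    kubectl cordon server-5                      # keep new pods off the flaky node
    # from server-5 itself:
    curl -vk --max-time 5 https://10.41.0.1/healthz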
Sevenhill has quit [Quit: Page closed]
<daniels> bentiss: hmmm really? it seems to work ok at least atm ...
<bentiss> it's way too late here, and I have to wake up in 6h to take the kids to school; giving up for now, I'll continue tomorrow
<daniels> but I wonder
<daniels> 10.0.0.0/8 via 10.99.237.154 dev bond0
<daniels> 10.40.0.0/16 dev flannel.1 scope link
<daniels> oh no, nm
<bentiss> I changed the default route FWIW
<bentiss> and scoped it to 10.66.0.0/16
<bentiss> while it was 10.0.0.0/8
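A sketch of the route change described above, replacing the blanket 10.0.0.0/8 route over bond0 with one scoped to the host network so wireguard can own the rest of 10/8; the gateway is taken from the route quoted earlier and differs per host:

    ip route del 10.0.0.0/8 via 10.66.151.2 dev bond0
    ip route add 10.66.0.0/16 via 10.66.151.2 dev bond0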
<daniels> yeah, it still has the /8 on -5
<bentiss> server-5???
<bentiss> or large-5?
<bentiss> because I do not see it on server-5
<daniels> sorry, wrong -5 :(
<daniels> I think it might be too late for me too tbh
<bentiss> OK, let's call it a day, and work on it tomorrow
<bentiss> 'cause we might do more harm than good
<daniels> yeah ...
<daniels> I definitely need to understand more about how kilo is supposed to work as well
pobrn has quit [Ping timeout: 480 seconds]
GNUmoon has quit [Remote host closed the connection]
mihalycsaba has joined #freedesktop
mihalycsaba has quit []
jstein has quit []