ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
immibis has quit [Remote host closed the connection]
Lyude has quit [Read error: Connection reset by peer]
Lyude has joined #freedesktop
ximion has joined #freedesktop
zredshift[m] has quit []
ximion has quit []
Rainer_Bielefeld_away has joined #freedesktop
alanc has quit [Remote host closed the connection]
<kisak>
fwiw, gitlab.fd.o appears to be inaccessible here.
<Rayyan>
kisak: yep
ximion has joined #freedesktop
Rainer_Bielefeld_away has joined #freedesktop
Sevenhill has joined #freedesktop
<daniels>
trying to fix
<Sevenhill>
will this be a whole-day job, or can we access it within a few hours?
<kisak>
troubleshooting servers doesn't work that way; there's never a way to give an ETA up front, only when something profoundly bad happens and extra people have to get involved.
<daniels>
Sevenhill: I hope shortly
<daniels>
bentiss: so I'm losing gitaly-2 with 'rbd image replicapool-ssd/csi-vol-fb66e2ed-d5f8-11ec-9c25-266a9a9a89cb is still being used' ... I think that may be a consequence of large-7 having gone down uncleanly
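(For reference, a rough sketch of how a stale RBD watcher is usually tracked down from a Ceph toolbox shell; the blocklist step is an assumed recovery path rather than a record of what was run here:
  # list the clients still watching the image named in the error above
  rbd status replicapool-ssd/csi-vol-fb66e2ed-d5f8-11ec-9c25-266a9a9a89cb
  # if the watcher belongs to the node that went down uncleanly, evicting it
  # releases the image; on older Ceph releases the subcommand is "blacklist"
  ceph osd blocklist add <watcher-address>
)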
<Sevenhill>
daniels: thank you for working on it
<daniels>
Sevenhill: np
___nick___ has quit []
___nick___ has joined #freedesktop
___nick___ has quit []
___nick___ has joined #freedesktop
sylware has joined #freedesktop
<daniels>
bentiss: it looks like osd-14 is unhealthy after all this - can you remind me again how to recover from this?
anholt_ has quit [Ping timeout: 480 seconds]
<bentiss>
daniels: sorry I can't help you now. I'll have a look tonight when I come back home
<daniels>
bentiss: no prob :) thanks
tinybronca[m] has joined #freedesktop
scrumplex_ has quit []
scrumplex has joined #freedesktop
pixelcluster has quit [Ping timeout: 480 seconds]
danvet has joined #freedesktop
sylware has quit [Remote host closed the connection]
greeb- has left #freedesktop [#freedesktop]
sravn has joined #freedesktop
Rainer_Bielefeld_away has quit [Remote host closed the connection]
<daniels>
bentiss: btw, where we are now is that osd-14 (one of the replicapool-ssd from large-5) is dead and refusing to come back up; the rest are complaining because they're backfillfull
<daniels>
my thinking is to nuke osd-14 and let it rebuild, but I'm not 100% sure how to do that non-destructively atm!
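(A sketch of the usual toolbox commands for sizing up this kind of state; illustrative, not a transcript of what was actually run:
  ceph health detail   # which OSDs are down and which pools/PGs report backfillfull
  ceph osd tree        # maps OSDs to hosts, e.g. which ones live on large-5 or server-3
  ceph osd df          # per-OSD utilisation, to see how close the SSDs are to the full ratios
)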
pixelcluster has joined #freedesktop
talisein has joined #freedesktop
pohly has joined #freedesktop
AbleBacon has joined #freedesktop
mihalycsaba has joined #freedesktop
<bentiss>
daniels: back home, starting to look into this
mihalycsaba has quit []
mihalycsaba has joined #freedesktop
<bentiss>
daniels: looks like osd 0 is also dead, which explains the backfillfull
<daniels>
bentiss: oh right, I thought osd 0 being dead was already known
<bentiss>
well, 2 ssds down is too much :(
<daniels>
but OSD 0 is on server-3 and has 0 objects in it
<daniels>
so I don't think it's any loss to the pool?
<daniels>
(btw the toolbox pod wasn't working due to upstream changes, so I spun up rook-ceph-tools-daniels as a new pod)
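(The standard Rook toolbox flow, for context; the rook-ceph namespace and the toolbox.yaml file name are assumptions, and here a one-off rook-ceph-tools-daniels pod was used instead:
  # apply the toolbox manifest shipped with the matching Rook release
  kubectl -n rook-ceph apply -f toolbox.yaml
  # get a shell with the ceph/rbd CLIs wired up to the cluster
  kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
)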
<bentiss>
we should have 2 SSDs in each server-*, and we've got only one in server-3, so we are missing one
<bentiss>
and regarding loss, we should be OK IMO
<bentiss>
but I'd rather first work on osd-0, then osd-14 (should be the same procedure)
* daniels
nods
<bentiss>
basically the process is: remove the OSD from the cluster, wipe the data on the disk, then remove the deployment via kubectl, then restart the rook operator
<bentiss>
I just need to find the link where it's all written down :)
<daniels>
oh hmm, osd-0 is on server-3 which is throwing OOM every time it tries to spin up
<bentiss>
oh. not good
<daniels>
blah, that same assert fail: 'bluefs _allocate unable to allocate 0x400000 on bdev 1, allocator name block, allocator type hybrid, capacity 0x6fc8400000, block size 0x1000, free 0xb036b3000, fragmentation 0.429172, allocated 0x0'
<daniels>
I wonder if this would be fixed in a newer version of ceph
<bentiss>
maybe, but restoring the OSD works
<bentiss>
give me a minute to find the proper link
<daniels>
thanks! I was looking at those and they seemed like the right thing to do, but I also didn't really want to find out in prod tbh :P
<bentiss>
so it's a purge on the osd
<bentiss>
heh
<bentiss>
so, the process is: make sure the OSD is out and the cluster is backfilled properly, then purge the OSD, then zap the disk, then remove the OSD deployment and restart the operator
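(A rough outline of that flow for osd.0, assuming the rook-ceph namespace; the device name is a placeholder and this is a sketch, not the exact commands used:
  ceph osd out osd.0                        # make sure the OSD is out and data has rebalanced
  ceph osd safe-to-destroy osd.0            # wait until the cluster says it is safe
  ceph osd purge 0 --yes-i-really-mean-it   # drop it from the OSD/CRUSH maps
  ceph-volume lvm zap /dev/<device> --destroy   # on the host: wipe the old OSD data
  kubectl -n rook-ceph delete deployment rook-ceph-osd-0
  kubectl -n rook-ceph rollout restart deployment rook-ceph-operator   # let rook re-provision the disk
)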
<bentiss>
why can't the UI purge OSD-0???
<bentiss>
purge is not working... maybe we can just zap the disk and destroy the deployment
<bentiss>
server-3 is not responding to the various zap commands, rebooting it
marex has joined #freedesktop
pohly has quit []
<daniels>
urgh …
___nick___ has quit [Ping timeout: 480 seconds]
Consolatis has joined #freedesktop
Haaninjo has quit [Quit: Ex-Chat]
<bentiss>
managed to remove the OSD from ceph
<bentiss>
oh boy, zapping the disks while the OSD was still known in the cluster was a bad idea... not bad as in we are doomed, but bad as in now I need to clean up the mess :/
<bentiss>
daniels: so maybe we need to spin up a temporary new server-* so ceph goes back to a sane state
<bentiss>
and we can then also nuke server-3 while we are at it and keep only the new one
<daniels>
bentiss: oh right, so it can populate that with the new content, then we can kill the old one behind the scenes?
<daniels>
right
<bentiss>
yep
<daniels>
that sounds good to me - I just need to go do some stuff but will be back in an hour if you need me for anything
<bentiss>
k, I'll try to make the magic happen in the interim :)
<daniels>
great, ty :)
<daniels>
sorry, just back from holiday and still haven't unpacked etc!
<bentiss>
no worries
<bentiss>
damn... "422 ewr1 is not a valid facility"
<daniels>
bentiss: yeah, have a look at gitlab-runner-packet.sh - it's all been changed around a fair bit
<bentiss>
daniels: so far I've been able to bring up a new server in NY with a few changes
<bentiss>
hopefully it'll bind to the elastic ip
danvet has quit [Ping timeout: 480 seconds]
<daniels>
nice :)
<bentiss>
and worst case, we'll migrate our current servers to this new NY facility
<bentiss>
huh, it can not contact the other control planes
<bentiss>
seems to be working now
<bentiss>
damn, same error: RuntimeError: Unable to create a new OSD id
mihalycsaba has quit []
<bentiss>
daniels: sigh, the new server doesn't even survive a reboot, it fails to find the root filesystem
<bentiss>
I guess my cloud-init script killed the root
<bentiss>
FWIW, reinstalling it
<daniels>
mmm yeah, you might want to look at the runner generate-cloud-init.py changes too
<bentiss>
I think the issue was that jq was missing
<bentiss>
re-ouch
<bentiss>
daniels: currently upgrading k3s to the latest 1.20, otherwise it downgrades the new server, which seems to make things not OK
Consolatis_ has joined #freedesktop
Consolatis is now known as Guest1893
Consolatis_ is now known as Consolatis
<daniels>
yeah, the changes in the runner script should fix jq too
<bentiss>
there is something else too, because I just added a new mount, made sure I formatted the correct disk, and reboot failed
<bentiss>
so I am re-installing it
<bentiss>
daniels: also, FYI there are 3 OSDs down without info; this is expected. I re-added them because we cannot clean them up properly while we are backfillfull
agd5f has joined #freedesktop
agd5f has quit [Remote host closed the connection]
Guest1893 has quit [Ping timeout: 480 seconds]
agd5f has joined #freedesktop
<bentiss>
giving up on debian_10: I double-checked that the server was using the proper UUID, the disk was correct, and I did nothing disk-related in cloud-init, and it still doesn't survive a reboot
mihalycsaba has joined #freedesktop
<bentiss>
alright, debian_11 works way better
mihalycsaba has quit []
<daniels>
yeah, debian_10 isn’t readily available for the machine types which are …
<bentiss>
sigh, the route "10.0.0.0/8 via 10.66.151.2 dev bond0" is messing with our wireguard config
<bentiss>
daniels: so I have cordoned server-5 because it clearly can not reliably talk to the control plane
<bentiss>
daniels: `curl -v https://10.41.0.1` fails way too often, so I wonder what is the issue
Sevenhill has quit [Quit: Page closed]
<daniels>
bentiss: hmmm really? it seems to work ok at least atm ...
<bentiss>
it's way too late here, and I have to wake up in 6h to take the kids to school; giving up for now, I'll continue tomorrow
<daniels>
but I wonder
<daniels>
10.0.0.0/8 via 10.99.237.154 dev bond0
<daniels>
10.40.0.0/16 dev flannel.1 scope link
<daniels>
oh no, nm
<bentiss>
I changed the default route FWIW
<bentiss>
and scoped it to 10.66.0.0/16
<bentiss>
while it was 10.0.0.0/8
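(The fix described above amounts to narrowing that static route, roughly as follows; the gateway is the one from the earlier message and will differ per server:
  ip route del 10.0.0.0/8 via 10.66.151.2 dev bond0
  ip route add 10.66.0.0/16 via 10.66.151.2 dev bond0
)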
<daniels>
yeah, it still has the /8 on -5
<bentiss>
server-5???
<bentiss>
or large-5?
<bentiss>
because I do not see it on server-5
<daniels>
sorry, wrong -5 :(
<daniels>
I think it might be too late for me too tbh
<bentiss>
OK, let's call it a day, and work on it tomorrow
<bentiss>
cause we might do more harm than anything
<daniels>
yeah ...
<daniels>
I definitely need to understand more about how kilo is supposed to work as well
pobrn has quit [Ping timeout: 480 seconds]
GNUmoon has quit [Remote host closed the connection]