#freedesktop on 2022-06-12 — irc logs at oftc.irclog.whitequark.org

2022-03-22 11:57 ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org

00:06 immibis has quit [Remote host closed the connection]

00:28 Lyude has quit [Read error: Connection reset by peer]

00:30 Lyude has joined #freedesktop

00:49 ximion has joined #freedesktop

01:09 zredshift[m] has quit []

02:58 ximion has quit []

05:52 Rainer_Bielefeld_away has joined #freedesktop

06:29 alanc has quit [Remote host closed the connection]

06:29 alanc has joined #freedesktop

08:13 Rainer_Bielefeld_away has quit []

08:32 pohly has joined #freedesktop

08:37 pohly has quit []

08:38 Haaninjo has joined #freedesktop

10:22 ___nick___ has joined #freedesktop

11:01 ___nick___ has quit []

11:19 ___nick___ has joined #freedesktop

11:35 Rainer_Bielefeld_away has joined #freedesktop

11:52 Rainer_Bielefeld_away has quit []

12:12 <austriancoder> https://gitlab.freedesktop.org/ == 502

12:22 pixelcluster has joined #freedesktop

12:24 jstein has joined #freedesktop

12:31 vyivel_ is now known as vyivel

12:31 <vyivel> 504 over here

12:57 <austriancoder> now its a 504 :)

13:03 Rayyan has joined #freedesktop

13:04 <Rayyan> what happened to https://gitlab.freedesktop.org ?

13:18 eroux has joined #freedesktop

13:28 pobrn has joined #freedesktop

13:34 pobrn has quit [Quit: Konversation terminated!]

13:35 pobrn has joined #freedesktop

13:37 kisak has joined #freedesktop

13:38 <kisak> fwiw, gitlab.fd.o appears to be inaccessible here.

13:53 <Rayyan> kisak: yep

14:12 ximion has joined #freedesktop

14:13 Rainer_Bielefeld_away has joined #freedesktop

14:35 Sevenhill has joined #freedesktop

14:37 <daniels> trying to fix

14:57 <Sevenhill> is this will be whole day job or can we access it within a few hours ?

15:08 <kisak> troubleshooting servers doesn't work that way? There's never a way to give an ETA up front, only if something profoundly bad happens and extra people have to get involved.

15:12 <daniels> Sevenhill: I hope shortly

15:13 <daniels> bentiss: so I'm losing gitaly-2 with 'rbd image replicapool-ssd/csi-vol-fb66e2ed-d5f8-11ec-9c25-266a9a9a89cb is still being used' ... I think that may be a consequence of large-7 having gone down uncleanly

15:13 <Sevenhill> daniels: thank you for working on it

15:19 <daniels> Sevenhill: np

15:20 ___nick___ has quit []

15:22 ___nick___ has joined #freedesktop

15:23 ___nick___ has quit []

15:24 ___nick___ has joined #freedesktop

15:44 sylware has joined #freedesktop

15:58 <daniels> bentiss: it looks like osd-14 is unhealthy after all this - can you remind me again how to recover from this?

16:21 anholt_ has quit [Ping timeout: 480 seconds]

16:53 <bentiss> daniels: sorry I can't help you now. I'll have a look tonight when I come back home

16:53 <daniels> bentiss: no prob :) thanks

17:21 tinybronca[m] has joined #freedesktop

17:24 scrumplex_ has quit []

17:24 scrumplex has joined #freedesktop

17:42 pixelcluster has quit [Ping timeout: 480 seconds]

17:45 danvet has joined #freedesktop

17:54 sylware has quit [Remote host closed the connection]

17:54 greeb- has left #freedesktop [#freedesktop]

17:56 sravn has joined #freedesktop

17:57 Rainer_Bielefeld_away has quit [Remote host closed the connection]

18:15 <daniels> bentiss: btw, where we are now is that osd-14 (one of the replicapool-ssd from large-5) is dead and refusing to come back up; the rest are complaining because they're backfillfull

18:15 <daniels> my thinking is to nuke osd-14 and let it rebuild, but I'm not 100% sure how to do that non-destructively atm!

18:33 pixelcluster has joined #freedesktop

18:42 talisein has joined #freedesktop

18:54 pohly has joined #freedesktop

18:54 AbleBacon has joined #freedesktop

19:19 mihalycsaba has joined #freedesktop

19:22 <bentiss> daniels: back home, starting to look into this

19:22 mihalycsaba has quit []

19:22 mihalycsaba has joined #freedesktop

19:24 <bentiss> daniels: looks like osd 0 is also dead, which explains the backfillfull

19:27 <daniels> bentiss: oh right, I thought osd 0 being dead was already known

19:27 <bentiss> well, 2 ssds down is too much :(

19:28 <daniels> but OSD 0 is on server-3 and has 0 objects in it

19:28 <daniels> so I don't think it's any loss to the pool?

19:29 <daniels> (btw the toolbox pod wasn't working due to upstream changes, so I spun up rook-ceph-tools-daniels as a new pod)

19:29 <bentiss> we should have 2 ssds in each server-*, and we got only one in server-3, so we are missing one

19:29 <bentiss> and regarding loss, we should be OK IMO

19:30 <bentiss> but I'd rather first work on osd-0, then osd-14 (should be the same procedure)

19:31 * daniels nods

19:32 <bentiss> basically the process is: remove the OSD from the cluster, wipe the data on the disk, then remove the deployment on kubectl, then restart rook operator

19:32 <bentiss> I just need to find the link where it's all written down :)

19:32 <daniels> oh hmm, osd-0 is on server-3 which is throwing OOM every time it tries to spin up

19:32 <bentiss> oh. not good

19:33 <daniels> blah, that same assert fail: 'bluefs _allocate unable to allocate 0x400000 on bdev 1, allocator name block, allocator type hybrid, capacity 0x6fc8400000, block size 0x1000, free 0xb036b3000, fragmentation 0.429172, allocated 0x0'

19:33 <daniels> I wonder if this would be fixed in a newer version of ceph

19:34 <bentiss> maybe, but restoring the OSD works

19:34 <bentiss> give me a minute to find the proper link

19:35 <bentiss> to zap the disk: https://rathpc.github.io/rook.github.io/docs/rook/v1.4/ceph-teardown.html#zapping-devices

19:35 <bentiss> https://rook.github.io/docs/rook/v1.4/ceph-teardown.html#zapping-devices even (same data)

19:36 <daniels> yeah, loks like ti should be

19:36 <bentiss> and https://rook.github.io/docs/rook/v1.4/ceph-osd-mgmt.html#remove-an-osd

19:36 <daniels> thanks! I was looking at those and they seemed like the right thing to do, but I also didn't really want to find out in prod tbh :P

19:36 <bentiss> so it's a purge on the osd

19:37 <bentiss> heh

19:38 <bentiss> so, process is make sure osd is out and cluster is backfilled properly, then purge the osd, then zap the disk, then remove the osd deployment and operator restart

19:38 <bentiss> why the UI can not purge OSD-0???

19:41 <bentiss> purge is not working... maybe we can just zap the disk and destroy the deployment

19:49 <bentiss> server-3 is not responding to the various zap commands, rebooting it

19:51 marex has joined #freedesktop

19:58 pohly has quit []

20:06 <daniels> urgh …

20:08 ___nick___ has quit [Ping timeout: 480 seconds]

20:08 Consolatis has joined #freedesktop

20:12 Haaninjo has quit [Quit: Ex-Chat]

20:14 <bentiss> managed to remove the OSB from ceph

20:21 <bentiss> oh boy, zxapping the disks while the OSD was still known in the cluster was a bad idea... not bad in terms of we are doomed, but bad in terms of now I need to clean up the mess :/

20:34 <bentiss> daniels: so maybe we need to spin up a temporary new server-* so ceph goes back to a sane state

20:34 <bentiss> and we can the also nuke server-3 while we are at it and keep only the new one

20:35 <daniels> bentiss: oh right, so it can populate that with the new content, then we can kill the old one behind the scenes?

20:35 <daniels> right

20:35 <bentiss> yep

20:35 <daniels> that sounds good to me - I just need to go do some stuff but will be back in an hour if you need me for anything

20:35 <bentiss> k, I'll try to make the magic happen in the interim :)

20:35 <daniels> great, ty :)

20:36 <daniels> sorry, just back from holiday and still haven't unpacked etc!

20:36 <bentiss> no worries

20:39 <bentiss> damn... "422 ewr1 is not a valid facility"

21:02 <daniels> bentiss: yeah, have a look at gitlab-runner-packet.sh - it's all been changed around a fair bit

21:02 <bentiss> daniels: so far I was able to have a new server in ny with a few changes

21:02 <bentiss> hopefully it'll bind to the elastic ip

21:02 danvet has quit [Ping timeout: 480 seconds]

21:03 <daniels> nice :)

21:03 <bentiss> and worse case, we'll migrate our current servers to this new NY

21:05 <bentiss> huh, it can not contact the other control planes

21:09 <bentiss> seems to be working now

21:10 <bentiss> damn, smae error: RuntimeError: Unable to create a new OSD id

21:22 mihalycsaba has quit []

21:24 <bentiss> daniels: sigh, the new server doesn't even survive a reboot, it fails at finding the root

21:27 <bentiss> I guess my cloud-init script killed the root

21:35 <bentiss> FWIW, reinstalling it

21:48 <daniels> mmm yeah, you might want to look at the runner generate-cloud-init.py changes too

21:48 <daniels> particularly 89661c37ea2f0cef663e762a18f4a3a600f8356f

21:50 <bentiss> I think the issue was that jq was missing

21:52 <bentiss> re-ouch

21:55 <bentiss> daniels: currently upgrading k3s to the latest 1.20 or it is downgrading the new server which seems to make thinks not OK

22:01 Consolatis_ has joined #freedesktop

22:01 Consolatis is now known as Guest1893

22:01 Consolatis_ is now known as Consolatis

22:02 <daniels> yeah, the changes in the runner script should fix jq too

22:04 <bentiss> there is something else too, because I just added a new mount, made sure I formatted the correct disk, and reboot failed

22:04 <bentiss> so I am re-installing it

22:06 <bentiss> daniels: also, FYI there are 3 osd down without info, this is expected. I re-added them because we can not clean them up properly while we are backfil_full

22:07 agd5f has joined #freedesktop

22:07 agd5f has quit [Remote host closed the connection]

22:07 Guest1893 has quit [Ping timeout: 480 seconds]

22:08 agd5f has joined #freedesktop

22:14 <bentiss> giving up debian_10, I double checked that the server was using the proper uuid, the disk was correct, did nothing in cloud-init related to disk, and it doesn't survive reboot

22:23 mihalycsaba has joined #freedesktop

22:25 <bentiss> alright, debian_11 works way better

22:26 mihalycsaba has quit []

22:35 <daniels> yeah, debian_10 isn’t readily available for the machine types which are …

22:43 <bentiss> sigh, the route "10.0.0.0/8 via 10.66.151.2 dev bond0" is messing with our wireguard config

22:47 <bentiss> daniels: so I have cordoned server-5 because it clearly can not reliably talk to the control plane

22:48 <bentiss> daniels: `curl -v https://10.41.0.1` fails way too often, so I wonder what is the issue

22:58 Sevenhill has quit [Quit: Page closed]

23:06 <daniels> bentiss: hmmm really? it seems to work ok at least atm ...

23:06 <bentiss> it's way too late here, and I have to wake up in 6h to bring the kids at school, giving up for now, I'll continue tomorrow

23:06 <daniels> but I wonder

23:06 <daniels> 10.0.0.0/8 via 10.99.237.154 dev bond0

23:06 <daniels> 10.40.0.0/16 dev flannel.1 scope link

23:06 <daniels> oh no, nm

23:06 <bentiss> I changed the default route FWIW

23:06 <bentiss> and scoped it to 10.66.0.0/16

23:07 <bentiss> while it was 10.0.0.0/8

23:07 <daniels> yeah, it still has the /8 on -5

23:08 <bentiss> server-5???

23:08 <bentiss> or large-5?

23:08 <bentiss> because I do not see it on server-5

23:08 <daniels> sorry, wrong -5 :(

23:08 <daniels> I think it might be too late for me too tbh

23:08 <bentiss> OK, let's call it a day, and work on it tomorrow

23:09 <bentiss> cause we might do more harm than anything

23:10 <daniels> yeah ...

23:10 <daniels> I definitely need to understand more about how kilo is supposed to work as well

23:28 pobrn has quit [Ping timeout: 480 seconds]

23:37 GNUmoon has quit [Remote host closed the connection]

23:41 mihalycsaba has joined #freedesktop

23:44 mihalycsaba has quit []

23:44 jstein has quit []