#freedesktop on 2021-09-05 — irc logs at oftc.irclog.whitequark.org

2021-07-26 22:56 ChanServ changed the topic of #freedesktop to:

00:22 danvet has quit [Ping timeout: 480 seconds]

00:53 pendingchaos_ has joined #freedesktop

00:56 pendingchaos has quit [Ping timeout: 480 seconds]

00:57 Thymo has joined #freedesktop

00:58 Thymo_ has quit [Ping timeout: 480 seconds]

00:59 Thymo has quit [Remote host closed the connection]

01:00 Thymo has joined #freedesktop

01:02 pendingchaos_ has quit [Ping timeout: 480 seconds]

01:04 pendingchaos has joined #freedesktop

02:02 Seirdy has quit [Ping timeout: 480 seconds]

02:04 Seirdy has joined #freedesktop

02:23 jarthur has joined #freedesktop

02:59 jarthur has quit [Ping timeout: 480 seconds]

05:44 ximion has quit []

06:38 alanc has quit [Remote host closed the connection]

06:39 alanc has joined #freedesktop

09:10 <bentiss> daniels: FWIW, definitively, the 502 errors are happening when we are uploading/downloading the backups over ceph

09:11 <bentiss> there were almost no 502 errors for the past few days, but I realized this morning the backups were faulting for the past 3 days

09:11 <bentiss> I am pulling the last backup (from Sep 1), and 502s and OSDs with slow response times are happening a lot

09:34 <daniels> bentiss: I just noticed that k3s-large-6 is dead in the water, so I'm kicking it now

09:42 <bentiss> daniels: OK, thanks

09:44 <daniels> (k3s-agent fell down a hole and wasn't able to communicate anything to the primary, which I guess is not unrelated to the 300 loadavg and multiple OOM kills ...)

09:45 <bentiss> daniels: I am doing the pull of the backup on large-5, please do not kick that one until it's done

09:45 <daniels> bentiss: yep!

09:46 <daniels> (interestingly large-6 is currently logging connection failure to mon[02], 'connect error -101' which is ENETUNREACH and I guess means ... WireGuard dead?)
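
For reference, the errno mapping daniels cites can be checked with a minimal Python sketch. The helper name `describe_kernel_error` is hypothetical, and the numeric value 101 for ENETUNREACH is specific to Linux; other platforms may number it differently.

```python
import errno
import os


def describe_kernel_error(code: int) -> str:
    """Map a negative kernel-style return code (e.g. -101) to its errno name and message."""
    number = abs(code)
    # errno.errorcode maps numeric values to symbolic names on the current platform.
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name}: {os.strerror(number)}"


if __name__ == "__main__":
    # On Linux, -101 corresponds to ENETUNREACH.
    print(describe_kernel_error(-101))
```

Looking the code up this way only confirms which error the kernel reported; it does not say why the network was unreachable (the WireGuard guess in the log remains a guess).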

09:47 <bentiss> daniels: large-6 is still not booted

09:49 <daniels> yeah, coming through BIOS now

09:49 <daniels> it took a long time to shut down as Ceph spent all the time trying and failing to reach the other hosts

09:49 <bentiss> damn, the pull of the backup failed, and ceph in HEALTH_ERR

09:49 <daniels> hopefully it comes back to health after large-6 comes back and they become coherent again?

09:50 <bentiss> yep

09:50 <bentiss> right now all the large-5 osd are down too

09:52 <bentiss> sigh, we probably want to kick large-5 too

09:52 <daniels> :(

09:53 <daniels> large-6 is happy and healthy running pods now

09:53 <daniels> but they're all dying because postgres was on -5 :P

09:53 Nikky has joined #freedesktop

09:53 <bentiss> fdo-k3s-large-5 NotReady -> all the matching pods are terminating, is that you or k8s magic?

09:54 <daniels> not me!

09:54 <daniels> I've left -5 completely alone

09:54 <bentiss> k, so that's good to know that k8s can "heal" itself

09:55 <daniels> well, sort of ...

09:55 <bentiss> though I guess we should reboot the node because it can't remove the pods on the nodes it cannot contact

09:55 <daniels> the failure mode I've seen a couple of times now is that k3s declares the node unhealthy, but the agent stays in a weird halfway state

09:55 <daniels> right

09:55 <daniels> so they're stuck in Terminating until the node comes back up

09:55 <bentiss> k, rebooted large-5

09:56 <daniels> for whatever reason when this happens, k3s-agent is not totally inert but it does seem to completely lose the ability to communicate with the primary, and it also gets stuck with hanging containers

09:56 <daniels> a soft reboot does clean the containers up properly, give a few minutes for ceph to try and (often) fail to make things coherent, then at startup it's all happy again

09:57 <bentiss> k, waiting on this to happen then

09:57 <daniels> that's always (that I've seen) been correlated with a kernel soft-lockup in I/O (due to Ceph I guess) or an OOM kill, which I guess just leaves the wrong process unresponsive

09:57 Ai has joined #freedesktop

09:59 <bentiss> "machine restart" :)

10:00 <bentiss> daniels: BTW, I managed to trick gitaly regarding xrestop that was missing an object (by pushing it to a branch and then removing the branch)

10:01 <bentiss> the only repo that is failing now (and that prevents the backup from running) is https://gitlab.freedesktop.org/JKRhb/MUD-Files

10:01 <bentiss> maybe we should just clear that one, and ask the owner to push

10:04 <daniels> woohoo, first webservice pod back

10:04 <bentiss> daniels: I am willing to move postgresql to another node

10:05 <bentiss> and gitaly too FWIW

10:05 <bentiss> putting them on the server-*

10:06 <daniels> lol, I was wondering why they both suddenly just disappeared after having started successfully :P

10:06 <daniels> thanks

10:06 <bentiss> just in case the large* gets in limbo, that should prevent part of the 502, maybe

10:08 <daniels> ack

10:08 <daniels> let's find out :)

10:08 karthanistyr[m] has joined #freedesktop

10:12 Ai has quit []

10:12 Ai has joined #freedesktop

10:14 Nikky has quit []

10:14 Ai has quit []

10:14 Nikky has joined #freedesktop

10:15 <bentiss> daniels: what do we do about https://gitlab.freedesktop.org/JKRhb/MUD-Files ?

10:19 <daniels> bentiss: shrug, I think just email the author and say 'sorry'

10:19 <bentiss> sounds like a plan

10:22 <bentiss> k, I killed that project and restarted the backup

10:23 <daniels> thank you!

10:23 * daniels crosses everything

10:25 <bentiss> heh

13:54 <bentiss> daniels: FWIW, the backup manages to create the tar, and is now uploading, so fixing those 2 repos was helpful

14:22 <daniels> \o/

14:58 ximion has joined #freedesktop

16:25 thaller has joined #freedesktop

16:28 thaller_ has quit [Ping timeout: 480 seconds]

16:28 netbsduser`` has left #freedesktop [#freedesktop]

16:29 netbsduser has joined #freedesktop

16:29 <netbsduser> https://cgit.freedesktop.org/dbus/dbus/tree/tools/dbus-update-activation-environment.c#n98

16:30 <netbsduser> i'd like to send a patch to get rid of these repeated inappropriate uses of "#ifdef __linux__" when it should really be checking for HAVE_SYSTEMD - or maybe not even that

16:32 <netbsduser> as a systemd fork, my InitWare package provides these interfaces on BSD platforms, so the assumption that systemd == linux is invalid now

16:46 immibis has quit [Remote host closed the connection]

16:49 immibis has joined #freedesktop

16:54 <alanc> netbsduser: this channel is mostly about the fd.o infrastructure, not the individual projects - I suspect dbus maintainers are more likely found on the GNOME irc server, but they've posted info on submitting changes at https://gitlab.freedesktop.org/dbus/dbus/-/blob/master/CONTRIBUTING.md

17:33 bcarvalho__ has quit []

22:01 strugee has quit [Quit: ZNC - http://znc.in]

22:04 strugee has joined #freedesktop