Thymo has quit [Remote host closed the connection]
Thymo has joined #freedesktop
pendingchaos_ has quit [Ping timeout: 480 seconds]
pendingchaos has joined #freedesktop
Seirdy has quit [Ping timeout: 480 seconds]
Seirdy has joined #freedesktop
jarthur has joined #freedesktop
jarthur has quit [Ping timeout: 480 seconds]
ximion has quit []
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
<bentiss>
daniels: FWIW, definitively, the 502 errors are happening when we are uploading/downloading the backups over ceph
<bentiss>
there were almost no 502 errors for the past few days, but I realized this morning that the backups had been failing for the past 3 days
<bentiss>
I am pulling the last backup (from Sep 1), and 502s and OSDs with slow response times are happening a lot
<daniels>
bentiss: I just noticed that k3s-large-6 is dead in the water, so I'm kicking it now
<bentiss>
daniels: OK, thanks
<daniels>
(k3s-agent fell down a hole and wasn't able to communicate anything to the primary, which I guess is not unrelated to the 300 loadavg and multiple OOM kills ...)
<bentiss>
daniels: I am doing the pull of the backup on large-5, please do not kick that one until it's done
<daniels>
bentiss: yep!
<daniels>
(interestingly large-6 is currently logging connection failure to mon[02], 'connect error -101' which is ENETUNREACH and I guess means ... WireGuard dead?)
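[note] The reading of 'connect error -101' as ENETUNREACH can be checked with a short sketch. Kernel-style logs report errors as negated errno values, so the sign is dropped before the lookup. The helper name decode_kernel_error is hypothetical, and the numeric value 101 is the Linux assignment; other platforms number ENETUNREACH differently.

```python
import errno
import os


def decode_kernel_error(code: int) -> str:
    """Map a kernel-style negative return code (e.g. -101) to its errno name and message."""
    number = abs(code)  # kernel logs print errors as negated errno values
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name}: {os.strerror(number)}"


if __name__ == "__main__":
    # On Linux the name part of this output is ENETUNREACH.
    print(decode_kernel_error(-101))
```

The lookup goes through errno.errorcode rather than a hard-coded table, so it reports whatever the local platform assigns to that number.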
<bentiss>
daniels: large-6 is still not booted
<daniels>
yeah, coming through BIOS now
<daniels>
it took a long time to shut down as Ceph spent all the time trying and failing to reach the other hosts
<bentiss>
damn, the pull of the backup failed, and ceph is in HEALTH_ERR
<daniels>
hopefully it comes back to health after large-6 comes back and they become coherent again?
<bentiss>
yep
<bentiss>
right now all the large-5 OSDs are down too
<bentiss>
sigh, we probably want to kick large-5 too
<daniels>
:(
<daniels>
large-6 is happy and healthy running pods now
<daniels>
but they're all dying because postgres was on -5 :P
Nikky has joined #freedesktop
<bentiss>
fdo-k3s-large-5 NotReady -> all the matching pods are terminating, is that you or k8s magic?
<daniels>
not me!
<daniels>
I've left -5 completely alone
<bentiss>
k, so it's good to know that k8s can "heal" itself
<daniels>
well, sort of ...
<bentiss>
though I guess we should reboot the node because it can't remove the pods on the nodes it cannot contact
<daniels>
the failure mode I've seen a couple of times now is that k3s declares the node unhealthy, but the agent stays in a weird halfway state
<daniels>
right
<daniels>
so they're stuck in Terminating until the node comes back up
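[note] A minimal sketch of how the pods stuck in Terminating on an unreachable node could be listed. It assumes the input is the JSON printed by `kubectl get pods --all-namespaces -o json`, and relies on a pod marked for deletion carrying metadata.deletionTimestamp. The function name stuck_terminating and the sample pod names are hypothetical; only the node name comes from the discussion above.

```python
import json


def stuck_terminating(pod_list: dict, node: str) -> list:
    """Return namespace/name of pods on `node` that have been marked for deletion."""
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        spec = pod.get("spec", {})
        # A pod shown as Terminating has a deletionTimestamp set but still exists.
        if spec.get("nodeName") == node and "deletionTimestamp" in meta:
            stuck.append(f'{meta.get("namespace", "default")}/{meta["name"]}')
    return stuck


# Hypothetical sample shaped like `kubectl get pods -o json` output.
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"name": "pg-0", "namespace": "db",
                "deletionTimestamp": "2021-09-06T08:00:00Z"},
   "spec": {"nodeName": "fdo-k3s-large-5"}},
  {"metadata": {"name": "web-1", "namespace": "db"},
   "spec": {"nodeName": "fdo-k3s-large-5"}},
  {"metadata": {"name": "web-2", "namespace": "db",
                "deletionTimestamp": "2021-09-06T08:00:00Z"},
   "spec": {"nodeName": "fdo-k3s-large-6"}}
]}
""")

if __name__ == "__main__":
    print(stuck_terminating(SAMPLE, "fdo-k3s-large-5"))
```

This only reports the pods; it does not delete or reschedule anything, which matches the observation that they stay in Terminating until the node comes back up.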
<bentiss>
k, rebooted large-5
<daniels>
for whatever reason when this happens, k3s-agent is not totally inert but it does seem to completely lose the ability to communicate with the primary, and it also gets stuck with hanging containers
<daniels>
a soft reboot does clean the containers up properly; give ceph a few minutes to try and (often) fail to make things coherent, then at startup it's all happy again
<bentiss>
k, waiting on this to happen then
<daniels>
that's always (that I've seen) been correlated with a kernel soft-lockup in I/O (due to Ceph I guess) or an OOM kill, which I guess just leaves the wrong process unresponsive
Ai has joined #freedesktop
<bentiss>
"machine restart" :)
<bentiss>
daniels: BTW, I managed to trick gitaly regarding xrestop that was missing an object (by pushing it to a branch and then removing the branch)
<netbsduser>
i'd like to send a patch to get rid of these repeated inappropriate uses of "#ifdef __linux__" when it should really be checking for HAVE_SYSTEMD - or maybe not even that
<netbsduser>
as a systemd fork, my InitWare package provides these interfaces on BSD platforms, so the assumption that systemd == linux is invalid now
immibis has quit [Remote host closed the connection]
immibis has joined #freedesktop
<alanc>
netbsduser: this channel is mostly about the fd.o infrastructure, not the individual projects - I suspect dbus maintainers are more likely found on the GNOME irc server, but they've posted info on submitting changes at https://gitlab.freedesktop.org/dbus/dbus/-/blob/master/CONTRIBUTING.md