ChanServ changed the topic of #freedesktop to:
danvet has quit [Ping timeout: 480 seconds]
pendingchaos_ has joined #freedesktop
pendingchaos has quit [Ping timeout: 480 seconds]
Thymo has joined #freedesktop
Thymo_ has quit [Ping timeout: 480 seconds]
Thymo has quit [Remote host closed the connection]
Thymo has joined #freedesktop
pendingchaos_ has quit [Ping timeout: 480 seconds]
pendingchaos has joined #freedesktop
Seirdy has quit [Ping timeout: 480 seconds]
Seirdy has joined #freedesktop
jarthur has joined #freedesktop
jarthur has quit [Ping timeout: 480 seconds]
ximion has quit []
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
<bentiss> daniels: FWIW, the 502 errors are definitely happening when we are uploading/downloading the backups over ceph
<bentiss> there were almost no 502 errors for the past few days, but I realized this morning that the backups had been failing for the past 3 days
<bentiss> I am pulling the last backup (from Sep 1), and 502 errors and OSDs with slow response times are happening a lot
<daniels> bentiss: I just noticed that k3s-large-6 is dead in the water, so I'm kicking it now
<bentiss> daniels: OK, thanks
<daniels> (k3s-agent fell down a hole and wasn't able to communicate anything to the primary, which I guess is not unrelated to the 300 loadavg and multiple OOM kills ...)
<bentiss> daniels: I am doing the pull of the backup on large-5, please do not kick that one until it's done
<daniels> bentiss: yep!
<daniels> (interestingly large-6 is currently logging connection failure to mon[02], 'connect error -101' which is ENETUNREACH and I guess means ... WireGuard dead?)
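(For reference, a quick check of what Ceph's 'connect error -101' maps to; a minimal sketch in Python, assuming Linux errno numbering, since Python's errno module just mirrors the platform's values.)

```python
import errno
import os

# Ceph logs syscall failures as negative errno values, so 'connect error -101'
# is errno 101. On Linux that is ENETUNREACH ("Network is unreachable"),
# which is why a dead WireGuard tunnel between the nodes is a plausible cause.
print(errno.ENETUNREACH)                 # 101 on Linux (the value differs on other OSes)
print(os.strerror(errno.ENETUNREACH))    # "Network is unreachable"
```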
<bentiss> daniels: large-6 is still not booted
<daniels> yeah, coming through BIOS now
<daniels> it took a long time to shut down as Ceph spent all the time trying and failing to reach the other hosts
<bentiss> damn, the pull of the backup failed, and ceph is in HEALTH_ERR
<daniels> hopefully it comes back to health after large-6 comes back and they become coherent again?
<bentiss> yep
<bentiss> right now all the large-5 OSDs are down too
<bentiss> sigh, we probably want to kick large-5 too
<daniels> :(
<daniels> large-6 is happy and healthy running pods now
<daniels> but they're all dying because postgres was on -5 :P
Nikky has joined #freedesktop
<bentiss> fdo-k3s-large-5 NotReady -> all the matching pods are terminating, is that you or k8s magic?
<daniels> not me!
<daniels> I've left -5 completely alone
<bentiss> k, so it's good to know that k8s can "heal" itself
<daniels> well, sort of ...
<bentiss> though I guess we should reboot the node, because k8s can't remove the pods on nodes it cannot contact
<daniels> the failure mode I've seen a couple of times now is that k3s declares the node unhealthy, but the agent stays in a weird halfway state
<daniels> right
<daniels> so they're stuck in Terminating until the node comes back up
<bentiss> k, rebooted large-5
<daniels> for whatever reason when this happens, k3s-agent is not totally inert but it does seem to completely lose the ability to communicate with the primary, and it also gets stuck with hanging containers
<daniels> a soft reboot does clean the containers up properly; give it a few minutes for ceph to try and (often) fail to make things coherent, then at startup it's all happy again
<bentiss> k, waiting on this to happen then
<daniels> that's always (that I've seen) been correlated with a kernel soft-lockup in I/O (due to Ceph I guess) or an OOM kill, which I guess just leaves the wrong process unresponsive
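(A minimal sketch of how to spot that state from the API side, assuming the official `kubernetes` Python client and a reachable kubeconfig; nothing here is hard-coded from the log.)

```python
from kubernetes import client, config

# List NotReady nodes and the pods stuck in Terminating on them, i.e. pods
# that carry a deletion timestamp but are still listed because the kubelet
# on the dead node can never confirm the deletion.
config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    ready = next(
        (c.status for c in node.status.conditions or [] if c.type == "Ready"),
        "Unknown",
    )
    if ready != "True":
        name = node.metadata.name
        print(f"{name}: Ready condition is {ready}")
        pods = v1.list_pod_for_all_namespaces(
            field_selector=f"spec.nodeName={name}"
        )
        for pod in pods.items:
            if pod.metadata.deletion_timestamp is not None:
                print(f"  stuck Terminating: {pod.metadata.namespace}/{pod.metadata.name}")
```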
Ai has joined #freedesktop
<bentiss> "machine restart" :)
<bentiss> daniels: BTW, I managed to trick gitaly regarding xrestop, which was missing an object (by pushing the object to a branch and then removing the branch)
<bentiss> the only repo that is still failing (and that prevents the backup from running) is https://gitlab.freedesktop.org/JKRhb/MUD-Files
<bentiss> maybe we should just clear that one, and ask the owner to push
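(Roughly what the gitaly trick above amounts to, as a sketch; the commit id and branch name are placeholders, and it assumes a local clone that still has the object the server lost.)

```python
import subprocess

def git(*args, cwd="."):
    """Run a git command in the local clone and fail loudly on error."""
    subprocess.run(["git", *args], cwd=cwd, check=True)

# Placeholders: the real commit id and repo are not in the log.
missing_commit = "0123456789abcdef0123456789abcdef01234567"
tmp_branch = "tmp/restore-missing-object"

# Push the commit so the server's object database gets the missing object back,
# then delete the temporary branch; the objects stay on the server (until an
# aggressive gc prunes unreachable ones), which is enough to unwedge gitaly.
git("push", "origin", f"{missing_commit}:refs/heads/{tmp_branch}")
git("push", "origin", "--delete", tmp_branch)
```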
<daniels> woohoo, first webservice pod back
<bentiss> daniels: I am willing to move postgresql to another node
<bentiss> and gitaly too FWIW
<bentiss> putting them on the server-*
<daniels> lol, I was wondering why they both suddenly just disappeared after having started successfully :P
<daniels> thanks
<bentiss> just in case the large-* nodes get stuck in limbo; that should prevent some of the 502s, maybe
<daniels> ack
<daniels> let's find out :)
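(One way to do that move through the Kubernetes API, as a sketch; the node label, namespace, and workload names below are assumptions, not what the cluster actually uses.)

```python
from kubernetes import client, config

# Pin the stateful workloads to the server-* nodes so a wedged large-* agent
# can no longer take postgres/gitaly (and therefore the webservice) down.
config.load_kube_config()
apps = client.AppsV1Api()

node_selector_patch = {
    "spec": {
        "template": {
            "spec": {
                # Hypothetical label distinguishing fdo-k3s-server-* nodes
                # from the large-* agents.
                "nodeSelector": {"node-role.freedesktop.org/server": "true"}
            }
        }
    }
}

# Workload names and namespace are placeholders.
for name in ("postgresql", "gitaly"):
    apps.patch_namespaced_stateful_set(name, "gitlab", node_selector_patch)
```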
karthanistyr[m] has joined #freedesktop
Ai has quit []
Ai has joined #freedesktop
Nikky has quit []
Ai has quit []
Nikky has joined #freedesktop
<bentiss> daniels: what do we do about https://gitlab.freedesktop.org/JKRhb/MUD-Files ?
<daniels> bentiss: shrug, I think just email the author and say 'sorry'
<bentiss> sounds like a plan
<bentiss> k, I killed that project and restarted the backup
<daniels> thank you!
* daniels crosses everything
<bentiss> heh
<bentiss> daniels: FWIW, the backup managed to create the tar, and is now uploading, so fixing those 2 repos was helpful
<daniels> \o/
ximion has joined #freedesktop
thaller has joined #freedesktop
thaller_ has quit [Ping timeout: 480 seconds]
netbsduser`` has left #freedesktop [#freedesktop]
netbsduser has joined #freedesktop
<netbsduser> I'd like to send a patch to get rid of these repeated inappropriate uses of "#ifdef __linux__" when it should really be checking for HAVE_SYSTEMD - or maybe not even that
<netbsduser> as a systemd fork, my InitWare package provides these interfaces on BSD platforms, so the assumption that systemd == linux is invalid now
immibis has quit [Remote host closed the connection]
immibis has joined #freedesktop
<alanc> netbsduser: this channel is mostly about the fd.o infrastructure, not the individual projects - I suspect dbus maintainers are more likely found on the GNOME irc server, but they've posted info on submitting changes at https://gitlab.freedesktop.org/dbus/dbus/-/blob/master/CONTRIBUTING.md
bcarvalho__ has quit []
strugee has quit [Quit: ZNC - http://znc.in]
strugee has joined #freedesktop