daniels changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
jokester1365 has quit [Remote host closed the connection]
privacy has quit [Quit: Leaving]
manuels2 has quit [Quit: Ping timeout (120 seconds)]
manuels2 has joined #freedesktop
bmodem has joined #freedesktop
Juest is now known as Guest8535
Juest has joined #freedesktop
Guest8535 has quit [Ping timeout: 480 seconds]
tlwoerner has quit [Remote host closed the connection]
tlwoerner has joined #freedesktop
Leopold_ has joined #freedesktop
Shibe has quit [Remote host closed the connection]
vx has quit [Quit: G-Line: User has been permanently banned from this network.]
vx has joined #freedesktop
Guest8266 is now known as DrNick
<bilboed>
IIUC, at the red step the DB already had the data?
AbleBacon has quit [Read error: Connection reset by peer]
alice has quit [Ping timeout: 480 seconds]
Leopold_ has quit [Remote host closed the connection]
<bentiss>
bilboed: nah, the new config confused the script: it thought it had the data when it hadn't
Haaninjo has joined #freedesktop
<bilboed>
Ok. Looks like things are coming along nicely
<bentiss>
yeah, I'm going to restore the service and run a couple of tests. I might have to revert to the last backup, so please don't start working on it
<bilboed>
👍
samuelig has quit []
samuelig has joined #freedesktop
<bentiss>
hmm... the gitlab webservice pods keep crashing
<pq>
It feels like a holiday while gitlab is down... no rush :-)
<bentiss>
I have some errors in the db, running reindex on it
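For context, the reindex mentioned here is a plain PostgreSQL REINDEX. A minimal sketch of driving it from a script, assuming hypothetical connection details and the usual GitLab database name (neither is confirmed by the log):

```python
# Sketch only: reindex the whole GitLab database with psycopg2.
# Host, user and database name are assumptions, not the real deployment values.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="gitlabhq_production", user="postgres")
conn.autocommit = True  # REINDEX DATABASE cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute("REINDEX DATABASE gitlabhq_production;")
conn.close()
```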
hec2233 has joined #freedesktop
hec2233 has quit []
ehfd[m] has joined #freedesktop
smpl has joined #freedesktop
jani has quit []
jani has joined #freedesktop
<svuorela>
bentiss: done?
<bentiss>
svuorela: nope :(
<bentiss>
the webservice pods are still crashing
<svuorela>
oh. Just got actual pages when accidentally reloading a tab.
<bentiss>
yeah, it keeps going up and down
mvlad has joined #freedesktop
<bentiss>
I've restarted the 2 db pods and migrated them to different nodes, that seems better
<bentiss>
not really...
Shibe has joined #freedesktop
Haaninjo has quit [Quit: Ex-Chat]
<bentiss>
daniels: I can't seem to make the webservice pods stay up
<bentiss>
attempting to restart gitaly pods, as they might still be confused since the db split (I haven't restarted them since)
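Restarting the gitaly pods here presumably means a rolling restart of their workload. A sketch of the usual trick (bumping a pod-template annotation, which is what `kubectl rollout restart` does); the namespace and StatefulSet name are assumptions:

```python
# Sketch: rolling-restart a workload by bumping a pod-template annotation.
# Namespace and StatefulSet name are assumptions.
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {"spec": {"template": {"metadata": {"annotations": {
    "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat(),
}}}}}
apps.patch_namespaced_stateful_set(name="gitlab-gitaly", namespace="gitlab", body=patch)
```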
<daniels>
bentiss: I'm just coming to work (very slowly; turns out walking on a broken tire isn't much fun) so can have a look in a little bit
<bentiss>
daniels: ok, thanks. it's about lunch time here...
<bentiss>
daniels: long story short: the webservice pods keep going down and up (from 1/2 to 2/2 to 1/2)
<bentiss>
daniels: it could come from the db or the webservice configuration, not sure
<bentiss>
daniels: one solution would be to redeploy today's backup (it takes ~5h), disable db split and see if that changes anything
<bentiss>
we can do that while keeping the current db around as a fallback (we keep the pv around, and just recreate the pvc)
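The "keep the pv around, recreate the pvc" idea hinges on the volume's reclaim policy: the PV must be set to Retain before its claim is deleted, otherwise the data goes with it. A sketch, with the PV name purely hypothetical:

```python
# Sketch: make a PV survive deletion of its PVC by switching the reclaim
# policy to Retain first. The PV name is a made-up placeholder.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

core.patch_persistent_volume(
    name="pvc-0123abcd-old-main-db",  # hypothetical PV backing the old db
    body={"spec": {"persistentVolumeReclaimPolicy": "Retain"}},
)
# After the old PVC is deleted, the PV shows up as Released; re-binding a new
# PVC to it requires clearing spec.claimRef (not shown here).
```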
<bentiss>
daniels: also as soon as I disable ingress towards the webservice pods, they are back up. So it's load related
<bentiss>
I've tested spinning up and down the number of replicas, but same result
<bentiss>
and of course, nothing interesting in the logs I could find
<bentiss>
let me disable webservice :(
<bentiss>
I'm going to restore the db dump before the split. We'll see if that helps
<daniels>
bentiss: hmmm ok, I was going to go stare at the logs (wow that walk was really slow) - did you see anything?
<bentiss>
not much no...
<bentiss>
daniels: I've kept the pv from the db split. So once that restore ends, we can go live with one or the other
<daniels>
nice, thanks a lot
<bentiss>
the db is restoring (manually with pg_restore, this is faster), and I'm going afk a bit
<bentiss>
(lunch)
<daniels>
bon appetit :)
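One reason a manual pg_restore tends to be faster is that it can load the dump with several parallel jobs. A rough sketch of such an invocation; the host, user, dump path and job count are all assumptions, not the actual commands used here:

```python
# Sketch: parallel pg_restore of a custom-format dump.
# Host, user, dump path and database name are assumptions.
import subprocess

subprocess.run(
    [
        "pg_restore",
        "--clean", "--if-exists",   # drop existing objects before recreating them
        "-j", "8",                   # parallel restore jobs (custom/directory format only)
        "-h", "gitlab-postgresql",  # hypothetical service name
        "-U", "gitlab",
        "-d", "gitlabhq_production",
        "/backups/gitlabhq_production.dump",
    ],
    check=True,
)
```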
vkareh has joined #freedesktop
bmodem has quit [Ping timeout: 480 seconds]
guludo has joined #freedesktop
<ndufresne>
Was this expected to take that long?
<karolherbst>
the downtime was planned to be 48 hours
<ndufresne>
ok, so far so good, is there an ML I perhaps should sign up for?
<karolherbst>
it was communicated via banners on gitlab
<ndufresne>
ah, I should have checked the day of the week, I always assume this is happening over the weekend
<karolherbst>
but maybe it makes sense to add an ETA on the maintenance page in the future as well?
<karolherbst>
bentiss: ^^
<karolherbst>
or rather.. the planned ETA
<ehfd[m]>
karolherbst: I think it's pretty unpredictable.
<karolherbst>
sure, but the point is to give drive-by people some kind of answer
<zmike>
oh no we hit a goto 80%
lsd|2 has joined #freedesktop
teemperor has joined #freedesktop
<teemperor>
question: is there an ETA for the gitlab upgrade? it seems to have been stuck for a while
lsd|2 has quit []
pocek[m] has joined #freedesktop
<bentiss>
karolherbst: OK, added the planned outage window
<karolherbst>
thanks :)
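The banner with the planned window was presumably set through GitLab's admin UI; the same thing can also be scripted through the broadcast-messages API. A sketch, with the token, message text and timestamps as placeholders only:

```python
# Sketch: set a maintenance banner with the planned outage window via the
# GitLab broadcast messages API. Token, message and timestamps are placeholders.
import requests

requests.post(
    "https://gitlab.freedesktop.org/api/v4/broadcast_messages",
    headers={"PRIVATE-TOKEN": "<admin-token>"},
    data={
        "message": "GitLab maintenance in progress; planned window ends at <time> UTC",
        "starts_at": "2023-01-01T08:00:00Z",  # placeholder
        "ends_at": "2023-01-02T08:00:00Z",    # placeholder
    },
    timeout=30,
)
```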
<bentiss>
teemperor: no ETA, no. Things did not go as well as expected. We have to revert part of the changes, but if the problem is different, that might require another 5h of db restore
<bl4ckb0ne>
are you still going with the db split?
<bentiss>
I'm going to see if the webservice pods work better without the split, and decide after that
<bentiss>
but it's unlikely we will be able to re-split given the time left
<karolherbst>
:'(
<karolherbst>
well at least you have backups and could try locally to see what the problem is or something
<karolherbst>
(after getting the system live again)
<bentiss>
karolherbst: if the problems are unrelated to the db split and we fix them I can always revert to the split db
<karolherbst>
fair enough
<bentiss>
that takes a couple of seconds
lsd|2 has joined #freedesktop
<bentiss>
but the moving target was: new postgresql, an upgrade of the bitnami chart (arguably only a minor upgrade), and the db split
<karolherbst>
yeah...
<bentiss>
my bet right now is on the db split, but maybe it's a problem with the new chart or postgres
<karolherbst>
maybe those things should be done individually even if that means downtime more often
<bl4ckb0ne>
thanks! fingers crossed
<bentiss>
daniels: db restore done (in theory)
<bentiss>
so far, it's holding the load
<karolherbst>
that's now without the split db, right?
<bentiss>
yeah
<karolherbst>
kinda sad, but I guess next time
<zmike>
:(
<ehfd[m]>
Probably GitLab didn't harden the split DB scheme enough...
<bentiss>
though one difference is that this time I ran the migration job (because I forgot to opt out)
<bentiss>
but yeah, maybe it's better to keep it that way
<karolherbst>
though it did error when you tried to do the split, no?
<bentiss>
yeah, but it was related to my config change, I went too far in the config because I wanted to have 2 separate postgres, not just one
<bentiss>
so I set up the change first, and then when I started the process it happily told me that the config was already there so it should stop
<daniels>
so the migrate job was designed for having split dbs but on the same host?
<bentiss>
while the data wasn't
<bentiss>
yeah, so far, yeah, same host
<bentiss>
daniels: happy to keep it that way?
<bentiss>
it feels so much smoother now that it's not timing out every request :)
<karolherbst>
heh
<karolherbst>
wait until people use it
<daniels>
hahaha
<daniels>
yeah, I mean, working > not working for sure
<daniels>
I guess before we try again we should probably try the backups with the adjusted config on a shadow host to figure out wtf is making the webservice die?
<zmike>
so maintenance is done?
<bentiss>
yeah, good point. I can deploy a pvc with the old main db (split) so it's not lost
<bentiss>
zmike: I need to remember if all the pieces are together
<bentiss>
and I had a short night :)
<zmike>
😅
<zmike>
just want to make sure it's done before I start using it
<karolherbst>
marge is already busy
<karolherbst>
bentiss: maybe marge triggering CI stuff broke it
<bentiss>
nah, it wasn't even able to start because it failed at pulling its container
<karolherbst>
I see
<daniels>
bentiss: errr
<daniels>
oh, Marge, right :)
<daniels>
not webservice, heh
<bentiss>
yeah webservice was starting but timing out a lot, and then kubelet restarted it every 30 secs
<daniels>
hm, not answering on /ready?
<bentiss>
at least, we upgraded postgres and gitlab, so that's still positive :)
<daniels>
yeah :)
<bentiss>
I extended the timeout from 2 to 20 secs, and it was still randomly shooting down pods
<bentiss>
though 20s was terrible
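For reference, the 2s to 20s change described above would typically be a readiness-probe timeout on the webservice deployment. A sketch of what bumping it could look like; the deployment, namespace and container names are assumptions:

```python
# Sketch: raise the readiness probe timeout on the webservice deployment.
# Deployment, namespace and container names are assumptions.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {"spec": {"template": {"spec": {"containers": [{
    "name": "webservice",  # hypothetical container name
    "readinessProbe": {"timeoutSeconds": 20},
}]}}}}
apps.patch_namespaced_deployment(name="gitlab-webservice-default",
                                 namespace="gitlab", body=patch)
```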
<daniels>
also, I wonder, instead of doing pg_dump to pull the db content, would it be safe to snapshot the PV?
<bentiss>
we cannot upgrade the db this way
<daniels>
yeah, not across major versions
<bentiss>
between major postgres versions, we need to reload all of the data
<daniels>
right
<daniels>
I meant outside of the major upgrades, like if we want to try db split tests again
<bentiss>
that's weird, I don't see the postgres errors I was having after the db split
<daniels>
snapshot PV -> provision new PV with snapshot -> test migration from there
<bentiss>
yeah
<daniels>
might save a few hours relative to pg_dump/pg_restore?
<bentiss>
we can also rely on the latest daily backup
<bentiss>
still takes time to provision though
<bentiss>
anyway, something for next time
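The snapshot idea in this exchange relies on the CSI snapshot API: snapshot the database PVC, then provision a throwaway PVC from that snapshot for migration testing. A rough sketch, assuming the cluster has a CSI driver with snapshot support; every name and size below is a placeholder:

```python
# Sketch: snapshot the db PVC and clone it into a new PVC for testing.
# Namespace, PVC name, snapshot class and size are all placeholders.
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()
core = client.CoreV1Api()

# 1. Snapshot the existing database PVC.
custom.create_namespaced_custom_object(
    group="snapshot.storage.k8s.io", version="v1",
    namespace="gitlab", plural="volumesnapshots",
    body={
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": "db-presplit-snap"},
        "spec": {
            "volumeSnapshotClassName": "csi-snapclass",
            "source": {"persistentVolumeClaimName": "data-gitlab-postgresql-0"},
        },
    },
)

# 2. Provision a test PVC from the snapshot.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="db-split-test"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        resources=client.V1ResourceRequirements(requests={"storage": "500Gi"}),
        data_source=client.V1TypedLocalObjectReference(
            api_group="snapshot.storage.k8s.io",
            kind="VolumeSnapshot",
            name="db-presplit-snap",
        ),
    ),
)
core.create_namespaced_persistent_volume_claim(namespace="gitlab", body=pvc)
```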
<bentiss>
... and the banner is gone
<DragoonAethis>
\o/
<karolherbst>
nice
<daniels>
ack
<daniels>
thanks!
<bl4ckb0ne>
thanks o/
<karolherbst>
we celebrate by DDOSing gitlab with everybody pushing the work they've done since Monday
* DragoonAethis
flips 10 heavy CI jobs back on
<bentiss>
go ahead!
<karolherbst>
oh no
<karolherbst>
it's already slowing down
* bentiss
doesn't think so
<karolherbst>
mhh, yeah, seems to be still quicker than before actually
teemperor has quit [Remote host closed the connection]
<dwfreed>
they even used it themselves in 2020 when upgrading gitlab.com's postgres
<bentiss>
one benefit of backup/restore is that it cleans the db a lot and it shrinks the overall size
<bentiss>
not sure we've got a vacuum enabled and running properly
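A quick way to answer the vacuum question is to ask postgres directly whether autovacuum is on and when the busiest tables were last vacuumed. A sketch, with the connection details assumed:

```python
# Sketch: check autovacuum status and recent (auto)vacuum activity.
# Connection details are assumptions.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="gitlabhq_production", user="postgres")
with conn.cursor() as cur:
    cur.execute("SHOW autovacuum;")
    print("autovacuum =", cur.fetchone()[0])

    cur.execute("""
        SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 10;
    """)
    for row in cur.fetchall():
        print(row)
conn.close()
```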
<bentiss>
anyway if anybody feels like they want to have a look, we welcome new admins :)
<dwfreed>
I know nothing about gitlab
<dwfreed>
Not sure I want to learn, either
<daniels>
it's a lab for git. any questions?
bmodem has quit [Ping timeout: 480 seconds]
<DodoGTA>
So is the GitLab database currently split (or not)?
<daniels>
it's not currently split
AbleBacon has quit [Read error: Connection reset by peer]
alice has joined #freedesktop
dcunit3d has quit [Quit: Quitted]
dcunit3d has joined #freedesktop
dcunit3d has quit [Remote host closed the connection]
alice_ has joined #freedesktop
alice has quit [Ping timeout: 480 seconds]
ximion has joined #freedesktop
dcunit3d has joined #freedesktop
pboushy has joined #freedesktop
lsd|2 has joined #freedesktop
pboushy has quit [Remote host closed the connection]
pboushy has joined #freedesktop
<kusma>
It seems to me like gitlab no longer sends out email notifications, could that be?
<daniels>
no
<daniels>
I'm still getting them
<kusma>
Hmm, strange. I got one at 16:38 CEST, and then silence. And going through the activity, I see new MRs that I should have been notified about...
<daniels>
I've got them as recently as a couple of minutes ago
<daniels>
it is very possible they're not getting delivered to _you_, but yeah, gmail is pretty capricious
<kusma>
Hmm. Seems a subscription I thought I had was missing.
<kusma>
That seems to pre-date the upgrade, so probably PEBCAK
pboushy has quit [Remote host closed the connection]
pboushy has joined #freedesktop
pboushy has quit [Remote host closed the connection]
mripard has quit [Remote host closed the connection]
konstantin_ has joined #freedesktop
konstantin is now known as Guest8614
konstantin_ is now known as konstantin
Guest8614 has quit [Ping timeout: 480 seconds]
Haaninjo has joined #freedesktop
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
alice_ is now known as alice
mvlad has quit [Remote host closed the connection]