ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
ngcortes has joined #freedesktop
Leopold_ has quit [Remote host closed the connection]
Leopold__ has joined #freedesktop
genpaku has quit [Remote host closed the connection]
genpaku has joined #freedesktop
ybogdano has quit [Ping timeout: 480 seconds]
ngcortes has quit [Remote host closed the connection]
ximion has quit []
alanc has quit [Remote host closed the connection]
thaller has joined #freedesktop
alanc has joined #freedesktop
thaller has quit []
thaller has joined #freedesktop
<bentiss>
so. gitlab 15.4.1 is a security update, and I'm going to try applying it now. I'll take the opportunity to also move the db to the new DC
danvet has joined #freedesktop
<bentiss>
still having issues bringing back up the db
Leopold__ has quit [Remote host closed the connection]
itoral_ has joined #freedesktop
<bentiss>
alright, db back up. I had to put it on one of the servers, not the agents :/
MajorBiscuit has joined #freedesktop
<bentiss>
the webservice pods are slowly re-attaching themselves
<bentiss>
OK, now time for the security upgrade
<bentiss>
reverting to the old deployment, the db still can't be upgraded :(
vbenes has joined #freedesktop
mvlad has joined #freedesktop
<pq>
I'm getting a 500 on gitlab, FWIW. Reversion still on-going?
kj has joined #freedesktop
scrumplex_ has quit []
scrumplex has joined #freedesktop
rgallaispou has joined #freedesktop
rgallaispou has quit []
<bentiss>
I was reverting the failed upgrade, but it seems it broke gitlab...
<bentiss>
k, reenabled it, and we are good now
<bentiss>
I'll retry the 15.4.1 migration now. I should have solved the failed migration (and now I know how to solve it)
<mupuf>
bentiss: third time's the charm!
<pq>
seems to work right now, thanks! :-)
<bentiss>
at least the migration went further this time
<bentiss>
even completed!
<bentiss>
everything but the gitaly pods managed to migrate to the new gitlab version
<bentiss>
and they are all up now :)
<daniels>
bentiss: that's weird, what happened with db migration?
<bentiss>
the thing is the tables were changed, but the migration was not marked completed
<bentiss>
whenever I tried to restart the migration with the normal upgrade through helm the migration pod would fail because it tried to insert a column that was already there
<bentiss>
the solution was: 1. revert the deployment
<bentiss>
2. edit the toolbox deployment so it has the next gitlab release
<bentiss>
3. "gitlab-rake gitlab:db:mark_migration_complete[20220406193806]" to mark the migration as complete
<bentiss>
4. gitlab-rake db:migrate:down VERSION=20220406193806 -> that made the 500 on gitlab
<bentiss>
5. gitlab-rake db:migrate:up VERSION=20220406193806 -> solved the 500 and the failing migration
<bentiss>
6. redo the helm upgrade, and this time the migration went through
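A rough shell sketch of the recovery sequence above, assuming the GitLab Helm chart with its toolbox deployment (the release name, namespace, image tag and container name here are placeholders, not necessarily fd.o's actual values):

    # 1. revert the failed deployment
    helm -n gitlab rollback gitlab

    # 2. point the toolbox at the target gitlab release so its rake tasks match
    kubectl -n gitlab set image deploy/gitlab-toolbox \
        toolbox=registry.gitlab.com/gitlab-org/build/cng/gitlab-toolbox-ee:v15.4.1

    # 3-5. repair the half-applied migration from inside the toolbox pod
    kubectl -n gitlab exec -it deploy/gitlab-toolbox -- \
        gitlab-rake gitlab:db:mark_migration_complete[20220406193806]
    kubectl -n gitlab exec -it deploy/gitlab-toolbox -- \
        gitlab-rake db:migrate:down VERSION=20220406193806
    kubectl -n gitlab exec -it deploy/gitlab-toolbox -- \
        gitlab-rake db:migrate:up VERSION=20220406193806

    # 6. redo the chart upgrade; the migrations job should now run through
    helm -n gitlab upgrade gitlab gitlab/gitlab -f values.yaml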
<bentiss>
daniels: also we have to monitor the new large-* machines, they sometimes hit too many open files and kubelet ends up in a weird state. I have 2 fixes: a temporary one with "sysctl fs.inotify.max_user_instances=512", and "systemctl reboot"
<bentiss>
(I blame the ceph migration for now)
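For reference, the two fixes mentioned above, plus the usual way to make the sysctl bump stick across reboots (the sysctl.d drop-in is the standard mechanism, not necessarily what was done on the fd.o machines):

    # temporary fix on an affected node
    sysctl fs.inotify.max_user_instances=512

    # or, if kubelet is already wedged
    systemctl reboot

    # to persist the bump (standard location; an assumption, not fd.o's config)
    echo 'fs.inotify.max_user_instances = 512' > /etc/sysctl.d/90-inotify.conf
    sysctl --system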
<bentiss>
besides that, the machines are working well, but kilo is not showing the peers on the new servers, so that means we have to keep server-2 around for now
<daniels>
bentiss: aha wow, thanks a lot for the migration tip
* bentiss
learned it the hard way :)
<daniels>
hehe
<daniels>
I've never had one fail on me before!
<bentiss>
the last thing we also have to take care of is that there are still a few places where we force the pods onto s1.large, so they will fail to spin up now if we kill them
<bentiss>
the ceph-s3 pod is one of them (easy enough to fix, but I'd like to upgrade rook too so we can be k8s 1.25 compliant)
<bentiss>
and the gitlab toolbox, because we need local disks
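One way to hunt down the remaining pins would be something like the query below; it assumes the pinning is done with a plain nodeSelector and that the label value simply contains "s1.large", both of which are guesses:

    kubectl get deploy -A -o json \
      | jq -r '.items[]
          | select(.spec.template.spec.nodeSelector != null)
          | select(.spec.template.spec.nodeSelector | tostring | test("s1.large"))
          | "\(.metadata.namespace)/\(.metadata.name)"'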
rgallaispou has joined #freedesktop
rgallaispou has quit []
<bentiss>
the toolbox is slightly more worrying. I haven't partitioned the s3.xlarge correctly, and there is no spare disk available on them, so we need to reorganize that data on at least one server
rgallaispou has joined #freedesktop
<bentiss>
I think that's where I am now. Next step would be to also migrate minio-packet.fd.o, but that requires some more deployments :(
<daniels>
local disk for toolbox being for backup?
<bentiss>
daniels: yes, we need to have space for it to fetch all repos and then gzip them
<daniels>
yep
<bentiss>
and local is way better than distributed for that :)
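For context, the toolbox backup in the GitLab chart is normally driven by backup-utility, roughly like this (a sketch; fd.o's actual cron/job wiring may differ):

    # inside the toolbox pod: bundle every repo plus the other components,
    # tar them up on local scratch space, then push the result to object storage
    kubectl -n gitlab exec -it deploy/gitlab-toolbox -- backup-utility

That local scratch space is exactly what needs the non-distributed disk.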
<bentiss>
FWIW, I'll go AFK a bit, the data still hasn't fully migrated yet (only 6% left), so there is not much I can do now on that
<daniels>
hehe yeah, backups over ceph would be a bit hostile ...
* daniels
nods, thanks
<daniels>
I have to go afk for a while too tbf
fahien has joined #freedesktop
fahien has quit []
AbleBacon has quit [Read error: Connection reset by peer]
fahien has joined #freedesktop
Guest1750 is now known as frytaped
thaller is now known as Guest1848
thaller has joined #freedesktop
Guest1848 has quit [Ping timeout: 480 seconds]
rgallaispou has quit [Read error: Connection reset by peer]
itoral_ has quit []
Kayden has quit [Read error: Connection reset by peer]
Kayden has joined #freedesktop
chipxxx has joined #freedesktop
fahien has quit [Ping timeout: 480 seconds]
Leopold_ has joined #freedesktop
fahien has joined #freedesktop
GNUtoo has quit [Remote host closed the connection]
GNUtoo has joined #freedesktop
ximion has joined #freedesktop
Haaninjo has joined #freedesktop
Leopold___ has joined #freedesktop
Leopold___ has quit [Remote host closed the connection]
<eric_engestrom>
^ that's when uploading artifacts, to be more specific
<eric_engestrom>
bentiss: ^
<glehmann>
I'm getting 500s when using gitlab search
alatiera has quit [Quit: Ping timeout (120 seconds)]
alatiera has joined #freedesktop
alyssa has joined #freedesktop
<alyssa>
getting error 500 commenting on MRs, is something up?
<bentiss>
all the pods are up, so I wonder what is happening there
<alyssa>
gitlab.exceptions.GitlabCreateError: 500: 500 Internal Server Error
<alyssa>
(from python)
<alyssa>
Your comment could not be submitted! Please check your network connection and try again.
<alyssa>
(from the web ui)
<MTCoster>
Just ran into the same issue creating an MR
<bentiss>
postgres: HINT: Check free disk space.
<alyssa>
uh oh
<bentiss>
sigh, now we are running out of space :)
<bentiss>
give me a sec
<alyssa>
have you tried turning it off and on again
<zmike>
no, definitely don't turn it off
<bentiss>
I can just request the disk to be bigger :)
<bentiss>
which I did
rkanwal has joined #freedesktop
<bentiss>
now we need to wait for kubernetes magic to happen
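The "magic" here is just a PVC resize; a minimal sketch, assuming the PVC name and namespace below (placeholders) and a storage class with allowVolumeExpansion enabled:

    # ask for a bigger volume
    kubectl -n gitlab patch pvc data-gitlab-postgresql-0 \
      -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

    # watch until the new capacity shows up
    kubectl -n gitlab get pvc data-gitlab-postgresql-0 -w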
<alyssa>
download free disk space
<bentiss>
that's roughly it :)
<bentiss>
and done 30 GB more :)
<bentiss>
ok, now I think I can go on weekend
<alyssa>
it works :sparkles:
<alyssa>
thank u
<bentiss>
sigh... when I did the first migration, the postgres disk was 50GB. I just bumped it to 200GB, in a little over 2 years :(
<alyssa>
150 GB in 2 years produced by hundreds of full time engineers, and I assume GitLab's schema is not super dense ... doesn't seem so strange
<alyssa>
especially if we're keeping around all the random apitraces and videos that get attached to issues
<daniels>
uploads are separate, we don't store all that in psql
<daniels>
I'd put my money on build logs
<zmike>
so you're saying it's my fault
<daniels>
yeah I mean fdo was doing great until you came on the scene
<bentiss>
definitely ci logs, they are all stored in plain text in the db
<bentiss>
though it's weird, I thought they were in the S3 storage... but I know there is something related to ci jobs in the db that takes all the space
<daniels>
iirc they get put as a whole into artifact storage, but then they also get split into their sections and all those are shoved into psql
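A read-only way to check which tables are actually eating the space, assuming direct psql access to the postgres pod (the pod, user and database names below are the chart defaults, used here as placeholders):

    kubectl -n gitlab exec -it gitlab-postgresql-0 -- \
      psql -U gitlab -d gitlabhq_production -c "
        SELECT relname,
               pg_size_pretty(pg_total_relation_size(relid)) AS total_size
        FROM pg_statio_user_tables
        ORDER BY pg_total_relation_size(relid) DESC
        LIMIT 10;"

If the CI theory is right, ci_* tables should dominate the top of that list.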
<alyssa>
aren't build logs rotated after 30 days anyway?
<alyssa>
zmike: sorry zink has to go
* zmike
updates his presentation
<daniels>
alyssa: yep ...
chipxxx has quit [Read error: Connection reset by peer]
<alyssa>
zmike: think of how many times the word "zink" appears in build logs
<alyssa>
that's so many bytes
* zmike
counts the letters in panfrost and multiplies it by the number of panfrost jobs
<alyssa>
how dare u
<alyssa>
well what about uhhh
* alyssa
sweats
<alyssa>
freedreno
<daniels>
we execute ~12k builds per day fwiw
<DavidHeidelberg[m]>
"how dare u" ‒ Greta, after seeing Mesa3D CI farm...
rkanwal has quit [Quit: rkanwal]
<alyssa>
same
<alyssa>
that and XDC in person
jstein has joined #freedesktop
ybogdano has joined #freedesktop
<DragoonAethis>
discussions like these make me happy the Intel CI farm is behind 7 firewalls and nobody can see the state of that
Leopold___ has quit [Write error: connection closed]
MajorBiscuit has quit [Quit: WeeChat 3.5]
mvlad has quit [Remote host closed the connection]
ngcortes has joined #freedesktop
jstein has quit [Ping timeout: 480 seconds]
fahien has quit [Quit: fahien]
mattst88 has joined #freedesktop
<mattst88>
o/
<mattst88>
I've been moderating the mesa-announce@ list for years, and recently there's been a huge uptick in spam
<mattst88>
it's almost all from garbage TLDs like .lol and .click
<mattst88>
.quest, .art, .icu
<mattst88>
it's been interesting to learn that so many stupid TLDs exist, but is there any way we can more effectively block these so I don't have to discard so many messages every day?
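One blunt option, assuming the lists are still on Mailman 2: a header filter rule (Privacy options -> Spam filters -> header_filter_rules) set to Discard, with a regex along these lines, so those messages never reach the moderation queue (the TLD list is just a guess at what keeps showing up):

    ^From:.*@[^>]*\.(lol|click|quest|art|icu)\b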
<mattst88>
in a similar vein, I stopped receiving mail from xorg-announce@ in July (last message received was [ANNOUNCE] xf86-video-nv 2.1.22 on July 27)
<mattst88>
any idea why? Is gmail just dropping these messages now? I don't even see them in spam
alyssa has left #freedesktop [#freedesktop]
ngcortes has quit [Remote host closed the connection]
AbleBacon has joined #freedesktop
<daniels>
yeah gmail is very skeptical about our messages
<alanc>
there was a suggestion a while back that it was something to do with our reply-to header or something like that, but I forgot what it was
<alanc>
oh, right: probably because it adds a Reply-To: header, which breaks the DKIM signature
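A quick way to confirm that theory would be to save one delivered list message and look at which headers the sender's signature covers; if reply-to appears in the h= tag (many senders over-sign headers they don't set), then Mailman adding Reply-To: is enough to break verification (msg.eml is a placeholder filename):

    grep -i -A5 '^DKIM-Signature:' msg.eml | tr ';' '\n' | grep ' h='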
anholt_ has quit [Remote host closed the connection]
anholt has joined #freedesktop
Haaninjo has quit [Quit: Ex-Chat]
ngcortes has joined #freedesktop
<Mithrandir>
gmail recently started enforcing SPF a lot more strictly too, but that should just look at envelope sender, so _shouldn't_ matter?
jstein has joined #freedesktop
jstein has quit []
<zmike>
is there a reason why all my MRs say "Merge blocked: pipeline must succeed. It's waiting for a manual action to continue." ?