ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
Guest11598 has quit [Ping timeout: 480 seconds]
krastevm has joined #freedesktop
haaninjo has quit [Quit: Ex-Chat]
martink has quit [Ping timeout: 480 seconds]
blu has joined #freedesktop
krastevm has quit [Ping timeout: 480 seconds]
gachikuku has joined #freedesktop
scrumplex has joined #freedesktop
strugee has quit [Quit: ZNC - http://znc.in]
scrumplex_ has quit [Ping timeout: 480 seconds]
strugee has joined #freedesktop
kem has left #freedesktop [Leaving]
JanC is now known as Guest11606
JanC has joined #freedesktop
Guest11606 has quit [Ping timeout: 480 seconds]
kode54 has quit [Quit: The Lounge - https://thelounge.chat]
kode54 has joined #freedesktop
ximion1 has quit [Remote host closed the connection]
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
qaqland has joined #freedesktop
eluks has quit [Remote host closed the connection]
eluks has joined #freedesktop
swatish2 has joined #freedesktop
swatish2 has quit [Ping timeout: 480 seconds]
swatish2 has joined #freedesktop
gnuiyl has quit [Remote host closed the connection]
gnuiyl has joined #freedesktop
sghuge has quit [Remote host closed the connection]
sghuge has joined #freedesktop
mvlad has joined #freedesktop
sima has joined #freedesktop
swatish2 has quit [Ping timeout: 480 seconds]
jsa1 has joined #freedesktop
swatish2 has joined #freedesktop
tzimmermann has joined #freedesktop
mripard has joined #freedesktop
swatish2 has quit [Ping timeout: 480 seconds]
ximion has joined #freedesktop
swatish2 has joined #freedesktop
ximion has quit [Remote host closed the connection]
karolherbst8 has joined #freedesktop
karolherbst has quit [Read error: Connection reset by peer]
AbleBacon has quit [Read error: Connection reset by peer]
<bentiss> daniels, MrCooper or anyone using ci-stats-grafana.fd.o: I've migrated the ci-stats namespace, can someone check that it's correct?
<bentiss> (of course, signing in with gitlab won't work)
<MrCooper> haven't really used grafana in a long time
<bentiss> at least from the public view I can still see the data :)
<bentiss> looks like shutting down ci-stats-influxdb2 on equinix made everything disappear :)
<bentiss> DNS issue I would say
<bentiss> there we go
JerryXiao has quit [Remote host closed the connection]
JerryXiao has joined #freedesktop
karolherbst8 has quit []
karolherbst has joined #freedesktop
haaninjo has joined #freedesktop
haaninjo has quit [Quit: Ex-Chat]
qaqland has quit [Remote host closed the connection]
qaqland has joined #freedesktop
<mupuf> bentiss: What's the plan for re-enabling gitlab? Do you want to bring back the web UI, then the runners, then marge last once the setup has been proven to work?
<mupuf> Do you want to wait for fastly first?
fomys_ has joined #freedesktop
<daniels> bentiss: it's looking good, thanks!
<bentiss> daniels: great!
<bentiss> mupuf: right now I'm still pulling the artifacts. 901 GB so far :(
<bentiss> I messed up that step in the preparation, I didn't realize the process was killed because of ENOMEM instead of having finished
<bentiss> so it's kind of first come, first served, but I really hope to have fastly upgraded tonight, so we can start using it
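What bentiss describes is essentially a bucket-to-bucket copy of the artifacts; a minimal sketch with rclone, where the remote and bucket names are made up and the flags (real rclone options) bound the parallel checkers/transfers so the process is less likely to be OOM-killed:

    # remote/bucket names are hypothetical; lower --checkers/--transfers to keep
    # memory use down, a small --s3-chunk-size keeps multipart buffers small
    rclone sync equinix-s3:gitlab-artifacts hetzner-s3:gitlab-artifacts \
        --checkers 8 --transfers 8 --s3-chunk-size 16M --progress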
<bentiss> and we cannot bring up the rest of the services because most of them use gitlab as their OIDC provider :(
guludo has joined #freedesktop
<mupuf> bentiss: oh, 901GB out of?
<daniels> mupuf: lol.
<mupuf> daniels: That's a nice oopsie, indeed :D
swatish2 has quit [Ping timeout: 480 seconds]
<bentiss> mupuf: artifacts and job logs
<mupuf> bentiss: Sure, but I meant how much is there to copy?
<bentiss> mupuf: good question, I have no idea
<mupuf> ha ha, ok
<bentiss> I have a rough idea of how much artifacts we have, but knowing which ones are more recent than a year is almost impossible
<mupuf> ok, and how many artifacts do we have as a whole? That would be an upper bound
<bentiss> all I know is that before that run which lasted for the past 1d8h30m, I had ~408000 files in the bucket. This run checked ~395000 of those, so this is an indication we are getting closer to the end. But nothing is guaranteed
<bentiss> mupuf: 25 TB
<bentiss> we are at 1274552 files total, and mesa over the past year made 27385 pipelines. Not sure how many jobs a mesa pipeline has, but that should give an idea of how many files we need to have at least
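For totals like the ones quoted above, rclone can report the object count and total size of a bucket directly (remote/bucket names again hypothetical):

    # object count and total size of the artifacts bucket
    rclone size equinix-s3:gitlab-artifacts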
<bentiss> we can also decide to just drop the ball, and ignore the artifacts that are not pulled and I re-enable gitlab right now so we can continue on the migration of the services
<mupuf> bentiss: did you upload the artifacts from newest to oldest?
<bentiss> mupuf: I can't control that
<mupuf> ack
<mupuf> I would prefer waiting for the artifacts to be back before re-enabling the runners... but not everything is under your control
<mupuf> so, if we re-enable gitlab, we may have some jobs that will start running and fail due to missing artifacts (not the end of the world, really). Should we fear something worse than that?
guludo has quit [Ping timeout: 480 seconds]
<bentiss> mupuf: no jobs were running when I stopped it
<daniels> mupuf: that'd only affect jobs which need to pull artifacts from old pipelines, which is very few of them - I think the biggest issue would be needing to re-run pipelines for everything hosting pages
<bentiss> daniels: pages are on a different bucket
<bentiss> so they should be fine
<bentiss> the sidekiq job takes the artifact and uploads it to pages
<mupuf> bentiss: ack, then I guess there's no need to wait for all the artifacts. When you have nothing else you can do, then I would vote for you to re-enable gitlab without fastly
TrinitronX is now known as Guest11641
<mupuf> Then we need to add a banner to link to the migration page and say that CI is still disabled
TrinitronX has joined #freedesktop
<mupuf> and that we recommend against merging MRs
Guest11641 has quit [Ping timeout: 480 seconds]
<bentiss> daniels: ^^ your opinion on this?
* bentiss grumbles a bit because he shut down the regular ingress to only accept fastly in the cluster :/
<mupuf> hehe. What's this fastly upgrade you are waiting for?
<mupuf> is that so that we don't pay?
<daniels> I think for Fastly we should wait a bit later into US time to see if Karen is able to sign; for the artifacts I'm completely fine dropping the old ones for now and focusing on the registry instead
<daniels> but I don't think there's much sense to bring it up now when people can't use it for most things
<bentiss> it would be useful for external projects making use of ci-templates (gnome and red hat gitlab)
<mupuf> bentiss: we can't make the whole instance read only, right?
<mupuf> that would be perfect
<bentiss> for the registry, we need fastly, so I'm thinking we should just wait for the account, and then start migrating in the background
<bentiss> mupuf: it's a pain in the ass to do
<mupuf> then nevermind
<mupuf> daniels: any luck with the runners?
<daniels> mupuf: working on that today
<daniels> bentiss: are they needing the registry image too, or just the repo?
<bentiss> daniels: just the repo normally, we push the externally used images to quay
* mupuf should consider doing the same for ci-tron :D
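Mirroring an externally used image out to quay, as mentioned above, is typically a skopeo copy; the source and destination paths here are only illustrative:

    # copy every architecture of a hypothetical ci-templates image to quay
    skopeo copy --all \
        docker://registry.freedesktop.org/freedesktop/ci-templates/fedora:latest \
        docker://quay.io/freedesktop/ci-templates-fedora:latest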
swatish2 has joined #freedesktop
<mupuf> daniels: so, you would vote for gitlab to be back up only when all the services are ready, right?
<mupuf> So, the TODO list would be: artifacts, registry, runners?
<mupuf> I am voting for: when users can interact with gitlab, but CI is still down (TODO: registry, runners. Artifacts migration can be finished in the background as far as I am concerned)
<bentiss> nah, registry can be done in the background. Moving 10~12 TB of images will not happen overnight.
<mupuf> but we can't re-enable CI until the registry is migrated, can we?
<bentiss> and same for runners: to be able to test them, we need gitlab to be up, so we can rely on the equinix ones for a week or two
<bentiss> the registry doesn't depend on gitlab, it's a separate item
<bentiss> the registry uses oidc AFAIU, so as long as gitlab.fd.o is available, you can pull/push data even if they are not colocated in the same dc
<bentiss> (I might be wrong)
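The flow bentiss is describing is GitLab's registry token auth: the client fetches a short-lived JWT from gitlab.freedesktop.org and presents it to the registry, so the two don't need to live in the same DC. A rough sketch with curl, where the registry hostname and repository scope are assumptions and jq is only used to extract the token:

    # 1. ask gitlab for a registry token (GitLab serves these at /jwt/auth)
    TOKEN=$(curl -su "$GITLAB_USER:$GITLAB_TOKEN" \
        "https://gitlab.freedesktop.org/jwt/auth?service=container_registry&scope=repository:mesa/mesa:pull" \
        | jq -r .token)
    # 2. present that token to the registry itself, wherever it is hosted
    curl -sH "Authorization: Bearer $TOKEN" \
        "https://registry.freedesktop.org/v2/mesa/mesa/tags/list"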
<bentiss> but I think it would make sense to bring things back up with a big notice that it's still not done, and that things might not be happy for the rest of the week
* mupuf agrees with that. Just need to figure out a concise way of what we want to communicate
<bentiss> "Migration is not done, please expect hiccups and unannounced shutdowns in case tests are not good. See maintenance.gitlab.freedesktop.org for the tracker of the rest"
<mupuf> git: OK. Comments: OK. Artifacts: partial. CI: WIP (registry: ???. runners: ???)
<bentiss> something along those lines
<bentiss> runners are still good, they haven't migrated yet
<mupuf> can we say that we do not expect data loss?
<mupuf> bentiss: allegedly still good. I'm sure there will be some fun there.
<bentiss> data loss, as long as you don't look for artifacts, this should be good
<mupuf> right, but I meant: stuff that you comment on or push is unlikely to get removed in a rollback
<mupuf> We could phrase it like: feel free to push trees, and write comments
<mupuf> actually, let's start simple: Consider gitlab as read only until services are tested and marked as good
<mupuf> then we add in maintenance a user-centric feature list: whatever we consider solid-enough for production use, we add a green tick
<mupuf> or we update the banner, but it would make it quite big
<bentiss> mupuf: last time we split the db, put back in prod and it was not good. So I won't guarantee a "stuff that you comment on or push is unlikely to get removed in a rollback"
<bentiss> making it considered as read only is better
<hakzsam> you could also wait another day before re-enabling it (if you have doubts). People are expecting one week off anyways
<mupuf> hakzsam: waiting won't help us know
<mupuf> bentiss: would it be simple for you to add an htaccess on gitlab.freedesktop.org before re-enabling it?
<bentiss> anyway, my rclone sync process is getting killed
<hakzsam> mupuf: fair enough :)
<mupuf> this way we could at least test it a bit before opening the flood gates
<emersion> i'd also be in favor of bringing gitlab back up even if CI runners are off
<emersion> (read-only is useful, issues are useful, and some projects use their own runners)
<bentiss> but the runners *are* working, it's just that they are not in the correct place :)
swatish2 has quit [Ping timeout: 480 seconds]
<MrCooper> can you suspend/disable them until they're in the right place?
<bentiss> sure
<bentiss> but this will not prevent custom runners from personal projects
<MrCooper> seems fine?
<mupuf> yeah, let's disable the runners for now
<mupuf> custom runners, whatever ;)
<bentiss> for the adventurous people, use 138.199.132.39 in your /etc/hosts as gitlab.freedesktop.org and report, please
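For anyone trying that, the /etc/hosts override looks like this (the IP is the one given above; drop the line again once DNS is switched over):

    # temporary override while gitlab.freedesktop.org DNS still points elsewhere
    138.199.132.39  gitlab.freedesktop.org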
<bilboed> really minor detail : The reverse dns isn't configured properly
<bilboed> (i.e. pinging e.g. ssh.gitlab.freedesktop.org returns the default your-server.de hostname)
<bentiss> bilboed: the dns isn't configured *at all*
<MrCooper> successfully logged in and loaded a couple of issues
<bilboed> lol, was already logged in it seems, retained sessions
<bilboed> did some git fetch, worked fine
<bilboed> looking through issues/mr also seems fine
<MrCooper> (it's very snappy, if only there were always this few users contending for resources :)
guludo has joined #freedesktop
<bentiss> damn, permission denied on some gitaly pods
<bilboed> ah, pipeline traces seem to be gone
swatish2 has joined #freedesktop
<bilboed> nvm, was reading in the wrong place
<bentiss> yeah pipeline traces is the biggest issue
<bilboed> traces are on s3 I imagine ?
<bilboed> (some are present, some aren't)
<bentiss> bilboed: yeah, that's the thing I was trying to pull for the past couple of days that just blew up
<bilboed> makes sense. Everything else seems fine (even checked user information, blames, etc...)
<bilboed> spoke too quickly, doesn't seem to be able to load user activity (spinner for ever)
<bilboed> oh, loads after a page refresh
<bentiss> we need to wait a bit before opening this up: ~11000 background jobs in the queue
<bilboed> OUCH :D
<bentiss> at least it's getting down (slowly)
* bentiss starts a bigger number of sidekiq pods, now that we have more room
<bentiss> though they do not seem to spin up :(
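With the GitLab Helm chart the sidekiq workers are a regular Deployment, so adding pods is just a scale operation; the namespace and deployment name below are assumptions based on the chart's usual defaults:

    # bump the sidekiq worker count and watch the pods come up (or not)
    kubectl -n gitlab scale deployment gitlab-sidekiq-all-in-1-v2 --replicas=8
    kubectl -n gitlab get pods -l app=sidekiq -w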
<dwt> All looks ok to me, signed in via google acct, sign-in security notification email worked, git-over-ssh works
swatish2 has quit [Ping timeout: 480 seconds]
<jrayhawk> Why did the apache2 service on annarchy get deactivated?
<bilboed> aaaah, the ssh-git address changed
<bentiss> bilboed: ATM both are still working
<bilboed> right
<bilboed> hmm... we'll need some big fat warnings everywhere regarding that
<bentiss> yeah, this will be required for fastly
<bentiss> we can't use port 22 for them
<bentiss> https://gitlab.freedesktop.org/mesa/mesa/container_registry -> I guess we need to sort out the registry somehow
<daniels> mm, and I wonder if it's worth using the window to move to the new one?
<bentiss> daniels: we are already on the new registry
<daniels> hm, I wonder why it's throwing banners about moving
<bentiss> we have online gc running for more than a year now
<bentiss> because it's dumb :)
<bentiss> we are definitely having issues connecting to the db
<bentiss> the pods are stuck in connecting
<bentiss> if any admin wants to change the new banner, feel free to do so
<mupuf> bentiss: it's looking ok
<bentiss> so... it seems the load balancer is not happy with having the leader as one of the load balancers
<bentiss> which means switching the db leader is going to be fun
<bentiss> DNS has been updated to point at hetzner
<mupuf> bentiss: \o/
<mupuf> git seems to work well too
<bentiss> but now the maintenance page has the wrong DNS entry :)
<bentiss> hmm I am pretty sure I bumped the number of allowed connections to postgres
<mupuf> we are using a self hosted postgres, or a managed one?
<bentiss> self hosted, but on 3 bare metal
<bentiss> (dedicated)
<mupuf> ack
<mupuf> thanks
<karolherbst> it's so fast...
<karolherbst> I hope it stays that way
<bentiss> currently I'm getting too many concurrent accesses to the db, but I think I found the issue
swatish2 has joined #freedesktop
<bentiss> and that fixed it
<karolherbst> oh no.. it seems like it's not fast anymore :'(
<karolherbst> maybe I tried at the time where the bots didn't start hammering
hellfire7734club[m] has joined #freedesktop
<mupuf> karolherbst: seems fine here
yusmatvei25 has joined #freedesktop
swatish2 has quit [Ping timeout: 480 seconds]
<karolherbst> mhhh looks like initial connection is slow
<karolherbst> but yeah.. comments are loading real quickly
TrinitronX is now known as Guest11653
TrinitronX has joined #freedesktop
Guest11653 has quit [Ping timeout: 480 seconds]
<bilboed> getting 502s
<mupuf> probably too few DB connections again
<bentiss> nah, I've changed the settings, and now the connections are OK; the 502 means kubernetes is killing the pods because the readiness check fails
<bentiss> and not sure why this is happening
georgc has quit [Quit: Leaving]
gchini has joined #freedesktop
agd5f_ has quit []
agd5f has joined #freedesktop
<bentiss> could be that we are getting hammered
<bilboed> ddos already ? :)
<bilboed> oh, didn't realize you updated the DNS. That would make sense indeed
<bilboed> yah, even a non-complex page (like /help/) takes forever
<bentiss> it could also be that the registry not being there makes the internal requests stall too much
<bilboed> fwiw, not logged in seems to respond quickly
<bilboed> (ish)
mripard has quit [Quit: WeeChat 4.5.1]
<bentiss> I've ended up simply disabling the liveness/readiness checks from kubernetes, and this seems much better
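One way to do that on a running deployment is a JSON patch that strips the probes; the namespace, deployment name and container index are assumptions based on the GitLab chart defaults:

    # remove liveness/readiness probes from the webservice container
    kubectl -n gitlab patch deployment gitlab-webservice-default --type=json -p='[
      {"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"},
      {"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}
    ]'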
swatish2 has joined #freedesktop
<bentiss> I have a strong feeling not being able to ping the registry is not helping gitlab to return healthy
<bentiss> which means: we need to get the registry on Hetzner ASAP
<bentiss> sigh... doesn't work
<mupuf> bentiss: oddly enough, my runners have access to the registry
* mupuf will disable his scheduled pipelines
<bentiss> mupuf the registry is still hosted on equinix, and the DNS points at it correctly. The problem is that when gitlab tries to access it, it was using an internal name. I've fixed it to use the public DNS entry, we'll see if that helps
<mupuf> let's cross fingers
swatish2 has quit [Ping timeout: 480 seconds]
<bentiss> holy cow: in a little more than 2 hours, we already had 59 GB of outgoing traffic
fomys_ has quit []
<Ford_Prefect> wow
<Ford_Prefect> Are pages only expected to work after the S3 sync is complete?
<mupuf> bentiss: could it be partly explained by the rclone?
<bentiss> mupuf: no, rclone is directly poking at the S3 server, not gitlab
<Ford_Prefect> oh no, pages are up, let's see what's going on with pipewire.org
<bentiss> Ford_Prefect: pipewire might complain, I needed the DNS to propagate before requesting the certificates
<bentiss> let me fix that now
<emersion> pages give a 502 (but maybe that's expected)
<Ford_Prefect> ah, I was wondering if we needed to update DNS on the PipeWire side
<bentiss> Ford_Prefect: in theory no, I'd rather not have to do this once again
<mupuf> bentiss: ack, thanks!
<mupuf> don't forget to take a break!
<Ford_Prefect> I think I misunderstood the setup -- the DNS seems okay, so likely propagation + ability to update certs should be it
<mupuf> as in, call it a day
<Ford_Prefect> Whatever you did worked now :)
AbleBacon has joined #freedesktop
<bentiss> yeah, cert-manager was waiting for modemmanager.org, which is the only one which needs manual updating
<bentiss> the webservice pods are just getting killed over and over
<bentiss> this reminds me a lot of last time, when we did the db split
<bentiss> and I really don't like this feeling
<bentiss> well, I checked the parameters from the old db, and we were at 500 simultaneous connections. Here I was setting 1000 with 2 pools of 450 (main + ci). Trying to pimp up the settings ATM
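For reference, the server-side half of that tuning is the postgres max_connections cap (the two pools are GitLab's own main/ci connection pools layered on top); a minimal sketch, assuming direct psql access to the primary:

    # raise the cap; max_connections only takes effect after a restart
    psql -U postgres -c "ALTER SYSTEM SET max_connections = 1000;"
    # sanity-check what is configured and how many connections are in use
    psql -U postgres -c "SHOW max_connections;"
    psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"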
dcunit3d has joined #freedesktop
krei-se has quit [Read error: Connection reset by peer]
krei-se has joined #freedesktop
<mupuf> pimp my pgsql 😅
krei-se- has joined #freedesktop
krei-se has quit [Ping timeout: 480 seconds]
jsa1 has quit [Ping timeout: 480 seconds]
krei-se has joined #freedesktop
<bentiss> I don't know. I give up for today, it's been a long day. I'll come back tomorrow I think
krei-se- has quit [Ping timeout: 480 seconds]
krei-se- has joined #freedesktop
krei-se has quit [Ping timeout: 480 seconds]
<mupuf> bentiss: sounds like a sane plan
krei-se has joined #freedesktop
krei-se- has quit [Ping timeout: 480 seconds]
<austriancoder> ufff.. gitlab can be that fast
guludo has quit [Ping timeout: 480 seconds]
<bentiss> heh, I just had an epiphany: let's have moaar webservice pods... this seems to do the trick, even if they are getting killed, we still have enough for new connections
<bentiss> and I spoke too fast: only 1 webservice pod is healthy out of the 15
<bentiss> I think I found one culprit: on the 2 db replicas, the load average is respectively 35 and 50 on a 12 vcore machine
<bentiss> on the master, it's only 7-8
<bentiss> so yeah, we might need beefier machines :(
<bentiss> well, we should wait for fastly, this might kick out the bots and maybe reduce the load
<eric_engestrom> bentiss: we love you and everything you've done, and it's ok that it's not fully working yet; log off and come back tomorrow :)
<bentiss> well, dinner time here, so yeah, going AFK
<MrCooper> metux is already pushing churn to xserver again :(
tzimmermann has quit [Quit: Leaving]
<mupuf> MrCooper: he is accelerating the decadence of X, I guess
krei-se- has joined #freedesktop
krei-se has quit [Ping timeout: 480 seconds]
<ofourdan> :(
haaninjo has joined #freedesktop
<Xe_> i gotta say, i love your maintenance page
<Xe_> 10/10
Xe_ is now known as Xe
ximion has joined #freedesktop
infernix has quit [Quit: ZNC - http://znc.sourceforge.net]
infernix has joined #freedesktop
<alanc> huh, I didn't get any email about new xserver MR's yet - is email not yet turned on for the new gitlab servers?
guludo has joined #freedesktop
<Ford_Prefect> I see at least 2 issue emails from today
<Ford_Prefect> RSS seems to be lagging though
JanC is now known as Guest11670
JanC has joined #freedesktop
Guest11670 has quit [Ping timeout: 480 seconds]
<alanc> huh, I wonder if the work spam filters are discarding the mails as coming from new IP addresses
sima has quit [Ping timeout: 480 seconds]
yusmatvei25 has quit []
mvlad has quit [Remote host closed the connection]
JanC is now known as Guest11675
JanC has joined #freedesktop
Guest11675 has quit [Ping timeout: 480 seconds]
sima has joined #freedesktop
sima has quit [Ping timeout: 480 seconds]
guludo has quit [Quit: WeeChat 4.5.2]
haaninjo has quit [Quit: Ex-Chat]