ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
Guest11598 has quit [Ping timeout: 480 seconds]
krastevm has joined #freedesktop
haaninjo has quit [Quit: Ex-Chat]
martink has quit [Ping timeout: 480 seconds]
blu has joined #freedesktop
krastevm has quit [Ping timeout: 480 seconds]
gachikuku has joined #freedesktop
scrumplex has joined #freedesktop
strugee has quit [Quit: ZNC - http://znc.in]
scrumplex_ has quit [Ping timeout: 480 seconds]
strugee has joined #freedesktop
kem has left #freedesktop [Leaving]
JanC is now known as Guest11606
JanC has joined #freedesktop
Guest11606 has quit [Ping timeout: 480 seconds]
kode54 has quit [Quit: The Lounge - https://thelounge.chat]
kode54 has joined #freedesktop
ximion1 has quit [Remote host closed the connection]
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
qaqland has joined #freedesktop
eluks has quit [Remote host closed the connection]
eluks has joined #freedesktop
swatish2 has joined #freedesktop
swatish2 has quit [Ping timeout: 480 seconds]
swatish2 has joined #freedesktop
gnuiyl has quit [Remote host closed the connection]
gnuiyl has joined #freedesktop
sghuge has quit [Remote host closed the connection]
sghuge has joined #freedesktop
mvlad has joined #freedesktop
sima has joined #freedesktop
swatish2 has quit [Ping timeout: 480 seconds]
jsa1 has joined #freedesktop
swatish2 has joined #freedesktop
tzimmermann has joined #freedesktop
mripard has joined #freedesktop
swatish2 has quit [Ping timeout: 480 seconds]
ximion has joined #freedesktop
swatish2 has joined #freedesktop
ximion has quit [Remote host closed the connection]
karolherbst8 has joined #freedesktop
karolherbst has quit [Read error: Connection reset by peer]
AbleBacon has quit [Read error: Connection reset by peer]
<bentiss> daniels, MrCooper or anyone using ci-stats-grafana.fd.o: I've migrated the ci-stats namespace, can someone check that it's correct?
<bentiss> (of course, signing in with gitlab won't work)
<MrCooper> haven't really used grafana in a long time
<bentiss> at least from the public view I can still see the data :)
<bentiss> looks like shutting down ci-stats-influxdb2 on equinix made everything disappear :)
<bentiss> DNS issue I would say
<bentiss> there we go
JerryXiao has quit [Remote host closed the connection]
JerryXiao has joined #freedesktop
karolherbst8 has quit []
karolherbst has joined #freedesktop
haaninjo has joined #freedesktop
haaninjo has quit [Quit: Ex-Chat]
qaqland has quit [Remote host closed the connection]
qaqland has joined #freedesktop
<mupuf> bentiss: What's the plan for re-enabling gitlab? Do you want to bring back the web UI, then the runners, then marge last once the setup has been proven to work?
<mupuf> Do you want to wait for fastly first?
fomys_ has joined #freedesktop
<daniels> bentiss: it's looking good, thanks!
<bentiss> daniels: great!
<bentiss> mupuf: right now I'm still pulling the artifacts. 901 GB so far :(
<bentiss> I messed up that step in the preparation, I didn't realize the process was killed because of ENOMEM instead of having finished
<bentiss> so it's kind of first come, first served, but I really hope to have fastly upgraded tonight, so we can start using it
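What bentiss describes is essentially a bucket-to-bucket copy of the artifacts; a minimal sketch with rclone, where the remote and bucket names are made up and the flags (real rclone options) bound the parallel checkers/transfers so the process is less likely to be OOM-killed:

    # remote/bucket names are hypothetical; lower --checkers/--transfers to keep
    # memory use down, a small --s3-chunk-size keeps multipart buffers small
    rclone sync equinix-s3:gitlab-artifacts hetzner-s3:gitlab-artifacts \
        --checkers 8 --transfers 8 --s3-chunk-size 16M --progress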
<bentiss> and we cannot bring up the rest of the services because most of them use gitlab as their OIDC provider :(
guludo has joined #freedesktop
<mupuf> bentiss: oh, 901GB out of?
<daniels> mupuf: lol.
<mupuf> daniels: That's a nice oopsie, indeed :D
swatish2 has quit [Ping timeout: 480 seconds]
<bentiss> mupuf: artifacts and job logs
<mupuf> bentiss: Sure, but I meant how much is there to copy?
<bentiss> mupuf: good question, I have no idea
<mupuf> ha ha, ok
<bentiss> I have a rough idea of how much artifacts we have, but knowing which ones are more recent than a year is almost impossible
<mupuf> ok, and how many artifacts do we have as a whole? That would be an upper bound
<bentiss> all I know is that before that run which lasted for the past 1d8h30m, I had ~408000 files in the bucket. This run checked ~395000 of those, so this is an indication we are getting closer to the end. But nothing is guaranteed
<bentiss> mupuf: 25 TB
<bentiss> we are at 1274552 files total, and mesa over the past year made 27385 pipelines. Not sure how many jobs a mesa pipeline has, but that should give an idea of how many files we need to have at least
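For totals like the ones quoted above, rclone can report the object count and total size of a bucket directly (remote/bucket names again hypothetical):

    # object count and total size of the artifacts bucket
    rclone size equinix-s3:gitlab-artifacts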
<bentiss> we can also decide to just drop the ball, and ignore the artifacts that are not pulled and I re-enable gitlab right now so we can continue on the migration of the services
<mupuf> bentiss: did you upload the artifacts from newest to oldest?
<bentiss> mupuf: I can't control that
<mupuf> ack
<mupuf> I would prefer waiting for the artifacts to be back before re-enabling the runners... but not everything is under your control
<mupuf> so, if we re-enable gitlab, we may have some jobs that will start running and fail due to missing artifacts (not the end of the world, really). Should we fear something worse than that?
guludo has quit [Ping timeout: 480 seconds]
<bentiss> mupuf: no jobs were running when I stopped it
<daniels> mupuf: that'd only affect jobs which need to pull artifacts from old pipelines, which is very few of them - I think the biggest issue would be needing to re-run pipelines for everything hosting pages
<bentiss> daniels: pages are on a different bucket
<bentiss> so they should be fine
<bentiss> the sidekiq job takes the artifact and uploads it to pages
<mupuf> bentiss: ack, then I guess there's no need to wait for all the artifacts. When you have nothing else you can do, then I would vote for you to re-enable gitlab without fastly
TrinitronX is now known as Guest11641
<mupuf> Then we need to add a banner to link to the migration page and say that CI is still disabled
TrinitronX has joined #freedesktop
<mupuf> and that we recommend against merging MRs
Guest11641 has quit [Ping timeout: 480 seconds]
<bentiss> daniels: ^^ your opinion on this?
* bentiss grumbles a bit because he shut down the regular ingress to only accept fastly in the cluster :/
<mupuf> hehe. What's this fastly upgrade you are waiting for?
<mupuf> is that so that we don't pay?
<daniels> I think for Fastly we should wait a bit later into US time to see if Karen is able to sign; for the artifacts I'm completely fine dropping the old ones for now and focusing on the registry instead
<daniels> but I don't think there's much sense to bring it up now when people can't use it for most things
<bentiss> it would be useful for external projects making use of ci-templates (gnome and red hat gitlab)
<mupuf> bentiss: we can't make the whole instance read only, right?
<mupuf> that would be perfect
<bentiss> for the registry, we need fastly, so I'm thinking we should just wait for the account, and then start migrating in the background
<bentiss> mupuf: it's a pain in the ass to do
<mupuf> then nevermind
<mupuf> daniels: any luck with the runners?
<daniels> mupuf: working on that today
<daniels> bentiss: are they needing the registry image too, or just the repo?
<bentiss> daniels: just the repo normally, we push the externally used images to quay
* mupuf should consider doing the same for ci-tron :D
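Mirroring an externally used image out to quay, as mentioned above, is typically a skopeo copy; the source and destination paths here are only illustrative:

    # copy every architecture of a hypothetical ci-templates image to quay
    skopeo copy --all \
        docker://registry.freedesktop.org/freedesktop/ci-templates/fedora:latest \
        docker://quay.io/freedesktop/ci-templates-fedora:latest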
swatish2 has joined #freedesktop
<mupuf> daniels: so, you would vote for gitlab to be back up only when all the services are ready, right?
<mupuf> So, the TODO list would be: artifacts, registry, runners?
<mupuf> I am voting for: when users can interact with gitlab, but CI is still down (TODO: registry, runners. Artifacts migration can be finished in the background as far as I am concerned)
<bentiss> nah, registry can be done in the background. Moving 10~12 TB of images will not happen overnight.
<mupuf> but we can't re-enable CI until the registry is migrated, can we?
<bentiss> and same for runners: to be able to test them, we need gitlab to be up, so we can rely on the equinix ones for a week or two
<bentiss> the registry doesn't depend on gitlab, it's a separate item
<bentiss> the registry uses oidc AFAIU, so as long as gitlab.fd.o is available, you can pull/push data even if they are not colocated in the same dc
<bentiss> (I might be wrong)
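The flow bentiss is describing is GitLab's registry token auth: the client fetches a short-lived JWT from gitlab.freedesktop.org and presents it to the registry, so the two don't need to live in the same DC. A rough sketch with curl, where the registry hostname and repository scope are assumptions and jq is only used to extract the token:

    # 1. ask gitlab for a registry token (GitLab serves these at /jwt/auth)
    TOKEN=$(curl -su "$GITLAB_USER:$GITLAB_TOKEN" \
        "https://gitlab.freedesktop.org/jwt/auth?service=container_registry&scope=repository:mesa/mesa:pull" \
        | jq -r .token)
    # 2. present that token to the registry itself, wherever it is hosted
    curl -sH "Authorization: Bearer $TOKEN" \
        "https://registry.freedesktop.org/v2/mesa/mesa/tags/list"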
<bentiss> but I think it would make sense to bring things back up with a big notice that it's still not done, and that things might not be happy for the rest of the week
* mupuf agrees with that. Just need to figure out a concise way of what we want to communicate
<bentiss> "Migration is not done, please expect hiccups and unannounced shutdowns in case tests are not good. See maintenance.gitlab.freedesktop.org for the tracker of the rest"
<mupuf> git: OK. Comments: OK. Artifacts: partial. CI: WIP (registry: ???. runners: ???)
<bentiss> something along those lines
<bentiss> runners are still good, they haven't migrated yet
<mupuf> can we say that we do not expect data loss?
<mupuf> bentiss: allegedly still good. I'm sure there will be some fun there.
<bentiss> data loss, as long as you don't look for artifacts, this should be good
<mupuf> right, but I meant: stuff that you comment on or push is unlikely to get removed in a rollback
<mupuf> We could phrase it like: feel free to push trees, and write comments
<mupuf> actually, let's start simple: Consider gitlab as read only until services are tested and marked as good
<mupuf> then we add in maintenance a user-centric feature list: whatever we consider solid-enough for production use, we add a green tick
<mupuf> or we update the banner, but it would make it quite big
<bentiss> mupuf: last time we split the db, put back in prod and it was not good. So I won't guarantee a "stuff that you comment on or push is unlikely to get removed in a rollback"
<bentiss> making it considered as read only is better
<hakzsam> you could also wait another day before re-enabling it (if you have doubts). People are expecting one week off anyways
<mupuf> hakzsam: waiting won't help us know
<mupuf> bentiss: would it be simple for you to add an htaccess on gitlab.freedesktop.org before re-enabling it?
<bentiss> anyway, my rclone sync process is getting killed
<hakzsam> mupuf: fair enough :)
<mupuf> this way we could at least test it a bit before opening the flood gates
<emersion> i'd also be in favor of bringing gitlab back up even if CI runners are off
<emersion> (read-only is useful, issues are useful, and some projects use their own runners)
<bentiss> but the runners *are* working, it's just that they are not in the correct place :)
swatish2 has quit [Ping timeout: 480 seconds]
<MrCooper> can you suspend/disable them until they're in the right place?
<bentiss> sure
<bentiss> but this will not prevent custom runners from personal projects
<MrCooper> seems fine?
<mupuf> yeah, let's disable the runners for now
<mupuf> custom runners, whatever ;)
<bentiss> for the adventurous people, use 138.199.132.39 in your /etc/hosts as gitlab.freedesktop.org and report, please
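For anyone trying that, the /etc/hosts override looks like this (the IP is the one given above; drop the line again once DNS is switched over):

    # temporary override while gitlab.freedesktop.org DNS still points elsewhere
    138.199.132.39  gitlab.freedesktop.org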
<bilboed> really minor detail : The reverse dns isn't configured properly
<bilboed> (i.e. pinging e.g. ssh.gitlab.freedesktop.org returns the default your-server.de hostname)
<bentiss> bilboed: the dns isn't configured *at all*
<MrCooper> successfully logged in and loaded a couple of issues
<bilboed> lol, was already logged in it seems, retained sessions
<bilboed> did some git fetch, worked fine
<bilboed> looking through issues/mr also seems fine
<MrCooper> (it's very snappy, if only there were always this few users contending for resources :)
guludo has joined #freedesktop
<bentiss> damn, permission denied on some gitaly pods
<bilboed> ah, pipeline traces seem to be gone
swatish2 has joined #freedesktop
<bilboed> nvm, was reading in the wrong place
<bentiss> yeah pipeline traces is the biggest issue
<bilboed> traces are on s3 I imagine ?
<bilboed> (some are present, some aren't)
<bentiss> bilboed: yeah, that's the thing I was trying to pull for the past couple of days that just blew up
<bilboed> makes sense. Everything else seems fine (even checked user information, blames, etc...)
<bilboed> spoke too quickly, doesn't seem to be able to load user activity (spinner for ever)
<bilboed> oh, loads after a page refresh
<bentiss> we need to wait a bit before opening this up: ~11000 background jobs in the queue
<bilboed> OUCH :D
<bentiss> at least it's getting down (slowly)
* bentiss starts a bigger number of sidekiq pods, now that we have more room
<bentiss> though they do not seem to spin up :(
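With the GitLab Helm chart the sidekiq workers are a regular Deployment, so adding pods is just a scale operation; the namespace and deployment name below are assumptions based on the chart's usual defaults:

    # bump the sidekiq worker count and watch the pods come up (or not)
    kubectl -n gitlab scale deployment gitlab-sidekiq-all-in-1-v2 --replicas=8
    kubectl -n gitlab get pods -l app=sidekiq -w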
<dwt> All looks ok to me, signed in via google acct, sign-in security notification email worked, git-over-ssh works
swatish2 has quit [Ping timeout: 480 seconds]
<jrayhawk> Why did the apache2 service on annarchy get deactivated?
<bilboed> aaaah, the ssh-git address changed
<bentiss> bilboed: ATM both are still working
<bilboed> right
<bilboed> hmm... we'll need some big fat warnings everywhere regarding that
<bentiss> yeah, this will be required for fastly
<bentiss> we can't use port 22 for them
<bentiss> https://gitlab.freedesktop.org/mesa/mesa/container_registry -> I guess we need to sort out the registry somehow
<daniels> mm, and I wonder if it's worth using the window to move to the new one?
<bentiss> daniels: we are already on the new registry
<daniels> hm, I wonder why it's throwing banners about moving
<bentiss> we have online gc running for more than a year now
<bentiss> because it's dumb :)
<bentiss> we are definitely having issues connecting to the db
<bentiss> the pods are stuck in connecting
<bentiss> if any admin wants to change the new banner, feel free to do so
<mupuf> bentiss: it's looking ok
<bentiss> so... it seems the load balancer is not happy with having the leader as one of the load balancers
<bentiss> which means switching the db leader is going to be fun
<bentiss> DNS has been updated to point at hetzner
<mupuf> bentiss: \o/
<mupuf> git seems to work well too
<bentiss> but now the maintenance page has the wrong DNS entry :)
<bentiss> hmm I am pretty sure I bumped the number of allowed connections to postgres
<mupuf> we are using a self hosted postgres, or a managed one?
<bentiss> self hosted, but on 3 bare metal
<bentiss> (dedicated)
<mupuf> ack
<mupuf> thanks
<karolherbst> it's so fast...
<karolherbst> I hope it stays that way
<bentiss> currently I'm getting too many concurrent accesses to the db, but I think I found the issue
swatish2 has joined #freedesktop
<bentiss> and that fixed it
<karolherbst> oh no.. it seems like it's not fast anymore :'(
<karolherbst> maybe I tried at the time where the bots didn't start hammering
hellfire7734club[m] has joined #freedesktop
<mupuf> karolherbst: seems fine here
yusmatvei25 has joined #freedesktop
swatish2 has quit [Ping timeout: 480 seconds]
<karolherbst> mhhh looks like initial connection is slow
<karolherbst> but yeah.. comments are loading real quickly
TrinitronX is now known as Guest11653
TrinitronX has joined #freedesktop
Guest11653 has quit [Ping timeout: 480 seconds]
<bilboed> getting 502s
<mupuf> probably too few DB connections again
<bentiss> nah, I've changed the settings, and now the connections are OK; the 502 means kubernetes is killing the pods because the readiness check fails
<bentiss> and not sure why this is happening
georgc has quit [Quit: Leaving]
gchini has joined #freedesktop
agd5f_ has quit []
agd5f has joined #freedesktop
<bentiss> could be that we are getting hammered
<bilboed> ddos already ? :)
<bilboed> oh, didn't realize you updated the DNS. That would make sense indeed
<bilboed> yah, even a non-complex page (like /help/) takes forever
<bentiss> it could also be that the registry not being there makes the internal requests stall too much
<bilboed> fwiw, not logged in seems to respond quickly
<bilboed> (ish)
mripard has quit [Quit: WeeChat 4.5.1]
<bentiss> I've ended up simply disabling the liveness/readiness checks from kubernetes, and this seems much better
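One way to do that on a running deployment is a JSON patch that strips the probes; the namespace, deployment name and container index are assumptions based on the GitLab chart defaults:

    # remove liveness/readiness probes from the webservice container
    kubectl -n gitlab patch deployment gitlab-webservice-default --type=json -p='[
      {"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"},
      {"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}
    ]'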
swatish2 has joined #freedesktop
<bentiss> I have a strong feeling not being able to ping the registry is not helping gitlab to return healthy
<bentiss> which means: we need to get the registry on Hetzner ASAP
<bentiss> sigh... doesn't work
<mupuf> bentiss: oddly enough, my runners have access to the registry
* mupuf will disable his scheduled pipelines
<bentiss> mupuf the registry is still hosted on equinix, and the DNS points at it correctly. The problem is that when gitlab tries to access it, it was using an internal name. I've fixed it to use the public DNS entry, we'll see if that helps
<mupuf> let's cross fingers
swatish2 has quit [Ping timeout: 480 seconds]
<bentiss> holy cow: in a little more than 2 hours, we already had 59 GB of outgoing traffic
fomys_ has quit []
<Ford_Prefect> wow
<Ford_Prefect> Are pages only expected to work after the S3 sync is complete?
<mupuf> bentiss: could it be partly explained by the rclone?
<bentiss> mupuf: no, rclone is directly poking at the S3 server, not gitlab
<Ford_Prefect> oh no, pages are up, let's see what's going on with pipewire.org
<bentiss> Ford_Prefect: pipewire might complain, I needed the DNS to propagate before requesting the certificates
<bentiss> let me fix that now
<emersion> pages give a 502 (but maybe that's expected)
<Ford_Prefect> ah, I was wondering if we needed to update DNS on the PipeWire side
<bentiss> Ford_Prefect: in theory no, I'd rather not have to do this once again
<mupuf> bentiss: ack, thanks!
<mupuf> don't forget to take a break!
<Ford_Prefect> I think I misunderstood the setup -- the DNS seems okay, so likely propagation + ability to update certs should be it
<mupuf> as in, call it a day
<Ford_Prefect> Whatever you did worked now :)
AbleBacon has joined #freedesktop
<bentiss> yeah, cert-manager was waiting for modemmanager.org, which is the only one which needs manual updating
<bentiss> the webservice pods are just getting killed over and over
<bentiss> this reminds me a lot of last time, when we did the db split
<bentiss> and I really don't like this feeling
<bentiss> well, I checked the parameters from the old db, and we were at 500 simultaneous connections. Here I was setting 1000 with 2 pools of 450 (main + ci). Trying to pimp up the settings ATM
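For reference, the server-side half of that tuning is the postgres max_connections cap (the two pools are GitLab's own main/ci connection pools layered on top); a minimal sketch, assuming direct psql access to the primary:

    # raise the cap; max_connections only takes effect after a restart
    psql -U postgres -c "ALTER SYSTEM SET max_connections = 1000;"
    # sanity-check what is configured and how many connections are in use
    psql -U postgres -c "SHOW max_connections;"
    psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"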
dcunit3d has joined #freedesktop
krei-se has quit [Read error: Connection reset by peer]
krei-se has joined #freedesktop
<mupuf> pimp my pgsql 😅
krei-se- has joined #freedesktop
krei-se has quit [Ping timeout: 480 seconds]
jsa1 has quit [Ping timeout: 480 seconds]
krei-se has joined #freedesktop
<bentiss> I don't know. I give up for today, it's been a long day. I'll come back tomorrow I think
krei-se- has quit [Ping timeout: 480 seconds]
krei-se- has joined #freedesktop
krei-se has quit [Ping timeout: 480 seconds]
<mupuf> bentiss: sounds like a sane plan
krei-se has joined #freedesktop
krei-se- has quit [Ping timeout: 480 seconds]
<austriancoder> ufff.. gitlab can be that fast
guludo has quit [Ping timeout: 480 seconds]
<bentiss> heh, I just had an epiphany: let's have moaar webservice pods... this seems to do the trick, even if they are getting killed, we still have enough for new connections
<bentiss> and I spoke too fast: only 1 webservice pod is healthy out of the 15
<bentiss> I think I found one culprit: on the 2 db replicas, the load average is respectively 35 and 50 on a 12 vcore machine
<bentiss> on the master, it's only 7-8
<bentiss> so yeah, we might need beefier machines :(
<bentiss> well, we should wait for fastly, this might kick out the bots and maybe reduce the load
<eric_engestrom> bentiss: we love you and everything you've done, and it's ok that it's not fully working yet; log off and come back tomorrow :)
<bentiss> well, dinner time here, so yeah, going AFK
<MrCooper> metux is already pushing churn to xserver again :(
tzimmermann has quit [Quit: Leaving]
<mupuf> MrCooper: he is accelerating the decadence of X, I guess
krei-se- has joined #freedesktop
krei-se has quit [Ping timeout: 480 seconds]
<ofourdan> :(
haaninjo has joined #freedesktop
<Xe_> i gotta say, i love your maintenance page
<Xe_> 10/10
Xe_ is now known as Xe
ximion has joined #freedesktop
infernix has quit [Quit: ZNC - http://znc.sourceforge.net]
infernix has joined #freedesktop
<alanc> huh, I didn't get any email about new xserver MR's yet - is email not yet turned on for the new gitlab servers?
guludo has joined #freedesktop
<Ford_Prefect> I see at least 2 issue emails from today
<Ford_Prefect> RSS seems to be lagging though
JanC is now known as Guest11670
JanC has joined #freedesktop
Guest11670 has quit [Ping timeout: 480 seconds]
<alanc> huh, I wonder if the work spam filters are discarding the mails as coming from new IP addresses
sima has quit [Ping timeout: 480 seconds]
yusmatvei25 has quit []
mvlad has quit [Remote host closed the connection]
JanC is now known as Guest11675
JanC has joined #freedesktop
Guest11675 has quit [Ping timeout: 480 seconds]
sima has joined #freedesktop
sima has quit [Ping timeout: 480 seconds]
guludo has quit [Quit: WeeChat 4.5.2]
haaninjo has quit [Quit: Ex-Chat]