#freedesktop on 2023-09-01 — irc logs at oftc.irclog.whitequark.org

2023-08-27 13:06 daniels changed the topic of #freedesktop to: GitLab is currently down for upgrade; will be a while before it's back || https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org

00:12 dcunit3d has quit [Ping timeout: 480 seconds]

00:26 gry has quit [Quit: leaving]

00:27 psykose has quit [Ping timeout: 480 seconds]

00:30 psykose has joined #freedesktop

00:53 columbarius has joined #freedesktop

00:54 co1umbarius has quit [Ping timeout: 480 seconds]

01:08 Kayden has joined #freedesktop

01:12 <karolherbst> random fail in the debian-arm64-asan job: https://gitlab.freedesktop.org/mesa/mesa/-/jobs/48335136 And I don't really see what's going wrong there

01:17 psykose has quit [Ping timeout: 480 seconds]

01:19 epony has quit [Ping timeout: 480 seconds]

01:22 psykose has joined #freedesktop

01:25 psykose has quit [Remote host closed the connection]

01:26 psykose has joined #freedesktop

02:16 psykose has quit [Ping timeout: 480 seconds]

02:20 psykose has joined #freedesktop

02:58 ximion has quit [Quit: Detached from the Matrix]

03:21 <mupuf> karolherbst: the gating script failed, and stopped execution

03:21 <mupuf> bentiss: another one ^

03:31 AbleBacon has quit [Read error: Connection reset by peer]

04:13 Leopold_ has quit []

04:14 bmodem has joined #freedesktop

04:18 Leopold_ has joined #freedesktop

04:28 keypresser86 has joined #freedesktop

04:46 bmodem has quit [Ping timeout: 480 seconds]

04:47 bmodem has joined #freedesktop

04:47 bmodem has quit [Excess Flood]

04:48 bmodem has joined #freedesktop

05:17 keypresser86 has quit []

05:23 tzimmermann has joined #freedesktop

06:00 epony has joined #freedesktop

06:03 sima has joined #freedesktop

06:04 epony has quit []

06:15 epony has joined #freedesktop

06:53 mvlad has joined #freedesktop

07:03 bmodem has quit [Ping timeout: 480 seconds]

07:21 <bentiss> mupuf: OK, well then, I think we need someone to give me the proper bash command to fail if curl has and which writes a proper error log instead of just calling 'sh'

07:21 <bentiss> (FWIW taking the day off today, so spotty availability)

07:22 <mupuf> bentiss: sure! Enjoy your day off :)

07:23 <epony> off joy your day course

08:01 Ahuj has joined #freedesktop

08:04 bmodem has joined #freedesktop

08:24 <alatiera> the equinox runners seem to be often timing out or closing the connection to gitlab

08:26 <alatiera> they are hosted on the same place as gitlab itself iirc right?

08:44 feto_bastardo0 has joined #freedesktop

08:47 epony has quit [autokilled: This host violated network policy. Mail support@oftc.net if you think this is in error. (2023-09-01 08:47:19)]

08:50 feto_bastardo has quit [Ping timeout: 480 seconds]

10:07 mceier has quit [Quit: leaving]

10:09 mceier has joined #freedesktop

10:12 serjosna has joined #freedesktop

10:12 serjosna was kicked from #freedesktop by ChanServ [You are not permitted on this channel]

10:13 <tpalli> my ci pipeline passes but marge keeps saying it takes too long so merge fails

10:15 <tpalli> I'm wondering if the timeout should be bumped a bit or maybe this is just temporary issue

10:29 <mupuf> tpalli: yeah, that's the age-old issue of runners not prioritizing jobs from marge

10:46 mceier has quit [Quit: leaving]

10:48 mceier has joined #freedesktop

10:59 <karolherbst> yeah.. two days ago I had a pipeline succeeding after 57 minutes :')

11:01 <karolherbst> though I think it might be fine to bump the timeout for the general pipeline

11:01 <karolherbst> it's obviously worse to fail the entire thing just because the ordering got bad

11:01 <karolherbst> as this just means it will run the entire thing once again

11:01 <karolherbst> and make the problem just worse

11:27 <koike> mupuf: any pointers about how to fix this issue about runners prioritizing jobs from marge?

11:28 <mupuf> koike: no, but the issue is simple. Runners poll gitlab for jobs, and gitlab serves the oldest job first. No way around this.

11:30 <mupuf> A possible workaround would be to register more runners in gitlab than we have in real life, but that requires controlling when the gitlab-runner is polling to get a job (we only want to poll for a job from the low-priority queue if there are no jobs in the high-priority queue)

11:30 <mupuf> projects would set the priority they want using runner tags

11:32 <koike> mupuf: you mean dedicated tags (dedicated runners) for Marge?

11:32 <mupuf> The least amount of work to get there would be to add support in gitlab-runner for a one-shot operation: Poll gitlab once for a job, and if there was nothing, exit. If there was a job, exit when it is done

11:32 <mupuf> dedicated tags for Marge yes. Doesn't mean dedicated **hardware** runners though

11:34 <mupuf> as long as you can make sure that only one gitlab runner is active at all time per hardware runner

11:34 <koike> mupuf: hmm, I'm not very familiar how gitlab-runners work, I thought it was one-shot operation. Does a gitlab-runner gets a list of jobs and has its own queue?

11:35 <koike> mupuf: regarding the dedicated tags for Marge, can it be dynamic? Or Marge would need to apply a patch on top of the tree to change the tag?

11:36 <mupuf> koike: no, runners just poll gitlab at fixed intervals to get *1* job

11:37 <mupuf> gitlab is responsible for chosing which job is the most appropriate

11:37 killpid_ has joined #freedesktop

11:37 <mupuf> and guess what policy it is: oldest first

11:39 <mupuf> the issue with the current implementation of the gitlab runner is that we don't control when polling happens

11:39 <mupuf> you can't just expose 2 runners with different tags. Otherwise gitlab will think that it has access to two runners... when they are in fact backed by the same machine

11:40 <koike> any ideia on how hard would it be to add a new policy to gitlab ci?

11:41 <mupuf> that would be a major undertaking

11:41 <mupuf> what I think we need to do on the gitlab side is: add 2 new rest endpoints

11:41 <mupuf> the first one to list all the jobs a runner may take

11:41 <mupuf> and the second is to tell "I want to run this job"

11:42 <mupuf> this way, no need to find a way to teach gitlab how to prioritize your jobs

11:42 <mupuf> that means it will be up to the runners to pick what is the most important to them

11:42 <koike> I see

11:42 <DavidHeidelberg[m]> karolherbst: we had the discussion about increasing over 1h, while I was strongly for, I think arguments like it can make everything much worse won

11:44 <karolherbst> DavidHeidelberg[m]: I think it might probably make more sense to just have very tight timeouts on specific jobs instead so it doesn't get out of control

11:44 <karolherbst> but yeah...

11:45 <mupuf> karolherbst: timeouts are for when a job actually runs

11:45 <DavidHeidelberg[m]> Trick is... sometimes regular job which takes 10 minutes take 20, sometimes when issues arise 30... :P not mentioning download issues

11:45 <mupuf> what about how long they take to be picked up?

11:45 <DavidHeidelberg[m]> So there is best, when jobs have "pillow"

11:45 <karolherbst> ohh sure

11:45 <karolherbst> it's just annoying if the order of jobs running was very bad and it just reaps the entire pipeline, becuase some container jobs started at +50 minutes

11:46 <karolherbst> and I don't really see the benefit of capping the pipeline, because it doesn't actually prevent anything bad

11:46 <karolherbst> it just makes you restart the entire pipeline again

11:46 <karolherbst> which is worse

11:47 <mupuf> DavidHeidelberg[m]: what we need is a daemon that monitors for marge-created jobs that have not started after 20 minutes, creates a runner matching the tags needed by the job, then run the job /bin/true

11:47 <karolherbst> unless there is something very important the global pipeline timeout protects against

11:48 <karolherbst> but if it's not having enough runners available, timeing out the pipeline is _worse_ because the next MR will run into the same problem and on top of htat the first MR gets queued again (probably). Unless I'm missing something important here

11:48 <DavidHeidelberg[m]> Maybe Marge could detect that all jobs running && no longer than 20 minutes

11:48 <koike> interesting, dynamically spawn runners for specific tags when needed

11:48 <DavidHeidelberg[m]> Then it means job is not stuck and it started recently

11:48 <karolherbst> but then job specific timeouts are enough, no?

11:49 <karolherbst> or is there a problem with queued jobs waiting for runners, but never starting? but then again, reaping the pipeline won't fix this issue, would it?

11:50 <DavidHeidelberg[m]> If the job start 10 minutes before end and takes 15 ;) we could prevent stopping it

11:50 <karolherbst> the only benefit I can see here is that it would allow MRs to merge not requiring those jobs not having runners available

11:50 <karolherbst> DavidHeidelberg[m]: sure, but that job also has dependencies and everything, and they might have a short time they are queued but not started

11:51 <karolherbst> it kinda sounds like working around the global issue gets overly complex

11:51 <karolherbst> *global timeout

11:51 <karolherbst> uhm.. dependent jobs I menat

11:52 <DavidHeidelberg[m]> We have the issue usually with one or two jobs finishing too late, because all DUT was busy when Marge requested it

11:52 <karolherbst> but anyway, the most commen (or so I think) reason for the global timeout is not enough runners, and then developers reassigning to marge just makes it worse and I would be surprised if in the avarage case anything else will and is happening

11:53 <karolherbst> yeah

11:53 <karolherbst> my point is just: reaping the pipeline makes it actually worse instead of helping

11:53 <koike> mupuf: regarding your idea of the deamon and spawning gitlab-runners for a given tag. What would be the limit of runners in a given hardware runner

11:53 <mupuf> DavidHeidelberg[m]: hence why I believe we need this service that will just take over these jobs

11:53 <zmike> karolherbst: the global timeout prevents job times from continually increasing

11:53 <zmike> if there is no global timeout there is no incentive to keep job times down

11:54 <karolherbst> well

11:54 <karolherbst> it's fine to have that idea

11:54 <koike> btw, we can collect some data to see if the bottleneck is missing DUTs or missing runners

11:54 <karolherbst> but it doesn't seem to work

11:54 <mupuf> koike: well, there are no limits. But you probably just want to register the runner then remove it right after there are no jobs to auto-accept

11:54 <zmike> it mostly does, other than this moderately rare period of time when all jobs are failing all the time

11:54 <zmike> which is unrelated to the timeout

11:56 <karolherbst> I honestly don't think that this pipeline timeout actually matters as if we'd have long queues of MRs, somebody would shout on IRC why those pipelines are running to long anyway

11:56 <karolherbst> and if we run into that issue, we'll notive even without that global timeout

11:56 <karolherbst> but the global timeout actually causes issues which are preventable

11:57 <zmike> the issues you're seeing are just the symptoms of other issues

11:57 <karolherbst> sure, we could have more runners and everything

11:57 <zmike> jobs not being started when they should and such

11:57 <karolherbst> my point is just, the global timeout makes those situations worse

11:57 <zmike> in cases where ci is in a general state of shambles sure

11:58 <zmike> but that's not the common case

11:58 <zmike> mostly that's just the past few weeks with the infrastructure changes

11:58 <koike> tpalli: could you point me which pipelines you see that are taking too long? I want to see if they are being accounted here https://ci-stats-grafana.freedesktop.org/d/Ae_TLIwVk/mesa-ci-quality-false-positives?orgId=1&viewPanel=15&from=now-15d&to=now

11:58 <karolherbst> I agree, but I also don't see that the global timeout actually helps with preventing long running pipelines or developers going nuts on the jobs

11:59 <karolherbst> to me it just feels very pointless to have it

11:59 <karolherbst> so it doesn't help, and if it does something, it makes it simply worse

11:59 <zmike> as someone who's had many jobs canceled by it over the years I disagree

12:11 vkareh has joined #freedesktop

12:24 ximion has joined #freedesktop

12:29 <tpalli> zmike hit timeout with 26400 but eventually succeeded

12:30 AbleBacon has joined #freedesktop

12:33 <tpalli> koike hit timeout with 26400 but eventually succeeded :)

12:33 <zmike> I'm not saying it doesn't happen, I'm saying there's other issues that should be resolved

12:34 <koike> tpalli: is 26400 the MR number? at least I couldn't find it on https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26400

12:39 <tpalli> koike sorry it is 24600 ...I do need new glasses

12:39 <koike> 😎

12:43 ximion has quit [Quit: Detached from the Matrix]

12:49 <koike> tpalli: hmm, I see the last failed pipeline 975271, but I don't see the one prior to it https://ci-stats-grafana.freedesktop.org/d/Ae_TLIwVk/mesa-ci-quality-false-positives?orgId=1&viewPanel=16 , I also see that 975271 had lots of job retries

12:52 <koike> tpalli: it seems that in this case, the issue wasn't the queue, but flakes on the jobs (of infra errors): e.g. https://gitlab.freedesktop.org/mesa/mesa/-/jobs/48357784

12:52 <koike> or* infra errors

12:53 <koike> e.g. 2: https://gitlab.freedesktop.org/mesa/mesa/-/jobs/48357712

13:23 bmodem has quit [Ping timeout: 480 seconds]

13:26 bmodem has joined #freedesktop

13:28 blatant has joined #freedesktop

13:34 bmodem has quit [Ping timeout: 480 seconds]

13:51 <karolherbst> mhhh https://gitlab.freedesktop.org/mesa/mesa/-/jobs/48374197

13:51 <karolherbst> "ERROR: Job failed (system failure): panic: runtime error: invalid memory address or nil pointer dereference"

13:52 <alatiera> ahahahahaha

13:52 <tpalli> koike ok thanks for checking

14:21 feto_bastardo0 has quit [Read error: Connection reset by peer]

14:38 Ahuj has quit [Ping timeout: 480 seconds]

14:42 <mupuf> Yeah, the gitlab runner isn't exactly the most reliable

14:42 <mupuf> I'm sure it is OK when everything runs in the same DC

14:44 <tintou> Most of my failures today are from fdo-equinix-m3l-22

14:45 <tintou> Like the two failing jobs here https://gitlab.freedesktop.org/tintou/virglrenderer/-/jobs/48376125

14:54 feto_bastardo0 has joined #freedesktop

14:56 <vyivel> channel topic should be updated

14:57 tzimmermann has quit [Quit: Leaving]

14:58 <robclark> hmm, all the registry pulls are fail?

15:08 feto_bastardo0 has quit []

15:10 feto_bastardo has joined #freedesktop

15:12 mripard has quit [Quit: mripard]

15:27 GNUmoon has quit [Remote host closed the connection]

15:28 GNUmoon has joined #freedesktop

15:29 <karolherbst> mhhh https://gitlab.freedesktop.org/mesa/mesa/-/jobs/48376178 "ERROR: Uploading artifacts as "archive" to coordinator... error error=couldn't execute POST"

15:29 GNUmoon has quit [Read error: Connection reset by peer]

15:30 GNUmoon has joined #freedesktop

15:31 blatant has quit [Quit: WeeChat 4.0.4]

15:59 <karolherbst> https://gitlab.freedesktop.org/mesa/mesa/-/jobs/48379510

15:59 <karolherbst> ERROR: Job failed: failed to pull image "registry.freedesktop.org/mesa/mesa/debian/x86_64_build:2023-06-24-agility-711--2023-08-30-bindgen-cli--d5aa3941aa03c2f716595116354fb81eb8012acb" with specified policies [if-not-present]: writing blob: storing blob to file "/var/tmp/storage3897829639/2": happened during read: read tcp 147.28.150.201:37610->147.75.198.156:443: read: connection timed out (manager.go:237:1230s)

16:53 <eric_engestrom> https://gitlab.freedesktop.org/mesa/mesa/-/jobs/48383140

16:53 <eric_engestrom> > ERROR: Job failed: failed to pull image "registry.freedesktop.org/mesa/mesa/debian/android_build:2023-06-24-agility-711--2023-08-30-bindgen-cli--d5aa3941aa03c2f716595116354fb81eb8012acb" with specified policies [if-not-present]: writing blob: adding layer with blob "sha256:9e2a801b4932d841170463d58ef5152fdfd791dfb413fe2d048af5c55b70c7a3": layer not known (manager.go:237:132s)

16:54 <eric_engestrom> as well as the same issue as karolherbst above (connection timed out) but that's just bad network, not sure we can do much about it

16:55 <eric_engestrom> and I'm also seeing a bunch of the ever-mysterious infinite hang while docker is fetching the job image, eg. https://gitlab.freedesktop.org/mesa/mesa/-/jobs/48383700

16:55 <eric_engestrom> no idea what can be causing this

16:55 * eric_engestrom hopes everything will magically get better on monday with the gitlab upgrade :P

16:56 <karolherbst> mhhh

16:56 <karolherbst> maybe it's the runner

16:57 <eric_engestrom> also: ERROR: Job failed (system failure): Error response from daemon: container d5bee8d21abbd143891fb80a4483aab1dd8e308e4efc48b2ad683939511a8752 does not exist in database: no such container (exec.go:78:0s)

16:57 <eric_engestrom> https://gitlab.freedesktop.org/mesa/mesa/-/jobs/48381525

16:57 <karolherbst> yeah..

16:57 <karolherbst> it's all the same runner

16:57 <karolherbst> can we nuke that runner for now and see if it gets rid of those errors for now?

16:57 <eric_engestrom> looks like docker (the service) if seriously buggy on that runner

16:58 * eric_engestrom thinks all the admins are gone for the weekend :]

17:01 <eric_engestrom> karolherbst: it's not the runner though, it happens on both runners (fdo-equinix-m3l-22 and fdo-equinix-m3l-23)

17:06 <karolherbst> okay

17:06 <karolherbst> then we should nukethem both :P

17:33 <eric_engestrom> hehe, can't have flaky ci if you don't have ci anymore

18:21 flom84 has joined #freedesktop

18:43 alanc has quit [Remote host closed the connection]

18:43 alanc has joined #freedesktop

18:55 flom84 has quit [Remote host closed the connection]

18:55 flom84 has joined #freedesktop

18:58 flom84 has quit []

19:05 utsweetyfish has quit [Remote host closed the connection]

19:05 utsweetyfish has joined #freedesktop

19:13 thaller is now known as Guest1465

19:13 thaller has joined #freedesktop

19:17 ximion has joined #freedesktop

19:19 Guest1465 has quit [Ping timeout: 480 seconds]

19:31 utsweetyfish has quit [Remote host closed the connection]

19:33 utsweetyfish has joined #freedesktop

19:36 vkareh has quit [Quit: WeeChat 3.6]

19:53 immibis has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]

19:54 immibis has joined #freedesktop

20:09 mvlad has quit [Remote host closed the connection]

20:41 jarthur has quit [Remote host closed the connection]

20:51 sima has quit [Ping timeout: 480 seconds]

21:30 <dj-death> is this a network issue on the Material Testers trace? : https://mesa.pages.freedesktop.org/-/mesa/-/jobs/48399364/artifacts/results/summary/results/trace@gl-intel-glk@godot@Material%20Testers.x86_64_2020.04.08_13.38_frame799.rdc.html

21:34 <DavidHeidelberg[m]> dj-death: nah, %20 aka space in the filename

21:35 <DavidHeidelberg[m]> dj-death: if you rebase, it's fixed by disabling the trace, + I have almost prepared replacement

21:36 <DavidHeidelberg[m]> Previously it somehow worked, but just due proxy being too much benevolent.

21:38 <dj-death> DavidHeidelberg[m]: oh right sorry

21:38 <dj-death> DavidHeidelberg[m]: my MR is pretty old and I avoided rebasing for the reviewer

21:38 <dj-death> DavidHeidelberg[m]: thanks!

22:16 AbleBacon has quit [Remote host closed the connection]

22:37 utsweetyfish has quit [Remote host closed the connection]

22:38 utsweetyfish has joined #freedesktop

22:46 <DavidHeidelberg[m]> dj-death: no problem :D you could eventually even cherry pick the fix :P :D

22:50 utsweetyfish has quit [Remote host closed the connection]

22:51 utsweetyfish has joined #freedesktop

23:22 jsto has quit [Quit: jsto]

23:23 jsto has joined #freedesktop

23:27 utsweetyfish has quit [Remote host closed the connection]

23:27 utsweetyfish has joined #freedesktop

23:52 utsweetyfish has quit [Remote host closed the connection]

23:53 utsweetyfish has joined #freedesktop