daniels changed the topic of #freedesktop to: GitLab is currently down for upgrade; will be a while before it's back || https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
dcunit3d has quit [Ping timeout: 480 seconds]
gry has quit [Quit: leaving]
psykose has quit [Ping timeout: 480 seconds]
psykose has joined #freedesktop
columbarius has joined #freedesktop
co1umbarius has quit [Ping timeout: 480 seconds]
Kayden has joined #freedesktop
<karolherbst> random fail in the debian-arm64-asan job: https://gitlab.freedesktop.org/mesa/mesa/-/jobs/48335136 And I don't really see what's going wrong there
psykose has quit [Ping timeout: 480 seconds]
epony has quit [Ping timeout: 480 seconds]
psykose has joined #freedesktop
psykose has quit [Remote host closed the connection]
psykose has joined #freedesktop
psykose has quit [Ping timeout: 480 seconds]
psykose has joined #freedesktop
ximion has quit [Quit: Detached from the Matrix]
<mupuf> karolherbst: the gating script failed, and stopped execution
<mupuf> bentiss: another one ^
AbleBacon has quit [Read error: Connection reset by peer]
Leopold_ has quit []
bmodem has joined #freedesktop
Leopold_ has joined #freedesktop
keypresser86 has joined #freedesktop
bmodem has quit [Ping timeout: 480 seconds]
bmodem has joined #freedesktop
bmodem has quit [Excess Flood]
bmodem has joined #freedesktop
keypresser86 has quit []
tzimmermann has joined #freedesktop
epony has joined #freedesktop
sima has joined #freedesktop
epony has quit []
epony has joined #freedesktop
mvlad has joined #freedesktop
bmodem has quit [Ping timeout: 480 seconds]
<bentiss> mupuf: OK, well then, I think we need someone to give me the proper bash command to fail if curl has failed, and which writes a proper error log, instead of just calling 'sh'
<bentiss> (FWIW taking the day off today, so spotty availability)
<mupuf> bentiss: sure! Enjoy your day off :)
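A minimal sketch of the bash bentiss asks for above: fetch the gating script with curl, fail loudly and write an error log if the download fails, and only then hand the script to sh. The URL variable and log filename are placeholders, not the real gating setup.

    set -euo pipefail
    script=$(mktemp)
    # --fail makes curl exit non-zero on HTTP errors instead of saving the error page
    if ! curl --fail --silent --show-error --retry 3 \
            -o "$script" "$GATING_SCRIPT_URL" 2> gating-curl.log; then
        echo "ERROR: failed to download gating script from $GATING_SCRIPT_URL:" >&2
        cat gating-curl.log >&2
        exit 1
    fi
    sh "$script"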
<epony> off joy your day course
Ahuj has joined #freedesktop
bmodem has joined #freedesktop
<alatiera> the equinix runners seem to often be timing out or closing the connection to gitlab
<alatiera> they are hosted in the same place as gitlab itself, iirc?
feto_bastardo0 has joined #freedesktop
epony has quit [autokilled: This host violated network policy. Mail support@oftc.net if you think this is in error. (2023-09-01 08:47:19)]
feto_bastardo has quit [Ping timeout: 480 seconds]
mceier has quit [Quit: leaving]
mceier has joined #freedesktop
serjosna has joined #freedesktop
serjosna was kicked from #freedesktop by ChanServ [You are not permitted on this channel]
<tpalli> my ci pipeline passes but marge keeps saying it takes too long so merge fails
<tpalli> I'm wondering if the timeout should be bumped a bit, or maybe this is just a temporary issue
<mupuf> tpalli: yeah, that's the age-old issue of runners not prioritizing jobs from marge
mceier has quit [Quit: leaving]
mceier has joined #freedesktop
<karolherbst> yeah.. two days ago I had a pipeline succeeding after 57 minutes :')
<karolherbst> though I think it might be fine to bump the timeout for the general pipeline
<karolherbst> it's obviously worse to fail the entire thing just because the ordering got bad
<karolherbst> as this just means it will run the entire thing once again
<karolherbst> and make the problem just worse
<koike> mupuf: any pointers about how to fix this issue about runners prioritizing jobs from marge?
<mupuf> koike: no, but the issue is simple. Runners poll gitlab for jobs, and gitlab serves the oldest job first. No way around this.
<mupuf> A possible workaround would be to register more runners in gitlab than we have in real life, but that requires controlling when the gitlab-runner is polling to get a job (we only want to poll for a job from the low-priority queue if there are no jobs in the high-priority queue)
<mupuf> projects would set the priority they want using runner tags
<koike> mupuf: you mean dedicated tags (dedicated runners) for Marge?
<mupuf> The least amount of work to get there would be to add support in gitlab-runner for a one-shot operation: Poll gitlab once for a job, and if there was nothing, exit. If there was a job, exit when it is done
<mupuf> dedicated tags for Marge yes. Doesn't mean dedicated **hardware** runners though
<mupuf> as long as you can make sure that only one gitlab runner is active at all times per hardware runner
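A rough sketch of that wrapper, assuming a hypothetical "gitlab-runner run-once" one-shot mode (poll once, run at most one job, exit non-zero when there was nothing to do); no such mode exists in gitlab-runner today. Two registrations share one machine, and the low-priority one is only polled when the Marge-tagged one was idle.

    while true; do
        # hypothetical one-shot invocation; the flag and its exit-code semantics are assumptions
        if ! gitlab-runner run-once --config /etc/gitlab-runner/marge.toml; then
            # nothing queued for Marge, so offer the machine to the normal queue
            gitlab-runner run-once --config /etc/gitlab-runner/default.toml
        fi
        sleep 30
    done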
<koike> mupuf: hmm, I'm not very familiar with how gitlab-runners work; I thought it was a one-shot operation. Does a gitlab-runner get a list of jobs and have its own queue?
<koike> mupuf: regarding the dedicated tags for Marge, can it be dynamic, or would Marge need to apply a patch on top of the tree to change the tag?
<mupuf> koike: no, runners just poll gitlab at fixed intervals to get *1* job
<mupuf> gitlab is responsible for choosing which job is the most appropriate
killpid_ has joined #freedesktop
<mupuf> and guess what policy it is: oldest first
<mupuf> the issue with the current implementation of the gitlab runner is that we don't control when polling happens
<mupuf> you can't just expose 2 runners with different tags. Otherwise gitlab will think that it has access to two runners... when they are in fact backed by the same machine
<koike> any idea how hard it would be to add a new policy to gitlab ci?
<mupuf> that would be a major undertaking
<mupuf> what I think we need to do on the gitlab side is: add 2 new rest endpoints
<mupuf> the first one to list all the jobs a runner may take
<mupuf> and the second is to tell "I want to run this job"
<mupuf> this way, no need to find a way to teach gitlab how to prioritize your jobs
<mupuf> that means it will be up to the runners to pick what is most important to them
<koike> I see
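A sketch of how a runner could use the two endpoints mupuf proposes; neither endpoint exists in GitLab today, and the paths, token header, and "marge" tag below are made up purely for illustration.

    # 1) list all jobs this runner may take (hypothetical endpoint)
    jobs=$(curl -fsS --header "PRIVATE-TOKEN: $RUNNER_TOKEN" \
        "https://gitlab.freedesktop.org/api/v4/runners/available_jobs")
    # 2) the runner applies its own policy (here: Marge-tagged jobs first, otherwise whatever is listed first)...
    job_id=$(echo "$jobs" | jq -r '([.[] | select(.tag_list | index("marge"))] + .)[0].id')
    # 3) ...and claims the job it picked (hypothetical endpoint)
    curl -fsS -X POST --header "PRIVATE-TOKEN: $RUNNER_TOKEN" \
        "https://gitlab.freedesktop.org/api/v4/runners/jobs/$job_id/claim"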
<DavidHeidelberg[m]> karolherbst: we had the discussion about increasing it past 1h; while I was strongly for it, I think arguments like "it can make everything much worse" won
<karolherbst> DavidHeidelberg[m]: I think it might make more sense to just have very tight timeouts on specific jobs instead, so it doesn't get out of control
<karolherbst> but yeah...
<mupuf> karolherbst: timeouts are for when a job actually runs
<DavidHeidelberg[m]> The trick is... sometimes a regular job which takes 10 minutes takes 20, and sometimes 30 when issues arise... :P not to mention download issues
<mupuf> what about how long they take to be picked up?
<DavidHeidelberg[m]> So it's best when jobs have some "pillow"
<karolherbst> ohh sure
<karolherbst> it's just annoying if the order of jobs running was very bad and it just reaps the entire pipeline, because some container jobs started at +50 minutes
<karolherbst> and I don't really see the benefit of capping the pipeline, because it doesn't actually prevent anything bad
<karolherbst> it just makes you restart the entire pipeline again
<karolherbst> which is worse
<mupuf> DavidHeidelberg[m]: what we need is a daemon that monitors for marge-created jobs that have not started after 20 minutes, creates a runner matching the tags needed by the job, then runs the job as /bin/true
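A very rough sketch of such a daemon, using the existing GitLab jobs API to spot Marge-created jobs still pending after 20 minutes; the project id, token, "marge-bot" username and the spawn_runner_for_tags helper are placeholders/assumptions, and the actual runner creation is left out.

    api="https://gitlab.freedesktop.org/api/v4/projects/$PROJECT_ID"
    while true; do
        cutoff=$(date -u -d '20 minutes ago' +%Y-%m-%dT%H:%M:%SZ)   # GNU date
        curl -fsS --header "PRIVATE-TOKEN: $GITLAB_TOKEN" "$api/jobs?scope=pending&per_page=100" |
            jq -r --arg cutoff "$cutoff" \
                '.[] | select(.user.username == "marge-bot" and .created_at < $cutoff)
                     | [.id, (.tag_list | join(","))] | @tsv' |
            while IFS=$'\t' read -r job_id tags; do
                # hypothetical helper: registers a throwaway runner for exactly these tags
                spawn_runner_for_tags "$tags"
            done
        sleep 60
    done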
<karolherbst> unless there is something very important the global pipeline timeout protects against
<karolherbst> but if it's not having enough runners available, timing out the pipeline is _worse_ because the next MR will run into the same problem and on top of that the first MR gets queued again (probably). Unless I'm missing something important here
<DavidHeidelberg[m]> Maybe Marge could detect that all jobs are running && have been running for no longer than 20 minutes
<koike> interesting, dynamically spawn runners for specific tags when needed
<DavidHeidelberg[m]> Then it means the job is not stuck and started recently
<karolherbst> but then job specific timeouts are enough, no?
<karolherbst> or is there a problem with queued jobs waiting for runners, but never starting? but then again, reaping the pipeline won't fix this issue, would it?
<DavidHeidelberg[m]> If the job starts 10 minutes before the end and takes 15 ;) we could avoid stopping it
<karolherbst> the only benefit I can see here is that it would let MRs merge that don't require the jobs which have no runners available
<karolherbst> DavidHeidelberg[m]: sure, but that job also has dependent jobs and everything, and they might spend only a short time queued but not started
<karolherbst> it kinda sounds like working around the global timeout gets overly complex
<DavidHeidelberg[m]> We usually have the issue with one or two jobs finishing too late, because all the DUTs were busy when Marge requested them
<karolherbst> but anyway, the most common (or so I think) reason for hitting the global timeout is not enough runners, and then developers reassigning to marge just makes it worse; I would be surprised if in the average case anything else is happening
<karolherbst> yeah
<karolherbst> my point is just: reaping the pipeline makes it actually worse instead of helping
<koike> mupuf: regarding your idea of the daemon spawning gitlab-runners for a given tag: what would be the limit of runners on a given hardware runner?
<mupuf> DavidHeidelberg[m]: hence why I believe we need this service that will just take over these jobs
<zmike> karolherbst: the global timeout prevents job times from continually increasing
<zmike> if there is no global timeout there is no incentive to keep job times down
<karolherbst> well
<karolherbst> it's fine to have that idea
<koike> btw, we can collect some data to see if the bottleneck is missing DUTs or missing runners
<karolherbst> but it doesn't seem to work
<mupuf> koike: well, there are no limits. But you probably just want to register the runner, then remove it as soon as there are no jobs left to auto-accept
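For what it's worth, creating and dropping such a short-lived runner is already doable with the stock CLI; the token, tag, and description below are placeholders:

    # register a temporary runner that only accepts Marge-tagged jobs
    gitlab-runner register --non-interactive \
        --url https://gitlab.freedesktop.org \
        --registration-token "$REGISTRATION_TOKEN" \
        --executor docker --docker-image alpine:latest \
        --tag-list marge-priority \
        --description temp-marge-runner
    # ...and unregister it again once there is nothing left to auto-accept
    gitlab-runner unregister --name temp-marge-runner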
<zmike> it mostly does, other than this moderately rare period of time when all jobs are failing all the time
<zmike> which is unrelated to the timeout
<karolherbst> I honestly don't think this pipeline timeout actually matters, because if we had long queues of MRs, somebody would shout on IRC asking why those pipelines are running too long anyway
<karolherbst> and if we run into that issue, we'll notice it even without that global timeout
<karolherbst> but the global timeout actually causes issues which are preventable
<zmike> the issues you're seeing are just the symptoms of other issues
<karolherbst> sure, we could have more runners and everything
<zmike> jobs not being started when they should and such
<karolherbst> my point is just, the global timeout makes those situations worse
<zmike> in cases where ci is in a general state of shambles sure
<zmike> but that's not the common case
<zmike> mostly that's just the past few weeks with the infrastructure changes
<koike> tpalli: could you point me which pipelines you see that are taking too long? I want to see if they are being accounted here https://ci-stats-grafana.freedesktop.org/d/Ae_TLIwVk/mesa-ci-quality-false-positives?orgId=1&viewPanel=15&from=now-15d&to=now
<karolherbst> I agree, but I also don't see that the global timeout actually helps with preventing long running pipelines or developers going nuts on the jobs
<karolherbst> to me it just feels very pointless to have it
<karolherbst> so it doesn't help, and if it does something, it makes it simply worse
<zmike> as someone who's had many jobs canceled by it over the years I disagree
vkareh has joined #freedesktop
ximion has joined #freedesktop
<tpalli> zmike hit timeout with 26400 but eventually succeeded
AbleBacon has joined #freedesktop
<tpalli> koike hit timeout with 26400 but eventually succeeded :)
<zmike> I'm not saying it doesn't happen, I'm saying there's other issues that should be resolved
<koike> tpalli: is 26400 the MR number? at least I couldn't find it on https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26400
<tpalli> koike sorry it is 24600 ...I do need new glasses
<koike> 😎
ximion has quit [Quit: Detached from the Matrix]
<koike> tpalli: hmm, I see the last failed pipeline 975271, but I don't see the one prior to it https://ci-stats-grafana.freedesktop.org/d/Ae_TLIwVk/mesa-ci-quality-false-positives?orgId=1&viewPanel=16 , I also see that 975271 had lots of job retries
<koike> tpalli: it seems that in this case, the issue wasn't the queue, but flakes on the jobs (or infra errors): e.g. https://gitlab.freedesktop.org/mesa/mesa/-/jobs/48357784
bmodem has quit [Ping timeout: 480 seconds]
bmodem has joined #freedesktop
blatant has joined #freedesktop
bmodem has quit [Ping timeout: 480 seconds]
<karolherbst> "ERROR: Job failed (system failure): panic: runtime error: invalid memory address or nil pointer dereference"
<alatiera> ahahahahaha
<tpalli> koike ok thanks for checking
feto_bastardo0 has quit [Read error: Connection reset by peer]
Ahuj has quit [Ping timeout: 480 seconds]
<mupuf> Yeah, the gitlab runner isn't exactly the most reliable
<mupuf> I'm sure it is OK when everything runs in the same DC
<tintou> Most of my failures today are from fdo-equinix-m3l-22
feto_bastardo0 has joined #freedesktop
<vyivel> channel topic should be updated
tzimmermann has quit [Quit: Leaving]
<robclark> hmm, all the registry pulls are failing?
feto_bastardo0 has quit []
feto_bastardo has joined #freedesktop
mripard has quit [Quit: mripard]
GNUmoon has quit [Remote host closed the connection]
GNUmoon has joined #freedesktop
<karolherbst> mhhh https://gitlab.freedesktop.org/mesa/mesa/-/jobs/48376178 "ERROR: Uploading artifacts as "archive" to coordinator... error error=couldn't execute POST"
GNUmoon has quit [Read error: Connection reset by peer]
GNUmoon has joined #freedesktop
blatant has quit [Quit: WeeChat 4.0.4]
<karolherbst> ERROR: Job failed: failed to pull image "registry.freedesktop.org/mesa/mesa/debian/x86_64_build:2023-06-24-agility-711--2023-08-30-bindgen-cli--d5aa3941aa03c2f716595116354fb81eb8012acb" with specified policies [if-not-present]: writing blob: storing blob to file "/var/tmp/storage3897829639/2": happened during read: read tcp 147.28.150.201:37610->147.75.198.156:443: read: connection timed out (manager.go:237:1230s)
<eric_engestrom> > ERROR: Job failed: failed to pull image "registry.freedesktop.org/mesa/mesa/debian/android_build:2023-06-24-agility-711--2023-08-30-bindgen-cli--d5aa3941aa03c2f716595116354fb81eb8012acb" with specified policies [if-not-present]: writing blob: adding layer with blob "sha256:9e2a801b4932d841170463d58ef5152fdfd791dfb413fe2d048af5c55b70c7a3": layer not known (manager.go:237:132s)
<eric_engestrom> as well as the same issue as karolherbst above (connection timed out) but that's just bad network, not sure we can do much about it
<eric_engestrom> and I'm also seeing a bunch of the ever-mysterious infinite hang while docker is fetching the job image, eg. https://gitlab.freedesktop.org/mesa/mesa/-/jobs/48383700
<eric_engestrom> no idea what can be causing this
* eric_engestrom hopes everything will magically get better on monday with the gitlab upgrade :P
<karolherbst> mhhh
<karolherbst> maybe it's the runner
<eric_engestrom> also: ERROR: Job failed (system failure): Error response from daemon: container d5bee8d21abbd143891fb80a4483aab1dd8e308e4efc48b2ad683939511a8752 does not exist in database: no such container (exec.go:78:0s)
<karolherbst> yeah..
<karolherbst> it's all the same runner
<karolherbst> can we nuke that runner and see if it gets rid of those errors for now?
<eric_engestrom> looks like docker (the service) is seriously buggy on that runner
* eric_engestrom thinks all the admins are gone for the weekend :]
<eric_engestrom> karolherbst: it's not the runner though, it happens on both runners (fdo-equinix-m3l-22 and fdo-equinix-m3l-23)
<karolherbst> okay
<karolherbst> then we should nuke them both :P
<eric_engestrom> hehe, can't have flaky ci if you don't have ci anymore
flom84 has joined #freedesktop
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
flom84 has quit [Remote host closed the connection]
flom84 has joined #freedesktop
flom84 has quit []
utsweetyfish has quit [Remote host closed the connection]
utsweetyfish has joined #freedesktop
thaller is now known as Guest1465
thaller has joined #freedesktop
ximion has joined #freedesktop
Guest1465 has quit [Ping timeout: 480 seconds]
utsweetyfish has quit [Remote host closed the connection]
utsweetyfish has joined #freedesktop
vkareh has quit [Quit: WeeChat 3.6]
immibis has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
immibis has joined #freedesktop
mvlad has quit [Remote host closed the connection]
jarthur has quit [Remote host closed the connection]
sima has quit [Ping timeout: 480 seconds]
<DavidHeidelberg[m]> dj-death: nah, %20 aka space in the filename
<DavidHeidelberg[m]> dj-death: if you rebase, it's fixed by disabling the trace, + I have almost prepared a replacement
<DavidHeidelberg[m]> Previously it somehow worked, but just due to the proxy being too benevolent.
<dj-death> DavidHeidelberg[m]: oh right sorry
<dj-death> DavidHeidelberg[m]: my MR is pretty old and I avoided rebasing for the reviewer
<dj-death> DavidHeidelberg[m]: thanks!
AbleBacon has quit [Remote host closed the connection]
utsweetyfish has quit [Remote host closed the connection]
utsweetyfish has joined #freedesktop
<DavidHeidelberg[m]> dj-death: no problem :D you could eventually even cherry pick the fix :P :D
utsweetyfish has quit [Remote host closed the connection]
utsweetyfish has joined #freedesktop
jsto has quit [Quit: jsto]
jsto has joined #freedesktop
utsweetyfish has quit [Remote host closed the connection]
utsweetyfish has joined #freedesktop
utsweetyfish has quit [Remote host closed the connection]
utsweetyfish has joined #freedesktop