daniels changed the topic of #freedesktop to: GitLab is currently down for upgrade; will be a while before it's back || https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
psykose has quit [Remote host closed the connection]
psykose has joined #freedesktop
psykose has quit [Ping timeout: 480 seconds]
psykose has joined #freedesktop
ximion has quit [Quit: Detached from the Matrix]
<mupuf>
karolherbst: the gating script failed, and stopped execution
<mupuf>
bentiss: another one ^
AbleBacon has quit [Read error: Connection reset by peer]
Leopold_ has quit []
bmodem has joined #freedesktop
Leopold_ has joined #freedesktop
keypresser86 has joined #freedesktop
bmodem has quit [Ping timeout: 480 seconds]
bmodem has joined #freedesktop
bmodem has quit [Excess Flood]
bmodem has joined #freedesktop
keypresser86 has quit []
tzimmermann has joined #freedesktop
epony has joined #freedesktop
sima has joined #freedesktop
epony has quit []
epony has joined #freedesktop
mvlad has joined #freedesktop
bmodem has quit [Ping timeout: 480 seconds]
<bentiss>
mupuf: OK, well then, I think we need someone to give me the proper bash command that fails if curl has failed and writes a proper error log, instead of just calling 'sh'
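A minimal sketch of what bentiss is asking for, assuming the current step pipes curl straight into sh; the URL and log path below are placeholders, not the real CI values:

```bash
#!/usr/bin/env bash
# Sketch: fetch the gating script, fail the job if curl fails, and keep an
# error log instead of piping the download straight into `sh`.
# SCRIPT_URL and LOG are placeholders, not the real values used in the CI.
set -euo pipefail

SCRIPT_URL="https://example.invalid/gating-script.sh"
LOG="gating-script.log"

# --fail makes curl return non-zero on HTTP errors, so with `set -e` the job
# aborts before it ever tries to execute a missing or truncated script.
if ! curl --fail --silent --show-error --location "$SCRIPT_URL" \
        -o gating-script.sh 2>"$LOG"; then
    echo "curl failed, see $LOG:" >&2
    cat "$LOG" >&2
    exit 1
fi

sh gating-script.sh 2>&1 | tee -a "$LOG"
```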
<bentiss>
(FWIW taking the day off today, so spotty availability)
<mupuf>
bentiss: sure! Enjoy your day off :)
<epony>
off joy your day course
Ahuj has joined #freedesktop
bmodem has joined #freedesktop
<alatiera>
the equinox runners seem to be often timing out or closing the connection to gitlab
<alatiera>
they are hosted on the same place as gitlab itself iirc right?
feto_bastardo0 has joined #freedesktop
epony has quit [autokilled: This host violated network policy. Mail support@oftc.net if you think this is in error. (2023-09-01 08:47:19)]
feto_bastardo has quit [Ping timeout: 480 seconds]
mceier has quit [Quit: leaving]
mceier has joined #freedesktop
serjosna has joined #freedesktop
serjosna was kicked from #freedesktop by ChanServ [You are not permitted on this channel]
<tpalli>
my ci pipeline passes, but marge keeps saying it takes too long, so the merge fails
<tpalli>
I'm wondering if the timeout should be bumped a bit, or maybe this is just a temporary issue
<mupuf>
tpalli: yeah, that's the age-old issue of runners not prioritizing jobs from marge
mceier has quit [Quit: leaving]
mceier has joined #freedesktop
<karolherbst>
yeah.. two days ago I had a pipeline succeeding after 57 minutes :')
<karolherbst>
though I think it might be fine to bump the timeout for the general pipeline
<karolherbst>
it's obviously worse to fail the entire thing just because the ordering got bad
<karolherbst>
as this just means it will run the entire thing once again
<karolherbst>
and make the problem just worse
<koike>
mupuf: any pointers about how to fix this issue about runners prioritizing jobs from marge?
<mupuf>
koike: no, but the issue is simple. Runners poll gitlab for jobs, and gitlab serves the oldest job first. No way around this.
<mupuf>
A possible workaround would be to register more runners in gitlab than we have in real life, but that requires controlling when the gitlab-runner is polling to get a job (we only want to poll for a job from the low-priority queue if there are no jobs in the high-priority queue)
<mupuf>
projects would set the priority they want using runner tags
<koike>
mupuf: you mean dedicated tags (dedicated runners) for Marge?
<mupuf>
The least amount of work to get there would be to add support in gitlab-runner for a one-shot operation: Poll gitlab once for a job, and if there was nothing, exit. If there was a job, exit when it is done
<mupuf>
dedicated tags for Marge yes. Doesn't mean dedicated **hardware** runners though
<mupuf>
as long as you can make sure that only one gitlab runner is active at all times per hardware runner
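A rough sketch of that idea, assuming gitlab-runner's run-single mode can be driven this way (tokens, executor options, and the wait timeout are illustrative, not a tested configuration):

```bash
#!/usr/bin/env bash
# Sketch of the workaround: one physical machine, two registered runners
# (one tagged for high-priority/marge jobs, one for everything else), but
# only one gitlab-runner process active at a time, so gitlab never sees
# two idle runners backed by the same hardware.
# Tokens are placeholders; the run-single flags are an assumption.
set -euo pipefail

GITLAB_URL="https://gitlab.freedesktop.org"
HIGH_PRIO_TOKEN="REPLACE_ME"   # runner registered with the high-priority tag
LOW_PRIO_TOKEN="REPLACE_ME"    # runner registered with the normal tags

while true; do
    # Give the high-priority queue the first chance: poll once, run the job
    # if there is one, or give up after a short wait.
    gitlab-runner run-single \
        --url "$GITLAB_URL" --token "$HIGH_PRIO_TOKEN" \
        --executor docker --docker-image alpine:latest \
        --max-builds 1 --wait-timeout 10 || true

    # Then give the low-priority queue one chance before looping back.
    # Note: without the poll-once support mupuf describes, there is no clean
    # way to tell whether the high-priority poll actually found a job, so this
    # only approximates "low priority only when the high-priority queue is empty".
    gitlab-runner run-single \
        --url "$GITLAB_URL" --token "$LOW_PRIO_TOKEN" \
        --executor docker --docker-image alpine:latest \
        --max-builds 1 --wait-timeout 10 || true
done
```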
<koike>
mupuf: hmm, I'm not very familiar with how gitlab-runners work; I thought it was a one-shot operation. Does a gitlab-runner get a list of jobs and have its own queue?
<koike>
mupuf: regarding the dedicated tags for Marge, can it be dynamic? Or would Marge need to apply a patch on top of the tree to change the tag?
<mupuf>
koike: no, runners just poll gitlab at fixed intervals to get *1* job
<mupuf>
gitlab is responsible for choosing which job is the most appropriate
killpid_ has joined #freedesktop
<mupuf>
and guess what policy it is: oldest first
<mupuf>
the issue with the current implementation of the gitlab runner is that we don't control when polling happens
<mupuf>
you can't just expose 2 runners with different tags. Otherwise gitlab will think that it has access to two runners... when they are in fact backed by the same machine
<koike>
any idea how hard it would be to add a new policy to gitlab ci?
<mupuf>
that would be a major undertaking
<mupuf>
what I think we need to do on the gitlab side is: add 2 new rest endpoints
<mupuf>
the first one to list all the jobs a runner may take
<mupuf>
and the second to tell gitlab "I want to run this job"
<mupuf>
this way, no need to find a way to teach gitlab how to prioritize your jobs
<mupuf>
that means it will be up to the runners to pick what is most important to them
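Purely hypothetical, but from a runner's point of view those two endpoints could look something like this (neither exists in GitLab today; the paths and fields are invented for illustration):

```bash
#!/usr/bin/env bash
# Hypothetical shape of the two proposed endpoints. Neither of these exists
# in GitLab today: the paths, parameters and response fields are invented.
set -euo pipefail

GITLAB_URL="https://gitlab.freedesktop.org"
RUNNER_TOKEN="REPLACE_ME"

# 1) List every job this runner *could* take, instead of being handed the
#    single oldest job by gitlab.
curl --fail --silent "$GITLAB_URL/api/v4/jobs/available" \
    --header "PRIVATE-TOKEN: $RUNNER_TOKEN" | jq '.'

# 2) Claim one specific job, chosen by the runner's own policy
#    (for example: marge-created jobs first, oldest job otherwise).
JOB_ID=123456   # picked from the list returned above
curl --fail --silent --request POST \
    "$GITLAB_URL/api/v4/jobs/$JOB_ID/claim" \
    --header "PRIVATE-TOKEN: $RUNNER_TOKEN"
```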
<koike>
I see
<DavidHeidelberg[m]>
karolherbst: we had the discussion about increasing it past 1h; while I was strongly in favor, I think arguments that it can make everything much worse won
<karolherbst>
DavidHeidelberg[m]: I think it would probably make more sense to just have very tight timeouts on specific jobs instead, so it doesn't get out of control
<karolherbst>
but yeah...
<mupuf>
karolherbst: timeouts are for when a job actually runs
<DavidHeidelberg[m]>
The trick is... sometimes a regular job which takes 10 minutes takes 20, and when issues arise, 30... :P not to mention download issues
<mupuf>
what about how long they take to be picked up?
<DavidHeidelberg[m]>
So it's best when jobs have some "pillow" (headroom)
<karolherbst>
ohh sure
<karolherbst>
it's just annoying if the order of jobs running was very bad and it just reaps the entire pipeline, because some container jobs started at +50 minutes
<karolherbst>
and I don't really see the benefit of capping the pipeline, because it doesn't actually prevent anything bad
<karolherbst>
it just makes you restart the entire pipeline again
<karolherbst>
which is worse
<mupuf>
DavidHeidelberg[m]: what we need is a daemon that monitors for marge-created jobs that have not started after 20 minutes, creates a runner matching the tags needed by the job, then runs the job as /bin/true
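A very rough sketch of that daemon, assuming the bot's username is marge-bot and using the project jobs API to spot jobs that have been pending too long; the project ID and token are placeholders, and the runner registration itself is only hinted at in a comment:

```bash
#!/usr/bin/env bash
# Sketch of the "rescue" daemon: find pending jobs created by marge that are
# older than 20 minutes and report the runner tags they need. Assumes the
# bot's username is "marge-bot"; PROJECT_ID and API_TOKEN are placeholders.
set -uo pipefail

GITLAB_URL="https://gitlab.freedesktop.org"
PROJECT_ID="REPLACE_ME"
API_TOKEN="REPLACE_ME"

while true; do
    jobs_json=$(curl --fail --silent \
        "$GITLAB_URL/api/v4/projects/$PROJECT_ID/jobs?scope=pending&per_page=100" \
        --header "PRIVATE-TOKEN: $API_TOKEN") || { sleep 60; continue; }

    echo "$jobs_json" | jq -r --arg now "$(date -u +%s)" '
        .[]
        | select(.user.username == "marge-bot")
        # strip fractional seconds so fromdateiso8601 can parse created_at
        | select((($now | tonumber)
                  - (.created_at | sub("\\.[0-9]+Z$"; "Z") | fromdateiso8601)) > 1200)
        | "\(.id)\t\(.tag_list | join(","))"' |
    while IFS=$'\t' read -r job_id tags; do
        echo "job $job_id pending for >20 min, needs a runner with tags: $tags"
        # A real daemon would now register a one-shot runner with exactly
        # these tags (e.g. gitlab-runner register --tag-list "$tags" ...)
        # whose only purpose is to run /bin/true so the pipeline can finish.
    done

    sleep 60
done
```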
<karolherbst>
unless there is something very important the global pipeline timeout protects against
<karolherbst>
but if it's not having enough runners available, timing out the pipeline is _worse_ because the next MR will run into the same problem, and on top of that the first MR gets queued again (probably). Unless I'm missing something important here
<DavidHeidelberg[m]>
Maybe Marge could detect that all jobs are running && none has been running longer than 20 minutes
<koike>
interesting, dynamically spawn runners for specific tags when needed
<DavidHeidelberg[m]>
Then it means the job is not stuck and it started recently
<karolherbst>
but then job-specific timeouts are enough, no?
<karolherbst>
or is there a problem with queued jobs waiting for runners, but never starting? but then again, reaping the pipeline wouldn't fix that issue, would it?
<DavidHeidelberg[m]>
If the job starts 10 minutes before the end and takes 15 ;) we could prevent stopping it
<karolherbst>
the only benefit I can see here is that it would allow MRs that don't require the jobs with no runners available to merge
<karolherbst>
DavidHeidelberg[m]: sure, but that job also has dependencies and everything, and those might each spend a short time queued but not started
<karolherbst>
it kinda sounds like working around the global issue gets overly complex
<karolherbst>
*global timeout
<karolherbst>
uhm.. dependent jobs I meant
<DavidHeidelberg[m]>
We usually have the issue with one or two jobs finishing too late, because all DUTs were busy when Marge requested them
<karolherbst>
but anyway, the most common (or so I think) reason for hitting the global timeout is not enough runners, and then developers reassigning to marge just makes it worse; I would be surprised if in the average case anything else is happening
<karolherbst>
yeah
<karolherbst>
my point is just: reaping the pipeline actually makes it worse instead of helping
<koike>
mupuf: regarding your idea of the daemon spawning gitlab-runners for a given tag, what would be the limit of runners on a given hardware runner?
<mupuf>
DavidHeidelberg[m]: hence why I believe we need this service that will just take over these jobs
<zmike>
karolherbst: the global timeout prevents job times from continually increasing
<zmike>
if there is no global timeout there is no incentive to keep job times down
<karolherbst>
well
<karolherbst>
it's fine to have that idea
<koike>
btw, we can collect some data to see if the bottleneck is missing DUTs or missing runners
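One cheap way to collect that data, assuming the jobs API's queued_duration field is an acceptable proxy for time spent waiting for a DUT/runner (project ID and token are placeholders):

```bash
#!/usr/bin/env bash
# Sketch: pull a few hundred recently finished jobs and sort them by how long
# they waited in the queue before a runner picked them up. Long waits only on
# hardware-tagged jobs would point at missing DUTs; long waits across the
# board would point at missing runners.
# PROJECT_ID and API_TOKEN are placeholders.
set -euo pipefail

GITLAB_URL="https://gitlab.freedesktop.org"
PROJECT_ID="REPLACE_ME"
API_TOKEN="REPLACE_ME"

for page in 1 2 3; do
    curl --fail --silent \
        "$GITLAB_URL/api/v4/projects/$PROJECT_ID/jobs?scope=success&per_page=100&page=$page" \
        --header "PRIVATE-TOKEN: $API_TOKEN"
done |
jq -r '.[] | select(.queued_duration != null)
           | "\(.queued_duration)\t\(.tag_list | join(","))\t\(.name)"' |
sort -rn | head -20
```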
<karolherbst>
but it doesn't seem to work
<mupuf>
koike: well, there are no limits. But you probably just want to register the runner, then remove it as soon as there are no jobs left to auto-accept
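For the register-then-remove part, a hedged sketch using gitlab-runner's non-interactive registration (the URL, tokens, tags, and exact flag set are assumptions, not a verified recipe):

```bash
#!/usr/bin/env bash
# Sketch: register a short-lived runner for one specific tag set, let it
# drain the stuck jobs, then unregister it again. Tokens and tags are
# placeholders, and the flag names are assumed from gitlab-runner's
# non-interactive registration mode.
set -euo pipefail

GITLAB_URL="https://gitlab.freedesktop.org"
REGISTRATION_TOKEN="REPLACE_ME"
TAGS="mesa-swrast,kvm"          # whatever tags the stuck jobs ask for
NAME="one-shot-$(date +%s)"

gitlab-runner register --non-interactive \
    --url "$GITLAB_URL" \
    --registration-token "$REGISTRATION_TOKEN" \
    --description "$NAME" \
    --tag-list "$TAGS" \
    --executor docker \
    --docker-image alpine:latest

# ... let it pick up and finish the queued jobs for those tags ...

gitlab-runner unregister --name "$NAME"
```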
<zmike>
it mostly does, other than this moderately rare period of time when all jobs are failing all the time
<zmike>
which is unrelated to the timeout
<karolherbst>
I honestly don't think this pipeline timeout actually matters, as if we had long queues of MRs, somebody would shout on IRC asking why those pipelines are running too long anyway
<karolherbst>
and if we run into that issue, we'll notice even without that global timeout
<karolherbst>
but the global timeout actually causes issues which are preventable
<zmike>
the issues you're seeing are just the symptoms of other issues
<karolherbst>
sure, we could have more runners and everything
<zmike>
jobs not being started when they should and such
<karolherbst>
my point is just, the global timeout makes those situations worse
<zmike>
in cases where ci is in a general state of shambles sure
<zmike>
but that's not the common case
<zmike>
mostly that's just the past few weeks with the infrastructure changes
<karolherbst>
I agree, but I also don't see that the global timeout actually helps with preventing long running pipelines or developers going nuts on the jobs
<karolherbst>
to me it just feels very pointless to have it
<karolherbst>
so it doesn't help, and when it does do something, it simply makes things worse
<zmike>
as someone who's had many jobs canceled by it over the years I disagree
vkareh has joined #freedesktop
ximion has joined #freedesktop
<tpalli>
zmike hit timeout with 26400 but eventually succeeded
AbleBacon has joined #freedesktop
<tpalli>
koike hit timeout with 26400 but eventually succeeded :)
<zmike>
I'm not saying it doesn't happen, I'm saying there's other issues that should be resolved
<eric_engestrom>
> ERROR: Job failed: failed to pull image "registry.freedesktop.org/mesa/mesa/debian/android_build:2023-06-24-agility-711--2023-08-30-bindgen-cli--d5aa3941aa03c2f716595116354fb81eb8012acb" with specified policies [if-not-present]: writing blob: adding layer with blob "sha256:9e2a801b4932d841170463d58ef5152fdfd791dfb413fe2d048af5c55b70c7a3": layer not known (manager.go:237:132s)
<eric_engestrom>
as well as the same issue as karolherbst above (connection timed out) but that's just bad network, not sure we can do much about it
* eric_engestrom
hopes everything will magically get better on monday with the gitlab upgrade :P
<karolherbst>
mhhh
<karolherbst>
maybe it's the runner
<eric_engestrom>
also: ERROR: Job failed (system failure): Error response from daemon: container d5bee8d21abbd143891fb80a4483aab1dd8e308e4efc48b2ad683939511a8752 does not exist in database: no such container (exec.go:78:0s)