daniels changed the topic of #freedesktop to: GitLab is currently down for upgrade; will be a while before it's back || https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
todi1 has joined #freedesktop
todi has quit [Ping timeout: 480 seconds]
lkundrak has quit [Ping timeout: 480 seconds]
marcheu has quit [Ping timeout: 480 seconds]
dri-logger has quit [Ping timeout: 480 seconds]
glisse has quit [Ping timeout: 480 seconds]
lkundrak has joined #freedesktop
marcheu has joined #freedesktop
dri-logger has joined #freedesktop
glisse has joined #freedesktop
dri-logg1r has joined #freedesktop
glisse has quit [Ping timeout: 480 seconds]
marcheu has quit [Ping timeout: 480 seconds]
lkundrak has quit [Ping timeout: 480 seconds]
dri-logger has quit [Ping timeout: 480 seconds]
lkundrak has joined #freedesktop
marcheu has joined #freedesktop
columbarius has joined #freedesktop
glisse has joined #freedesktop
co1umbarius has quit [Ping timeout: 480 seconds]
<zmike> followup wtf is going on with the zink-lvp job https://gitlab.freedesktop.org/mesa/mesa/-/jobs/48516368
<zmike> I don't remember this taking more than 15 minutes
ximion has quit [Quit: Detached from the Matrix]
alpernebbi has quit [Ping timeout: 480 seconds]
alpernebbi has joined #freedesktop
feto_bastardo has quit [Quit: quit]
atticf_ has joined #freedesktop
bmodem has joined #freedesktop
AbleBacon has quit [Read error: Connection reset by peer]
atticf has quit [Ping timeout: 480 seconds]
swatish2 has joined #freedesktop
sima has joined #freedesktop
lsd|2 has joined #freedesktop
tzimmermann has joined #freedesktop
<airlied> bentiss, DavidHeidelberg[m] : so the llvmpipe-traces started flaking 3 days ago I think, but I've no idea if it's caused by overloaded infrastructure or a mesa commit
<airlied> jobs that never timed out in the past have started to timeout
<mupuf> zmike: "System load: 24.49 38.54 43.95", so it is definitely busy
<airlied> I wonder if the same thing is killing the traces timeouts
<mupuf> aren't traces depending heavily on s3? We know that service there has been flaky...
<mupuf> bentiss was saying earlier that he would like to try switching the runners back to debian and see if it helped with network connectivity
<airlied> well I think the traces all download to the image fine, it's just that when running them, various ones seem to flake out
<airlied> but it's pretty scattered which ones
<mupuf> I thought they were streamed, not just downloaded then executed
<mupuf> this was to reduce memory usage on drive-less devices
<airlied> oh actually yes they are
<airlied> so it could be that
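A minimal sketch of why streaming matters here, with `$TRACE_URL` and `replay-tool` as hypothetical placeholders (the actual replayer and trace location are not shown in the log):

```sh
# Download-then-replay: network trouble surfaces early, as a fetch error.
curl -sSfL -o trace.bin "$TRACE_URL" && replay-tool trace.bin

# Streamed replay: the tool reads from the pipe, so a stalled or dropped
# connection mid-replay shows up as a replay flake rather than a fetch error.
curl -sSfL "$TRACE_URL" | replay-tool /dev/stdin
```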
An0num0us has joined #freedesktop
<mupuf> airlied: that's what you meant, right? https://gitlab.freedesktop.org/mesa/mesa/-/jobs/48527599
<airlied> yes that's it, there's been a bunch of those over the last 3 days
<bentiss> airlied: yes, as mupuf said, I strongly suspect this is related to coreos. At least, I don't see such errors after a quick look at the last remaining debian runner :/
<bentiss> it could very well be my network configuration on the equinix machines, or interactions with firewalld
<bentiss> but if I switch back to debian, I can use the equinix provided base config, which already has the network setup, so this might help us solve that issue
Ahuj has joined #freedesktop
alatiera has quit [Quit: The Lounge - https://thelounge.chat]
alatiera has joined #freedesktop
<DavidHeidelberg[m]> Yes, traces timeouts are usually because of s3 problems these days
An0num0us has quit [Ping timeout: 480 seconds]
mripard has joined #freedesktop
pkira has joined #freedesktop
mripard has quit []
mripard has joined #freedesktop
samuelig_ has quit []
samuelig has joined #freedesktop
Haaninjo has joined #freedesktop
bmodem has quit [Ping timeout: 480 seconds]
<bentiss> damn... I am trying to reinstall the runners with debian 12 and docker instead of podman, and the first attempt is already servicing jobs properly... this is highly suspicious (that I manage to do something right on the very first attempt)
<mupuf> lol
<mupuf> well, let's cross our fingers that it is indeed just a weird incompatibility between gitlab and podman
<mupuf> (and debian vs coreos for the network reliability)
<mupuf> the thing is, my runners are experiencing weird network issues too... but they are not hosted in datacenters
<bentiss> actually, I am leaning toward a containerization network issue (so podman, in coreos)
Haaninjo has quit [Quit: Ex-Chat]
<mupuf> crossing fingers!
<bentiss> FWIW, I've now turned off all of the coreos x86_64 runners, and enabled only debian_12+docker ones. We'll see if that helps
<mupuf> thanks!
<daniels> bentiss: thank you! fingers crossed
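One way to do that kind of switch without unregistering anything is to pause runners via the GitLab runners API; a hedged sketch, assuming an admin-scoped token in `$GITLAB_TOKEN` and `123` as an example runner id:

```sh
# Pause a runner so it stops picking up new jobs (running jobs finish normally).
curl --request PUT \
     --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
     --form "paused=true" \
     "https://gitlab.freedesktop.org/api/v4/runners/123"

# Re-enable it later.
curl --request PUT \
     --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
     --form "paused=false" \
     "https://gitlab.freedesktop.org/api/v4/runners/123"
```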
<bentiss> daniels: just FYI, while creating aarch64: "Error: Could not create Device: 422 There aren't available servers in that metro"
<mupuf> oh boy, so what can we do then?
<mupuf> keep retrying until a customer releases a node?
<bentiss> nothing. We already have our 2 runners reserved, but I'm surprised we are at capacity
<bentiss> mupuf: no, we can reinstall the current ones, it's just slightly more manual work to do
<mupuf> I see
<mupuf> fdo-equinix-m3l-21 is paused
<bentiss> mupuf: fdo-equinix-m3l-21 is gone
<bentiss> :)
<mupuf> :)
<mupuf> bentiss: can I remove fdo-placeholder-equinix-17 ?
<mupuf> it was created 4 months ago
<mupuf> and last seen one month ago
<bentiss> mupuf: it's a reminder that we need to set this one up again, so please no
<mupuf> oki docki
<mupuf> what's the story there?
<bentiss> I can't remember which parameters this one needs (but an unusually large number of parallel jobs, and capacity for jobs that just wait on other components)
<alatiera> I think I managed to switch all the gst runners to debian12+podman a while ago
<bentiss> mupuf: some jobs are not really gitlab-ready, and they are using the placeholder (IIRC to start another pipeline), and the job just waits for the other job to finish. Which means it's not using any resources
<mupuf> oooohhhh
<bentiss> alatiera: and do you have timeout issues while accessing gitlab.fd.o?
<alatiera> bentiss nothing worse than when using docker
<alatiera> we had a couple of issues between htz <-> packet in general, but nothing changed much in docker v podman deployments
<bentiss> alatiera: k, and nothing really bad in the past few weeks?
<alatiera> not that I've seen or heard
<bentiss> because for the last couple (maybe 3) of weeks, the equinix runners have been behaving really badly
<alatiera> all the failed jobs were on packet runners
<alatiera> (also debian 12 has way older podman)
<bentiss> yeah, I suspect an automated update of coreos triggered a corner case in the network stack on new kernel/podman
<bentiss> which is why I'm reverting back to debian
<bentiss> the docker change is because I don't have a strong justification for podman, and it's a tad simpler to deploy
<bentiss> anyway, we'll see if the runners are behaving better now
<alatiera> what I haven't seen in a while with the podman runners is the weird "docker failed to connect to socket" error
<bentiss> rebooting ml-27, it looks like all of its jobs are failing (and that's the only difference with 25 and 26, just a reboot)
<alatiera> is there a coreos runner that isn't misbehaving?
<bentiss> alatiera: no, all of them were having issues
<bentiss> and the new cluster I tried to migrate to was also showing some weird networking errors
<bentiss> like 10 minutes timeouts on some operations
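For that class of problem, curl's timing variables are a cheap way to separate DNS, TCP connect, TLS, and first-byte delays when run from inside a runner or a job container; a minimal sketch:

```sh
# Break a request to gitlab.fd.o down into per-phase timings.
curl -o /dev/null -sS \
     -w 'dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
     https://gitlab.freedesktop.org/
```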
<alatiera> if we know it started 3 weeks ago and you still have a machine around, you could try deploying an older ref/image of coreos
<bentiss> not sure I want to spend that much time
<bentiss> but yes, it's a good idea
<alatiera> understandable
<alatiera> haven't touched rpm-ostree systems in a while, else I'd paste the commands, it's 2-3 commands to roll back
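For reference, a sketch of what that rollback looks like on an rpm-ostree/CoreOS node, assuming the previous deployment is still on disk (the version string below is just an example):

```sh
# Show the current and previous deployments.
rpm-ostree status

# Boot back into the previous deployment on the next reboot.
sudo rpm-ostree rollback --reboot

# Or pin a specific older version by name (example version string).
sudo rpm-ostree deploy 38.20230806.3.0
```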
<daniels> bentiss: still DC or NY?
<bentiss> alatiera: yeah, it's not so much that I don't know how to do it, but more that I don't want to spend time debugging it
<bentiss> daniels: DC
<daniels> also, placeholder-job takes like 1024 jobs, with the runner-global concurrent limit bumped accordingly
<daniels> indeed they’re just used for jobs which launch external things and poll
<bentiss> daniels: I think we should just register the placeholder job in the basic deployment, so we don't forget this in the future
<bentiss> daniels: but that's something we can deal with when you are back
<alatiera> in gst we also use the placeholder-job for linters that execute within a minute fyi
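Since the placeholder runner mostly hosts jobs that sit and wait (or very short linters), it mainly needs a very high job limit rather than CPU. A hedged sketch of registering one, with the 1024 limit taken from the discussion above and the tag name and token being assumptions:

```sh
# Register a "placeholder" runner that can host many mostly-idle jobs at once.
sudo gitlab-runner register \
     --non-interactive \
     --url https://gitlab.freedesktop.org \
     --registration-token "$REGISTRATION_TOKEN" \
     --executor docker \
     --docker-image alpine:latest \
     --tag-list placeholder-job \
     --limit 1024
# The runner-global `concurrent` value in /etc/gitlab-runner/config.toml has to
# be bumped accordingly, otherwise the per-runner limit is never reached.
```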
lsd|2 has quit []
<bentiss> alatiera: actually the 2 arm equinix runners are still running coreos. I can try your suggestion. It'll be less common to hit the issue there, but at least the folks using those runners are mostly in here
<daniels> bentiss: yeah totally, we can take them on all the runners
atticf_ has quit [Quit: leaving]
atticf has joined #freedesktop
<bentiss> mupuf, daniels, whot: FYI: I've just enabled "admin mode", so we can get API tokens without admin privileges. Please let me know if this is too painful
<mupuf> YES!
<mupuf> wonderful, thanks!
<daniels> \o/
bmodem has joined #freedesktop
bbhtt- is now known as bbhtt
An0num0us has joined #freedesktop
<zmike> mupuf: that's definitely not ideal
<mupuf> zmike: ECONTEXT
<mupuf> oh, the system load?
<zmike> yes
<zmike> well, unless it's just from the job itself, but that seems unlikely
<mupuf> there are apparently 64 cores, so if the runner was really over-used, then maybe IO is a little slower than expected?
<zmike> if only we had some way to graph job runtime over time
<daniels> it’s fine? they have 64 threads …
<daniels> right
<zmike> then something else is going seriously wrong
<zmike> the job never used to take 20+ mins
vkareh has joined #freedesktop
<daniels> if the runner networking was known to be totally broken and we know the traces stream down, prob that?
<zmike> not talking about trace job
<zmike> the zink-lvp job is apparently taking 20+ minutes now
<zmike> which is not the time I remember it taking
<daniels> ah yes
<daniels> the 131s test seemed particularly egregious
AbleBacon has joined #freedesktop
<bentiss> I am stress testing the infra (with ci-templates), and I got some news:
<bentiss> good news: the debian+docker runners don't seem to suffer from the timeout issues
<bentiss> bad news: second time in 3h that one of these new runners just freezes without any way to access it or reboot it
<bentiss> informational news: arm runners, which are still running podman/coreos are still affected by the timeout issues and the "let's mark that job correct but silently fail" (https://gitlab.freedesktop.org/bentiss/ci-templates/-/jobs/48547219)
<alatiera> freeze you say
<alatiera> we had an issue with a specific machine in the gst runners, but it seemed more of a cursed hw issue
<bentiss> alatiera: second in a row, that's suspicious
<alatiera> but if your debian 12 runner is also locking up..
<bentiss> yeah, these are brand new debian 12 runners that are locking up
<alatiera> what machine is it
<bentiss> and no way to restore them through the rescue console or even reinstall
<bentiss> so I guess bad HW or rack?
<alatiera> our problematic one is a 7950x, the others are either an epyc 7502p or a 5950x, all in the same deployment, no issues
<bentiss> EPYC 7513
<bentiss> I think it's an equinix issue
<bentiss> huh, using podman instead of docker makes the runner start immediately, whereas docker required a reboot, otherwise all jobs failed :)
<alatiera> oh lol, never noticed
* alatiera always reboots after updates
<alatiera> there were a couple of quirks with apparmor and podman btw
<bentiss> that one is happy for now, we'll see if there are issues :)
<alatiera> some policy that isn't installed by default or needs a config change, I can go dig up the setup if you need
<bentiss> that would be nice (not sure if apparmor is enabled by default)
<bentiss> apparmor_status says it's enabled
<alatiera> `apt install -y curl jq podman podman-docker uidmap apparmor apparmor-utils containers-storage gitlab-runner slirp4netns dbus-user-session` and `apt remove fuse-overlayfs` is what we do
<bentiss> besides containers-storage, podman-docker and apparmor-utils that's what we have
<bentiss> without fuse-overlayfs removal
<bentiss> I'll keep an eye
<alatiera> I think fuse-overlayfs is only an issue if you try to run unpriv
<alatiera> which is funny cause that's what it was introduced for
<bentiss> heh
<alatiera> apparmor-utils provides some binary that's only available in a root/sudo -i path, I don't recall
<alatiera> but podman was complaining about missing it at some point
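A quick way to check the apparmor side of a Debian 12 podman runner, assuming root access on the host (a sketch, not the gst runners' actual setup):

```sh
# Confirm AppArmor is enabled and see which profiles are loaded/enforced.
sudo apparmor_status

# podman reports whether it detected AppArmor support at all.
sudo podman info | grep -i apparmor

# Run a throwaway container and check which confinement profile it got.
sudo podman run --rm alpine cat /proc/self/attr/current
```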
mripard has quit [Quit: mripard]
mripard has joined #freedesktop
agd5f has quit [Remote host closed the connection]
agd5f has joined #freedesktop
psukys has joined #freedesktop
mripard has quit [Quit: mripard]
<zmike> is this the new normal or ?
<bentiss> zmike: how long is it supposed to take?
<zmike> like 15 mins
<bentiss> zmike: could be an overloaded runner too, I see virgl-renderer jobs taking a long time on this runner
<zmike> these are all cpu-only jobs, so they don't have dedicated runners
<zmike> it seems like something has changed recently to massively increase load here
<bentiss> zmike: I see a lot of cpu use from crosvm
<bentiss> that runner is on its knees basically
<bentiss> Load average: 63.93 (64 cpus)
<zmike> 🤕
<zmike> what can be done about this?
<bentiss> so there is probably something wrong in their .gitlab-ci.yaml
<zmike> tintou: ?
<bentiss> they are probably not honoring FDO_CI_CONCURRENT
<zmike> or maybe not a tintou job specifically 🤔
<bentiss> zmike: and FTR, I'm switching the runners back to debian, because coreos was showing some serious network issues
<zmike> cool
<zmike> bentiss: how to evaluate the FDO_CI_CONCURRENT thing?
<bentiss> (but that might also explain why some behaviors changed)
<tintou> We're using the same scripts as in Mesa here
<bentiss> hmm... looks like we have 8 crosvm processes running, each of them with '-c 8', so that's 64 in total
<bentiss> (or could be that each subprocess is reporting the same command line too)
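The usual contract is that each job sizes its parallelism from FDO_CI_CONCURRENT (the per-job budget the runner hands down) rather than from the full core count; if 8 concurrent jobs each assume 8 CPUs of their own on a 64-core box, you get exactly this pile-up. A minimal sketch of the pattern, with `some-parallel-workload` as a hypothetical placeholder for whatever does the parallel work:

```sh
# Respect the per-job CPU budget advertised by the runner; only fall back to
# the full core count when FDO_CI_CONCURRENT is not set at all.
THREADS="${FDO_CI_CONCURRENT:-$(nproc)}"

# Illustrative placeholder: pass that budget through (guest vCPU count,
# test-runner job count, make -j, ...) instead of assuming the whole machine.
some-parallel-workload --jobs "$THREADS"
```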
agd5f has quit [Remote host closed the connection]
tzimmermann has quit [Quit: Leaving]
agd5f has joined #freedesktop
vkareh has quit [Quit: WeeChat 3.6]
vkareh has joined #freedesktop
vkareh has quit []
vkareh has joined #freedesktop
vkareh has quit []
<tintou> Ah, I see what is happening
vkareh has joined #freedesktop
bmodem has quit [Ping timeout: 480 seconds]
<zmike> oh ?
bmodem has joined #freedesktop
<tintou> What surprises me is that you're seeing -c 8
<tintou> That should only be possible for venus
ximion has joined #freedesktop
ximion has quit [Quit: Detached from the Matrix]
<zmike> valve ci looks to be overloaded atm, which is probably going to fail any merges that need to go through it
<DavidHeidelberg[m]> ^^^ mupuf:
<mupuf> zmike: seems to be the mesa releases that use them
<mupuf> but they don't have anything else in the queue AFAICT
<mupuf> anyway, a 4th navi21 should appear this week
<zmike> I think until that time the jobs should be removed from pre-merge if a single adjacent pipeline can trigger merge failures
<mupuf> it wouldn't have solved this issue (3 pipelines scheduled in the same hour), but... it may have helped
kxkamil has quit []
<mupuf> that was 2
<zmike> 2 is not a very high number
<zmike> if it's known the hardware is so limited then either the jobs shouldn't be in pre-merge or they should be queuing much more aggressively to prevent the pipeline from starting at all
<mupuf> queueing more aggressively?
<zmike> whatever you want to call it
<zmike> if there aren't runners available then the pipeline should wait
<mupuf> right, that's another option for marge...
<zmike> if we're not doing that already then we should be
<zmike> and marge should block all other jobs from starting in turn
<mupuf> yeah, I proposed that marge should cancel any unstarted jobs using the runners it wants to use, then requeue them right after
<mupuf> that would allow it to jump the queue
<zmike> seems like a good idea
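That cancel-and-requeue idea maps fairly directly onto the GitLab jobs API; a hedged sketch of the core calls (assuming `$PROJECT_ID`, an api-scoped token in `$GITLAB_TOKEN`, and `jq` available; this is not an existing Marge feature):

```sh
# List jobs that are queued but not yet running in the project.
curl -sS --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
     "https://gitlab.freedesktop.org/api/v4/projects/$PROJECT_ID/jobs?scope[]=pending" |
  jq -r '.[].id' |
while read -r job; do
  # Cancel each pending job ...
  curl -sS --request POST --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
       "https://gitlab.freedesktop.org/api/v4/projects/$PROJECT_ID/jobs/$job/cancel" > /dev/null
  # ... and retry it right away so it re-enters the queue behind Marge's jobs.
  curl -sS --request POST --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
       "https://gitlab.freedesktop.org/api/v4/projects/$PROJECT_ID/jobs/$job/retry" > /dev/null
done
```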
alatiera_afk[m] is now known as alatiera[m]
<zmike> but I'm looking at https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23217 now and it's now going to be 6 attempts to merge
<zmike> which is just not acceptable for ci
<zmike> so we need to set a path towards resolving these issues
<mupuf> anyway, adding the 4th runner will improve execution time a bit, but until job prioritization and preemption are implemented, I can never guarantee they will be available
<mupuf> +1 for that
kxkamil has joined #freedesktop
alanc has quit [Remote host closed the connection]
Ahuj has quit [Ping timeout: 480 seconds]
<eric_engestrom> the generic runners look completely overwhelmed, with the sanity job having to wait 15min (https://gitlab.freedesktop.org/mesa/mesa/-/jobs/48562960) before anything in the pipeline can start when trying to merge
<zmike> this is a capacity issue then?
<eric_engestrom> someone is running a lot of cpu jobs
<eric_engestrom> or "someones"
<eric_engestrom> iirc we've often had issues with gstreamer (am I remembering this right?) when they have releases or something
<eric_engestrom> might be that, and it will get better once they are done
agd5f has quit [Read error: Connection reset by peer]
agd5f has joined #freedesktop
bmodem has quit [Ping timeout: 480 seconds]
<anholt_> yeah, right now it's the piglit rebuild tying up 6 slots on the general CPU runners.
<anholt_> gstreamer just had 2 slots used when I looked just now.
alanc has joined #freedesktop
Haaninjo has joined #freedesktop
<__tim> eric_engestrom, there's not a single gstreamer job running at the moment (and not much activity in general), I think it's all mesa at the moment :)
<mupuf> Let's hope that fewer failed pipelines due to the bad runners will lead to lower DUT utilisation
pkira has quit [Ping timeout: 480 seconds]
psukys has quit [Ping timeout: 480 seconds]
lsd|2 has joined #freedesktop
swatish21 has joined #freedesktop
swatish2 has quit [Ping timeout: 480 seconds]
vkareh has quit [Ping timeout: 480 seconds]
swatish21 has quit [Ping timeout: 480 seconds]
ximion has joined #freedesktop
An0num0us has quit [Ping timeout: 480 seconds]
sima has quit [Ping timeout: 480 seconds]
emery has left #freedesktop [https://quassel-irc.org - Chat comfortably. Anywhere.]
Haaninjo has quit [Quit: Ex-Chat]