ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
danvet has quit [Ping timeout: 480 seconds]
Haaninjo has quit [Quit: Ex-Chat]
agd5f_ has joined #freedesktop
agd5f has quit [Ping timeout: 480 seconds]
ximion has quit []
airlied has joined #freedesktop
<airlied>
what is up with all the 503 ci-fairy errors in mesa CI?
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
<bentiss>
airlied: I can see that the radosgw that handles the file uploads is getting OOMKilled at roughly the time of the job (last was at Tue, 17 Jan 2023 05:34:33 +0100)
<bentiss>
the pod is on the node where the backup happens, and it has Memory limits that might be a little bit too low
<bentiss>
I'll try to do something about it
<airlied>
bentiss: yeah I lost a lot of jobs to 503 around that time, maybe it's just that my timezone gets smashed by the backups :-P
<bentiss>
airlied: that is likely to happen, but still I do not see a big memory usage for this pod compared to the others, so getting OOMKilled is suspicious
<bentiss>
yep, it definitely gets killed at 2GB of usage, which is a bit low given the size of the uploads
<bentiss>
airlied: I have set the memory limits to 10 GB, we'll see if it behaves better
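The limit bump above can be sketched as a strategic-merge patch; the namespace, deployment, and container names here are assumptions, not the real cluster values, so the actual `kubectl` call is left commented out:

```shell
# Hypothetical sketch: raise the radosgw container's memory limit to 10Gi.
# "rook-ceph" / "radosgw" are placeholder names, not the real cluster objects.
PATCH='{"spec":{"template":{"spec":{"containers":[{"name":"radosgw","resources":{"limits":{"memory":"10Gi"}}}]}}}}'
echo "$PATCH"
# kubectl -n rook-ceph patch deployment radosgw --patch "$PATCH"
```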
<bentiss>
daniels: BTW I checked over the week end that the registry gc was working. And it does :)
<bentiss>
the only weird thing is that if a blob is already in the db, it can take up to 48h +- 3h to get cleaned up
<bentiss>
but it eventually gets cleaned
danvet has joined #freedesktop
<daniels>
bentiss: awesome!
<bentiss>
daniels: still no luck at migrating existing repos to the new db though FWIW
<mupuf>
daniels: thanks for the headsup!
<mupuf>
bentiss: that is wonderful news! 48h is not an issue when we would probably want to keep images for 30 days or so
<bentiss>
mupuf: yeah, as long as we do not keep images forever, even a week or a month would have been fine ;)
<mupuf>
exactly :)
* mupuf
is checking out if the new runner interface is compatible with automatic creation of runners
<mupuf>
but I love the fact that runners would be associated to users!
<bentiss>
daniels, mupuf: re gitlab-runner on rootless podman: the problem I have now is that I can't leverage kvm on the runners, which is problematic
<bentiss>
I think I can do it on my local fedora, so I must be missing something
<mupuf>
yeah, you probably are: is the podman user in the kvm group?
<bentiss>
yes
<mupuf>
and next: is the gitlab runner using --privileged? If not, did you add -v /dev/kvm:/dev/kvm?
<bentiss>
I tried both IIRC and both were failing
<mupuf>
weird...
<bentiss>
yeah. Well I was doing the tests yesterday evening, so I must have missed something
<mupuf>
the thing with fedora is that it was tweaked to work nicely in containers :s
<bentiss>
heh, yeah :)
<mupuf>
so, yeah for doing the work, but then we need to figure out what we need to do to make it work on other distros :D
<mupuf>
at least, reading the dockerfile of the podman or fedora images helped me
<bentiss>
so, on fedora (as a user): "podman run --rm -ti --device /dev/kvm registry.freedesktop.org/freedesktop/ci-templates/qemu-build-base:2022-08-22.0" works fine
* mupuf
is checking it out on arch
<mupuf>
any simple command I could run to check if kvm works?
<mupuf>
or should I just check permissions in /dev?
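A host-side sanity sketch for the second option: the two usual prerequisites for a rootless runner are membership in the `kvm` group and read/write access to `/dev/kvm` itself.

```shell
# Check the two usual prerequisites for rootless kvm access on the host:
# kvm group membership, and read/write access to the /dev/kvm device node.
if id -nG | grep -qw kvm; then groupmsg="in kvm group"; else groupmsg="not in kvm group"; fi
if [ -r /dev/kvm ] && [ -w /dev/kvm ]; then devmsg="/dev/kvm is accessible"; else devmsg="/dev/kvm is not accessible"; fi
echo "$groupmsg"
echo "$devmsg"
```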
<bentiss>
mupuf: once in that container above, run /app/vmctl start
<bentiss>
permissions of /dev/kvm are useless because they are mapped
<mupuf>
yeah
<mupuf>
seems to have worked
* bentiss
did a chmod o+rw /dev/kvm on the equinix host
<mupuf>
qemu did not complain about kvm, and ssh worked
<mupuf>
bentiss: what's the current rights on /dev/kvm on the host?
<mupuf>
crw-rw-rw- 1 root kvm 10, 232 Jan 17 10:26 /dev/kvm --> this is what I have on arch
<bentiss>
and yes, changing it to o+rw sorts it out
<bentiss>
I guess I'll just add that permissions to /dev/kvm, and move on
<bentiss>
the benefits of using rootless are much better than trying to fix that issue on podman
<mupuf>
for sure!
<mupuf>
thanks for working on this :)
<bentiss>
hopefully this will reduce the costs (because I won't have to wipe the registry cache every week)
<mupuf>
why is wiping the registry costing?
<mupuf>
costing anything*
<mupuf>
Do you have to create a new volume, then remove the old one?
<bentiss>
I am wiping the registry *cache*
<bentiss>
so every Monday, we rebuild it and pull data from GCS
<bentiss>
and I can not switch to harbor caching right now because docker doesn't allow me to rewrite the mirror. I was overwriting the IP address of the registry to force use of the cache :)
<bentiss>
so if I switch to podman, I can use rootless containers AND harbor as a cache, which has smart gc and will delete blobs unused in the past 7 days
<bentiss>
so in theory, no more wipe of the cache
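The podman side of that plan can be sketched as a `registries.conf` drop-in; the drop-in normally lives in `/etc/containers/registries.conf.d/`, but it is written to the current directory here so the sketch runs unprivileged, and the `harbor.freedesktop.org/cache` project path is taken from the discussion rather than verified:

```shell
# Sketch: teach podman to pull docker.io images through harbor as a cache.
# Real location: /etc/containers/registries.conf.d/harbor-cache.conf
cat > harbor-cache.conf <<'EOF'
[[registry]]
location = "docker.io"

[[registry.mirror]]
location = "harbor.freedesktop.org/cache"
EOF
```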
* bentiss
needs to check if network-manager is happy with rootless podman (and we probably need it to not be happy)
<mupuf>
and the registry also seems to be affected
<bentiss>
so it might be a ceph issue
<bentiss>
or... the db disk full
<bentiss>
yeah, 98.26% full, let me bump it
<mupuf>
well, seems to have instantly helped :)
___nick___ has joined #freedesktop
<bentiss>
indeed :)
<mupuf>
thanks a lot!
<bentiss>
no worries :)
<bentiss>
so... I reverted all of the newly installed packages, rebooting m3l-10, and I'll see if there is any improvement
itoral has quit [Remote host closed the connection]
<bentiss>
and if it doesn't work I guess I'll just throw away m3l-10 and create m3l-11
<bentiss>
alright, deploying m3l-11 because 10 seems to take twice as long as the others
___nick___ has quit []
agd5f_ has joined #freedesktop
___nick___ has joined #freedesktop
___nick___ has quit []
___nick___ has joined #freedesktop
agd5f has quit [Ping timeout: 480 seconds]
agd5f has joined #freedesktop
agd5f_ has quit [Ping timeout: 480 seconds]
<bentiss>
brand new runner, and not necessarily faster... the only difference now is the kernel and gitlab-runner
<mupuf>
bentiss: :s
<mupuf>
Not sure if I would prefer it being a gitlab runner or kernel issue...
<mupuf>
well, I would prefer gitlab runner, easier to downgrade and maintain
<bentiss>
well, I don't see how gitlab-runner could be involved, unless they changed how they spin up their containers
<bentiss>
actually, one thing I haven't checked is the virglrenderer script, maybe they throttle it for non-MR pipelines
<daniels>
bentiss: we don't throttle based on MRs, no
<daniels>
bentiss: we do throttle on $FDO_CI_CONCURRENT tho
<daniels>
well, not throttle, but
<bentiss>
shit, how to spend a day: FDO_CI_CONCURRENT=1 on my test runner :(
<bentiss>
how to *lose*
<daniels>
oh no ...
<bentiss>
daniels: do I keep m3l-10 or 11?
<daniels>
no coffee this morning?
<daniels>
bentiss: if 11 works then keep 11
* bentiss
doesn't drink coffee
<bentiss>
daniels: OK
<bentiss>
FWIW, I took the gitlab-runner config from 10 (after stopping the gitlab-runner service) so all current runners are still there (placeholder and whatnot)
<daniels>
ooh, nice
<bentiss>
I don't understand why the gitlab-runner starts a cerbero job that is not in the list of running jobs and the global runner is disabled for this machine :(
<bentiss>
anyway... let's try with podman (as root) and FDO_CI_CONCURRENT set to 8 now
<bentiss>
and it does accelerate the whole thing, finally. thanks daniels for pointing this one out, I read the config files too many times thinking they were identical
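The fix amounts to one `environment` entry at the `[[runners]]` level of the runner's `config.toml`; the runner name is hypothetical, and the file is written to the current directory here rather than `/etc/gitlab-runner/config.toml` so the sketch runs unprivileged:

```shell
# Sketch: pin FDO_CI_CONCURRENT for a runner via a [[runners]]-level
# environment entry (real file: /etc/gitlab-runner/config.toml).
cat > config.toml <<'EOF'
[[runners]]
  name = "m3l-11"                          # hypothetical runner name
  environment = ["FDO_CI_CONCURRENT=8"]    # expose 8 slots to jobs
EOF
```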
<bentiss>
daniels: so current plan: use podman as root for all gitlab runners, and force the use of harbor.fd.o as the registry cache (it might also solve the need for registry-direct.fd.o)
<bentiss>
daniels: then continue the tests for rootless podman
<daniels>
bentiss: sounds like a great plan :)
<bentiss>
daniels: and kill with fire registry-mirror too :)
<daniels>
\o/
<bentiss>
mupuf: ^^ we need an update on vm2c for that btw, we need s/registry-mirror\.freedesktop\.org/harbor.freedesktop.org\/cache/
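As a runnable one-liner, using `#` as the sed delimiter to avoid escaping the slashes (the image reference is a made-up example):

```shell
# Rewrite a registry-mirror image reference to go through the harbor cache.
in='image: registry-mirror.freedesktop.org/mesa/mesa:latest'
out=$(echo "$in" | sed 's#registry-mirror\.freedesktop\.org#harbor.freedesktop.org/cache#')
echo "$out"
```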
<mupuf>
bentiss: ... I feel for you :s
<bentiss>
well, that's only on me that one :(
<mupuf>
I can work on the vm2c update!
<bentiss>
thanks
<bentiss>
daniels: can you explain what is the purpose of placeholder-equinix?
<bentiss>
because I probably killed 2 gstreamer pipelines by killing jobs that were triggered on this runner ;)
<daniels>
bentiss: we can probably start getting rid of that now - we used it when we didn't have the ability to trigger dependent pipelines, so when e.g. virglrenderer wants to trigger a mesa CI pipeline, it can do that without occupying a full slot - same for gst which kicks off dependent builds via cerbero, polls on the status, then reports back to the parent
<bentiss>
oh, OK
<daniels>
(there are also some scheduled jobs for mesa etc which just run some pretty trivial python processing to shove data into Grafana and shouldn't occupy a full slot)
<daniels>
basically, stuff that can run in the background with zero effect
<bentiss>
k, thanks
<__tim>
ah ok, so that's what happened to those jobs
<__tim>
we were a bit puzzled :)
<bentiss>
__tim: sorry, I had to kill gitlab-runner, and those jobs were blocking me from doing it
<bentiss>
__tim: I just changed the config on that machine and should not have to re-change it in the short term, so you might want to restart it
<__tim>
ok, thanks
<bentiss>
well, I haven't pruned the disk, but I guess I can live with the symbolic link
<alanc>
ERROR: Job failed (system failure): Error response from daemon: container create: unable to find 'docker.io/library/sha256:1ce0b2a729de804418dc7e2ac5f894e099b056c77ddd054a1ea0cf20cb685b00' in local storage: no such image (docker.go:534:0s)
<bentiss>
alanc: yes, I have seen that. I have disabled m3l-11 for now
<bentiss>
and checking on m3l-12 ATM
<alanc>
ah, my pipeline just passed, thanks!
<bentiss>
alanc: it went on a different runner :)
<alanc>
nothing like a little unexpected failure to add to the stress of pushing out a security fix release/advisory 8-)
<bentiss>
sorry :(
<alanc>
you fixed it, nothing to apologize for there
<bentiss>
daniels: I might as well nuke m3l-11 and start from scratch again like I am doing with the rest of the flock
<robclark>
bentiss: it was a bunch of random jobs.. but I _think_ it has recovered
<daniels>
bentiss: sounds good, thanks :)
<bentiss>
daniels: more like I'm going to move the placeholder-equinix to m3l-12, and then nuke m3l-11 and we'll monitor how things are going before migrating 8 and 9
<bentiss>
daniels: actually it was 3 years ago IIRC, I had 2 xdc presentations about it and I wasn't there this year
<daniels>
ci-templates \o/
ElectricJozin has joined #freedesktop
<ElectricJozin>
How can I view the session bus with busctl?
ElectricJozin has quit []
ElectricJozin has joined #freedesktop
ElectricJozin has quit [Remote host closed the connection]
ElectricJozin has joined #freedesktop
ElectricJozin has quit []
ElectricJozin has joined #freedesktop
<ElectricJozin>
How can I view the session bus with busctl?
AbleBacon has joined #freedesktop
ElectricJozin has quit [Quit: Leaving]
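ElectricJozin's question never got an answer in-channel. `busctl` talks to the system bus by default; the `--user` flag switches it to the session bus. A minimal sketch, guarded so it degrades gracefully outside a graphical session:

```shell
# List session-bus names with busctl; falls back to a message when busctl
# is missing or no session bus is reachable.
if command -v busctl >/dev/null 2>&1; then
  result=$(busctl --user list 2>&1 || echo "no session bus reachable")
else
  result="busctl not installed"
fi
echo "$result"
```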
<mupuf>
The problem with ci-templates and IGT is that it doesn't solve anything for it. IGT wants to build a container per push, for simplicity and reproducibility. Now that harbor is in place, isn't that a perfectly acceptable solution?
<mupuf>
Just to be clear, every push generates one new layer on top of the base container
<mupuf>
And the base container is seldom regenerated
<mupuf>
So each push generates something like 5MB stored in a registry rather than in gitlab's artifacts
<mupuf>
What I think IGT should do is use buildah rather than docker
<mupuf>
As for the base container, they can be generated using ci templates, or buildah. It shouldn't matter
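The layer-per-push flow mupuf describes can be sketched as a buildah dry run; the base image name, tag scheme, and copy paths are all invented for illustration:

```shell
# Dry-run plan for a per-push IGT build: one thin layer on top of a
# seldom-rebuilt base image. All names here are hypothetical.
BASE=registry.freedesktop.org/drm/igt-gpu-tools/base:latest
TAG=igt:per-push
plan="buildah from $BASE
buildah copy <ctr> build/ /opt/igt
buildah commit <ctr> $TAG
buildah push $TAG"
echo "$plan"
```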
<mupuf>
I can take a crack at it tomorrow, if you want
<DavidHeidelberg[m]>
How is the LAVA tag passed as a tag to GitLab CI?
<DavidHeidelberg[m]>
exactly, the "devicetype" is passed as a "tag" in GitlabCI
<daniels>
DavidHeidelberg[m]: yeah, each LAVA device tag has a matching GitLab CI tag
<DavidHeidelberg[m]>
but how does it get propagated? I don't seem to see the new device
<daniels>
ah, yes
<daniels>
I'll set that up in the morning - it's not automatic, you have to log in to brea and register a new runner type
<DavidHeidelberg[m]>
thanks, or if you want to, I can do it (if you can give me some rights for this task)
<DavidHeidelberg[m]>
(ofc tomorrow)
Haaninjo has joined #freedesktop
<daniels>
yeah, I can walk you through it tomorrow
<jenatali>
When that happens, is there harm in hitting the manual merge button?
<anholt>
she's still on that MR, and it looks good to me. ack.
<jenatali>
Good enough for me
<jenatali>
Which, harkening back to the convo about permissions for main, is a good reason to keep allowing devs access, until that bug is fixed
___nick___ has quit [Ping timeout: 480 seconds]
mvlad has quit [Remote host closed the connection]
ybogdano has quit [Ping timeout: 480 seconds]
<DavidHeidelberg[m]>
Marge bot looks dead, anyone seeing her working in any MR?
<jenatali>
David Heidelberg: Apparently she still waited for the timeout on mine, even after I merged it, because she didn't see the pipeline making progress