ChanServ changed the topic of #freedesktop to: infrastructure and online services || for questions about projects, please see each project's contact || for discussions about specifications, please use or
danvet has quit [Ping timeout: 480 seconds]
Haaninjo has quit [Quit: Ex-Chat]
agd5f_ has joined #freedesktop
agd5f has quit [Ping timeout: 480 seconds]
ximion has quit []
airlied has joined #freedesktop
<airlied> what is up with all the 503 ci-fairy errors in mesa CI?
damian has joined #freedesktop
<airlied> I've been trying to land that MR for a week and it has failed across at least 10 jobs every time
<airlied> in stuff unrelated to the cts being revved
Leopold_ has quit [Remote host closed the connection]
Leopold_ has joined #freedesktop
agd5f has joined #freedesktop
agd5f_ has quit [Ping timeout: 480 seconds]
itoral has joined #freedesktop
jarthur has quit [Quit: Textual IRC Client:]
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
<bentiss> airlied: I can see that the radosgw that handles the file uploads is getting OOMKilled at roughly the time of the job (last was at Tue, 17 Jan 2023 05:34:33 +0100)
<bentiss> the pod is on the node where the backup happens, and it has Memory limits that might be a little bit too low
<bentiss> I'll try to do something about it
<airlied> bentiss: yeah I lost a lot of jobs to 503 around that time, maybe it's just my timezone gets smashed due to the backups :-P
<bentiss> airlied: that is likely to happen, but still I do not see a big memory usage for this pod compared to the others, so getting OOMKilled is suspicious
<bentiss> yep, it definitely gets killed at 2GB of usage, which is a bit low given the size of the uploads
<bentiss> airlied: I have set the memory limits to 10 GB, we'll see if it behaves better
<daniels> bentiss, mupuf: hm, seems like this will be hitting us really soon
* bentiss looks
<bentiss> daniels: BTW I checked over the weekend that the registry gc was working. And it does :)
<bentiss> the only weird thing is that if a blob is already in the db, it can take up to 48h +- 3h to get cleaned up
<bentiss> but it eventually gets cleaned
danvet has joined #freedesktop
<daniels> bentiss: awesome!
<bentiss> daniels: still no luck at migrating existing repos to the new db though FWIW
<mupuf> daniels: thanks for the headsup!
<mupuf> bentiss: that is wonderful news! 48h is not an issue when we would probably want to keep images for 30 days or so
<bentiss> mupuf: yeah, as long as we do not keep images forever, even a week or a month would have been fine ;)
<mupuf> exactly :)
* mupuf is checking out if the new runner interface is compatible with automatic creation of runners
<mupuf> but I love the fact that runners would be associated to users!
<bentiss> daniels, mupuf: re gitlab-runner on rootless podman: the problem I have now is that I can't leverage kvm on the runners, which is problematic
<bentiss> I think I can do it on my local fedora, so I must be missing something
<mupuf> yeah, you probably are: is the podman user in the kvm group?
<bentiss> yes
<mupuf> and next: is the gitlab runner using --privileged? If not, did you add -v /dev/kvm:/dev/kvm?
<bentiss> I tried both IIRC and both were failing
<mupuf> weird...
<bentiss> yeah. Well I was doing the tests yesterday evening, so I must have missed something
<mupuf> the thing with fedora is that it was tweaked to work nicely in containers :s
<bentiss> heh, yeah :)
<mupuf> so, yeah for doing the work, but then we need to figure out what we need to do to make it work on other distros :D
<mupuf> at least, reading the dockerfile of the podman or fedora images helped me
<mupuf> (in the past)
<mupuf> bentiss: this may be of interest:
<bentiss> so, on fedora (as a user): "podman run --rm -ti --device /dev/kvm" works fine
* mupuf is checking it out on arch
<mupuf> any simple command I could run to check if kvm works?
<mupuf> or should I just check permissions in /dev?
<bentiss> mupuf: once in that container above, run /app/vmctl start
<bentiss> permissions of /dev/kvm are useless because they are mapped
<mupuf> yeah
<mupuf> seems to have worked
* bentiss did a chmod o+rw /dev/kvm on the equinix host
<mupuf> qemu did not complain about kvm, and ssh worked
<mupuf> bentiss: what's the current rights on /dev/kvm on the host?
<mupuf> crw-rw-rw- 1 root kvm 10, 232 Jan 17 10:26 /dev/kvm --> this is what I have on arch
<bentiss> before: crw-rw---- 1 root kvm 10, 232 Jan 17 07:54 /dev/kvm
<bentiss> and yes, changing the o+rw works out right
<bentiss> I guess I'll just add those permissions to /dev/kvm, and move on
<bentiss> the benefits of using rootless are much better than trying to fix that issue on podman
<mupuf> for sure!
<mupuf> thanks for working on this :)
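The /dev/kvm comparison above comes down to whether the "other" permission triplet grants read and write. A minimal Python sketch of that check follows; the function names are hypothetical, and the two mode strings are the ones quoted in the log.

```python
import os
import stat


def others_can_rw(mode_string: str) -> bool:
    """Return True if an `ls -l` mode string grants read+write to 'other'."""
    # A mode string such as "crw-rw-rw-" is one type character followed by
    # three rwx triplets (user, group, other).
    if len(mode_string) < 10:
        raise ValueError(f"not an ls -l mode string: {mode_string!r}")
    other = mode_string[7:10]
    return other[0] == "r" and other[1] == "w"


def device_others_can_rw(path: str = "/dev/kvm") -> bool:
    """Same check against a live device node, using its stat() mode bits."""
    mode = os.stat(path).st_mode
    wanted = stat.S_IROTH | stat.S_IWOTH
    return (mode & wanted) == wanted


if __name__ == "__main__":
    print(others_can_rw("crw-rw-rw-"))  # the arch listing quoted above
    print(others_can_rw("crw-rw----"))  # the equinix host before the chmod
```

This only inspects permission bits; it does not verify that a rootless container can actually open the device, which also depends on how the user and group IDs are mapped.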
<bentiss> hopefully this will reduce the costs (because I won't have to wipe the registry cache every week)
<mupuf> why is wiping the registry costing?
<mupuf> costing anything*
<mupuf> Do you have to create a new volume, then remove the old one?
<bentiss> I am wiping the registry *cache*
<bentiss> so every Monday, we rebuild it and pull data from GCS
<bentiss> and I can not switch to harbor caching right now because docker doesn't allow me to rewrite the mirror. I was overwriting the IP address of the registry to force the use of the cache :)
<bentiss> so if I switch to podman, I can use rootless containers AND harbor as a cache, which has smart gc and which will delete blobs unused in the past 7 days
<bentiss> so in theory, no more wipe of the cache
* bentiss needs to check if network-manager is happy with rootless podman (and we probably need it to not be happy)
<bentiss> looks like is using the kvm feature but is not tagged as such... It shouldn't be an issue but still
<bentiss> alright, so networkmanager is happy with rootless podman.
<bentiss> maybe I should test virgl
<bentiss> ERROR - dEQP error: libEGL warning: failed to open /dev/dri/card0: Permission denied -> yep, it's not happy :)
<bentiss> and 0: failed to open virtual socket device /dev/vhost-vsock
<bentiss> that's weird, if we give root access to people, they tend to use it :(
<mupuf> :D
feto_bastardo has quit [Quit: quit]
AbleBacon has quit [Quit: I am like MacArthur; I shall return.]
feto_bastardo has joined #freedesktop
mvlad has joined #freedesktop
<bentiss> sigh, I can't make virgl pass in rottless environment
<bentiss> rootless
<bentiss> and it's weird... I get a much longer time in the jobs when using podman
<bentiss> (even as root)
<bentiss> well, that was on the host
<bentiss> same magnitude
<mupuf> bentiss: there must be some acceleration missing :o
<bentiss> yeah, or I wonder if the container sees the real cpu
<mupuf> do you have load statistics on the host?
<mupuf> AFAIK, no namespaces allow you to hide that
<bentiss> mupuf: the machine is pretty idle
<bentiss> I mean outside of the 4 crosvm cores in use for the 2 jobs
<mupuf> can you tell if one of them is using more resources than the other?
<bentiss> htop shows that no cores are in use except the 4 I mentioned, so they have plenty of space
<mupuf> weird...
<bentiss> Load average: 2.39 on a 64 core machine
<bentiss> anyway, I'll check on it after lunch
<mupuf> bentiss: have a good one!
<bentiss> thanks
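The "pretty idle" reading of the load average quoted above (2.39 on 64 cores) can be made explicit by normalizing per core, which gives roughly 0.04 runnable tasks per core. A small Python sketch, with a hypothetical helper name:

```python
def load_per_core(load_average: float, cores: int) -> float:
    """Normalize a load average by the number of cores on the machine."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


if __name__ == "__main__":
    # Figures quoted in the log: load average 2.39 on a 64 core machine.
    print(f"{load_per_core(2.39, 64):.3f}")
```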
<daniels> that's odd, virgl is just using llvmpipe in the host (or certainly should be ...)
<daniels> but yeah, sadly the vsock stuff is needed to properly communicate with the guest, unless you have a better idea of how to do it?
Leopold_ has quit [Remote host closed the connection]
Leopold has joined #freedesktop
ppascher has quit [Quit: Gateway shutdown]
<bentiss> alright, testing the same job with root docker... and the results are similar. I wonder why this machine is slower than m3l-9
kxkamil has joined #freedesktop
<mupuf> bentiss, daniels: gitlab is returning 500
<bentiss> mupuf: link?
<mupuf> but indeed, some others are fine
<mupuf> and the registry also seems to be affected
<bentiss> so it might be a ceph issue
<bentiss> or... the db disk full
<bentiss> yeah, 98.26% full, let me bump it
<mupuf> well, seems to have instantly helped :)
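A full database disk like the one above can be caught earlier with a simple usage check. The following Python sketch uses shutil.disk_usage from the standard library; the 95% threshold and the function names are assumptions for illustration, not part of the actual fd.o monitoring.

```python
import shutil


def percent_used(used: int, total: int) -> float:
    """Return the used share of a volume as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def is_nearly_full(path: str, threshold_percent: float = 95.0) -> bool:
    """Report whether the filesystem holding `path` is above the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_used(usage.used, usage.total) >= threshold_percent


if __name__ == "__main__":
    # Check the filesystem holding the current directory.
    print(is_nearly_full("."))
```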
___nick___ has joined #freedesktop
<bentiss> indeed :)
<mupuf> thanks a lot!
<bentiss> no worries :)
<bentiss> so... I reverted all of the newly installed packages, rebooting m3l-10, and I'll see if there is any improvement
itoral has quit [Remote host closed the connection]
<bentiss> and if it doesn't work I guess I'll just throw away m3l-10 and create m3l-11
<bentiss> alright, deploying m3l-11 because 10 seems to take twice as long as the others
___nick___ has quit []
agd5f_ has joined #freedesktop
___nick___ has joined #freedesktop
___nick___ has quit []
___nick___ has joined #freedesktop
agd5f has quit [Ping timeout: 480 seconds]
agd5f has joined #freedesktop
agd5f_ has quit [Ping timeout: 480 seconds]
<bentiss> brand new runner, and not necessarily faster... the only difference now is the kernel and gitlab-runner
<mupuf> bentiss: :s
<mupuf> Not sure if I would prefer it being a gitlab runner or kernel issue...
<mupuf> well, I would prefer gitlab runner, easier to downgrade and maintain
<bentiss> well, I don't see how gitlab-runner could be involved, unless they changed how they spin up their containers
<bentiss> actually, one thing I haven't checked is the virglrenderer script, maybe they throttle it for non-MR
<daniels> bentiss: we don't throttle based on MRs, no
<daniels> bentiss: we do throttle on $FDO_CI_CONCURRENT tho
<daniels> well, not throttle, but
<bentiss> shit, how to spend a day: FDO_CI_CONCURRENT=1 on my test runner :(
<bentiss> how to *lose*
<daniels> oh no ...
<bentiss> daniels: do I keep m3l-10 or 11?
<daniels> no coffee this morning?
<daniels> bentiss: if 11 works then keep 11
* bentiss doesn't drink coffee
<bentiss> daniels: OK
<bentiss> FWIW, I took the gitlab-runner config from 10 (after stopping the gitlab-runner service) so all current runners are still there (placeholder and whatnot)
<daniels> ooh, nice
<bentiss> I don't understand why the gitlab-runner starts a cerbero job that is not in the list of running jobs and the global runner is disabled for this machine :(
<bentiss> anyway... let's try with podman (as root) and FDO_CI_CONCURRENT set to 8 now
<bentiss> and it does accelerate the whole thing, finally, thanks daniels for pointing this one out, I read the config file too many times thinking they are identical
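The slowdown above came from FDO_CI_CONCURRENT=1 on the test runner. As a rough illustration of why that matters, here is a Python sketch of a job script deriving its parallelism from that variable. This is an assumption about the general pattern, not the actual Mesa or virglrenderer CI code, and ci_concurrency is a hypothetical helper.

```python
import os


def ci_concurrency(environ=None, default: int = 1) -> int:
    """Read the parallelism level from FDO_CI_CONCURRENT, falling back to
    `default` when the variable is unset, non-numeric, or not positive."""
    env = os.environ if environ is None else environ
    raw = env.get("FDO_CI_CONCURRENT", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


if __name__ == "__main__":
    # With the variable set to 1, a test suite would run on a single worker.
    print(ci_concurrency({"FDO_CI_CONCURRENT": "1"}))
    print(ci_concurrency({"FDO_CI_CONCURRENT": "8"}))
```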
<bentiss> daniels: so current plan: use podman as root for all gitlab runners, and force the use of harbor.fd.o as the registry cache (it might also solve the need for registry-direct.fd.o)
<bentiss> daniels: then continue the tests for rootless podman
<daniels> bentiss: sounds like a great plan :)
<bentiss> daniels: and kill with fire registry-mirror too :)
<daniels> \o/
<bentiss> mupuf: ^^ we need an update on vm2c for that btw, we need s/\/cache//'
<mupuf> bentiss: ... I feel for you :s
<bentiss> well, that's only on me that one :(
<mupuf> I can work on the vm2c update!
<bentiss> thanks
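The s/\/cache// substitution mentioned above removes the first "/cache" path component from a string. A Python equivalent is sketched below; the registry path in the example is hypothetical, since the log does not show the real value vm2c rewrites.

```python
import re


def strip_cache_component(url: str) -> str:
    """Equivalent of sed's s/\\/cache//: drop the first "/cache" only."""
    return re.sub(r"/cache", "", url, count=1)


if __name__ == "__main__":
    # Hypothetical image path, used only to show the effect.
    print(strip_cache_component("registry.example/cache/project/image:tag"))
```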
<bentiss> daniels: can you explain what is the purpose of placeholder-equinix?
<bentiss> because I probably killed 2 gstreamer pipelines by killing jobs that were triggered on this runner ;)
<daniels> bentiss: we can probably start getting rid of that now - we used it when we didn't have the ability to trigger dependent pipelines, so when e.g. virglrenderer wants to trigger a mesa CI pipeline, it can do that without occupying a full slot - same for gst which kicks off dependent builds via cerbero, polls on the status, then reports back to the parent
<bentiss> oh, OK
<daniels> (there are also some scheduled jobs for mesa etc which just run some pretty trivial python processing to shove data into Grafana and shouldn't occupy a full slot)
<daniels> basically, stuff that can run in the background with zero effect
<bentiss> k, thanks
<__tim> ah ok, so that's what happened to those jobs
<__tim> we were a bit puzzled :)
<bentiss> __tim: sorry, I had to kill gitlab-runner, and this was blocking me from doing it
<bentiss> __tim: I just changed the config on that machine and should not have to re-change it in the short term, so you might want to restart it
<__tim> ok, thanks
<bentiss> well, I haven't pruned the disk, but I guess I can live with the symbolic link
<__tim> on a side note, what's puzzling me is that this job seems to *always* download the image again, every time, e.g.
<bentiss> weird indeed
<bentiss> so I botched the podman config, placeholder-equinix might not be in a good shape
<__tim> no, doesn't look like it 😆
<bentiss> __tim: it does :)
<bentiss> rebooting m3l-11 because it ended up in a bad shape
ximion has joined #freedesktop
<robclark> hmm, CI doesn't seem to be super happy right now.. getting a bunch of 500's like
<bentiss> robclark: is it just for windows jobs?
<daniels> bentiss: hmmmm, something weird with -11
<bentiss> daniels: might be the registries.conf file
<alanc> I'm getting failures from Linux jobs too:
<alanc> ERROR: Job failed (system failure): Error response from daemon: container create: unable to find '' in local storage: no such image (docker.go:534:0s)
<bentiss> alanc: yes, I have seen that. I have disabled m3l-11 for now
<bentiss> and checking on m3l-12 ATM
<alanc> ah, my pipeline just passed, thanks!
<bentiss> alanc: it went on a different runner :)
<alanc> nothing like a little unexpected failure to add to the stress of pushing out a security fix release/advisory 8-)
<bentiss> sorry :(
<alanc> you fixed it, nothing to apologize for there
<bentiss> daniels: I might as well nuke m3l-11 and start from scratch again like I am doing with the rest of the flock
<robclark> bentiss: it was a bunch of random jobs.. but I _think_ it has recovered
<daniels> bentiss: sounds good, thanks :)
<bentiss> daniels: more like I'm going to move the placeholder-equinix to m3l-12, and then nuke m3l-11 and we'll monitor how things are going before migrating 8 and 9
<daniels> ++
<bentiss> it came way faster than expected
<daniels> :(
<daniels> the pull succeeds, but then it can't find the thing it just pulled ...
<bentiss> I was trying to use podman from debian:stable, maybe I should use podman 4
<bentiss> and FTR I paused m3l-12
<daniels> thank you
<bentiss> upgraded m3l-11 to podman 4.3 and resumed gitlab-runner
<bentiss> daniels: I have nuked m3l-12, I'll rebuild it tomorrow if m3l-11 is fine
<daniels> thanks! I'll try to keep an eye on jobs this evening
<bentiss> sigh, I wish we had a way to distinguish real code failure vs podman error in the gitlab UI
<mupuf> Yeah, it would be nice to have an infrafail status
<daniels> it does get recorded, just not shown in the UI
jarthur has joined #freedesktop
<bentiss> alright, seems to be fine for the past hour. Though -> igt-gpu-tools can't rebuild new images
<bentiss> we need to try ci-templates
<daniels> hmmmmm, I think dbus also uses docker directly rather than ci-templates
<daniels> 'hey, we asked nicely two years ago, but now this is going to stop working soon' would be good motivation tho :)
<bentiss> heh
<bentiss> looks like ci-templates still works
<bentiss> daniels: actually it was 3 years ago IIRC, I had 2 xdc presentations about it and I wasn't there this year
<daniels> ci-templates \o/
ElectricJozin has joined #freedesktop
<ElectricJozin> How can I view the session bus with busctl?
ElectricJozin has quit []
ElectricJozin has joined #freedesktop
ElectricJozin has quit [Remote host closed the connection]
ElectricJozin has joined #freedesktop
ElectricJozin has quit []
ElectricJozin has joined #freedesktop
<ElectricJozin> How can I view the session bus with busctl?
AbleBacon has joined #freedesktop
ElectricJozin has quit [Quit: Leaving]
<mupuf> The problem with ci templates and IGT is that it isn't solving anything for it. IGT wants to build a container per push for simplicity and reproducibility. Now that harbor is in place, isn't it a perfectly acceptable solution?
<mupuf> Just to be clear, every push generates one new layer on top of the base container
<mupuf> And the base container is seldom regenerated
<mupuf> So each push generates something like 5MB stored in a registry rather than in gitlab's artifact
<mupuf> What I think IGT should do is use buildah rather than docker
<mupuf> As for the base container, they can be generated using ci templates, or buildah. It shouldn't matter
<mupuf> I can take a crack at it tomorrow, if you want
<DavidHeidelberg[m]> How is the LAVA tag passed as a tag to GitLab CI?
<DavidHeidelberg[m]> exactly, the "devicetype" is passed as a "tag" in GitLab CI
<daniels> DavidHeidelberg[m]: yeah, each LAVA device tag has a matching GitLab CI tag
<DavidHeidelberg[m]> but how does it get propagated? I don't seem to see the new device
<daniels> ah, yes
<daniels> I'll set that up in the morning - it's not automatic, you have to log in to brea and register a new runner type
<DavidHeidelberg[m]> thanks or if you want to, I can do it (if you can give me some rights for this task)
<DavidHeidelberg[m]> (ofc tomorrow)
Haaninjo has joined #freedesktop
<daniels> yeah, I can walk you through it tomorrow
ybogdano has joined #freedesktop
<jenatali> - looks like Marge picked the wrong pipeline?
<jenatali> When that happens, is there harm in hitting the manual merge button?
<anholt> she's still on that MR, and it looks good to me. ack.
<jenatali> Good enough for me
<jenatali> Which, harkening back to the convo about permissions for main, is a good reason to keep allowing devs access, until that bug is fixed
___nick___ has quit [Ping timeout: 480 seconds]
mvlad has quit [Remote host closed the connection]
ybogdano has quit [Ping timeout: 480 seconds]
<DavidHeidelberg[m]> Marge bot looks dead, anyone seeing her working in any MR?
<jenatali> David Heidelberg: Apparently it still waited for the timeout on mine, even after I merged it because she didn't see the pipeline making progress
<DavidHeidelberg[m]> she seems to be stuck here: ?
<DavidHeidelberg[m]> ...and it's working. Magic.
<jenatali> David Heidelberg: I just got a notification that she "gave up" on !20617
<DavidHeidelberg[m]> I think Marge deserves some investment (in terms of fixing her code a bit)
<jenatali> Yeah that pipeline race is nasty
ybogdano has joined #freedesktop
danvet has quit [Ping timeout: 480 seconds]
ybogdano has quit [Ping timeout: 480 seconds]
ybogdano has joined #freedesktop
karolherbst has quit [Remote host closed the connection]
karolherbst has joined #freedesktop
ybogdano has quit [Ping timeout: 480 seconds]
ybogdano has joined #freedesktop