#freedesktop on 2023-01-17 — irc logs at oftc.irclog.whitequark.org

2022-12-21 00:45 ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org

00:18 danvet has quit [Ping timeout: 480 seconds]

00:28 Haaninjo has quit [Quit: Ex-Chat]

01:40 agd5f_ has joined #freedesktop

01:46 agd5f has quit [Ping timeout: 480 seconds]

02:48 ximion has quit []

04:44 airlied has joined #freedesktop

04:44 <airlied> what is up with all the 503 ci-fairy errors in mesa CI?

04:45 <airlied> https://gitlab.freedesktop.org/mesa/mesa/-/jobs/34843794

04:45 damian has joined #freedesktop

04:47 <airlied> I've been trying to land that MR for a week and it has failed at least across 10 jobs every time

04:47 <airlied> in stuff unrelated to the cts being revved

05:01 Leopold_ has quit [Remote host closed the connection]

05:07 Leopold_ has joined #freedesktop

05:09 agd5f has joined #freedesktop

05:15 agd5f_ has quit [Ping timeout: 480 seconds]

06:27 itoral has joined #freedesktop

07:00 jarthur has quit [Quit: Textual IRC Client: www.textualapp.com]

07:30 alanc has quit [Remote host closed the connection]

07:31 alanc has joined #freedesktop

07:41 <bentiss> airlied: I can see that the radosgw that handles the file uploads is getting OOMKilled at roughly the time of the job (last was at Tue, 17 Jan 2023 05:34:33 +0100)

07:42 <bentiss> the pod is on the node where the backup happens, and it has Memory limits that might be a little bit too low

07:42 <bentiss> I'll try to do something about it

07:45 <airlied> bentiss: yeah I lost a lot of jobs to 503 around that time, maybe it's just my timezone gets smashed due to the backups :-P

07:46 <bentiss> airlied: that is likely to happen, but still I do not see a big memory usage for this pod compared to the others, so getting OOMKilled is suspicious

07:47 <bentiss> yep, it definitely gets killed at 2GB of usage, which is a bit low given the size of the uploads

08:04 <bentiss> airlied: I have set the memory limits to 10 GB, we'll see if it behaves better

08:07 <daniels> bentiss, mupuf: hm, seems like this will be hitting us really soon https://docs.gitlab.com/ee/architecture/blueprints/runner_tokens/

08:08 * bentiss looks

08:08 <bentiss> daniels: BTW I checked over the weekend that the registry gc was working. And it does :)

08:09 <bentiss> the only weird thing is that if a blob is already in the db, it can take up to 48h +- 3h to get cleaned up

08:09 <bentiss> but it eventually gets cleaned

08:10 danvet has joined #freedesktop

08:12 <daniels> bentiss: awesome!

08:13 <bentiss> daniels: still no luck at migrating existing repos to the new db though FWIW

08:13 <mupuf> daniels: thanks for the headsup!

08:14 <mupuf> bentiss: that is wonderful news! 48h is not an issue when we would probably want to keep images for 30 days or so

08:14 <bentiss> mupuf: yeah, as long as we do not keep images forever, even a week or a month would have been fine ;)

08:15 <mupuf> exactly :)

08:15 * mupuf is checking out if the new runner interface is compatible with automatic creation of runners

08:15 <mupuf> but I love the fact that runners would be associated to users!

08:17 <bentiss> daniels, mupuf: re gitlab-runner on rootless podman: the problem I have now is that I can't leverage kvm on the runners, which is problematic

08:17 <bentiss> I think I can do it on my local fedora, so I must be missing something

08:17 <mupuf> yeah, you probably are: is the podman user in the kvm group?

08:18 <bentiss> yes

08:18 <mupuf> and next: is the gitlab runner using --privileged? If not, did you add -v /dev/kvm:/dev/kvm?

08:18 <bentiss> I tried both IIRC and both were failing

08:20 <mupuf> weird...

08:20 <bentiss> yeah. Well I was doing the tests yesterday evening, so I must have missed something

08:20 <mupuf> the thing with fedora is that it was tweaked to work nicely in containers :s

08:21 <bentiss> heh, yeah :)

08:21 <mupuf> so, yeah for doing the work, but then we need to figure out what we need to do to make it work on other distros :D

08:21 <mupuf> at least, reading the dockerfile of the podman or fedora images helped me

08:21 <mupuf> (in the past)

08:22 <mupuf> bentiss: this may be of interest: https://github.com/coreos/coreos-assembler/issues/2501

08:23 <bentiss> so, on fedora (as a user): "podman run --rm -ti --device /dev/kvm registry.freedesktop.org/freedesktop/ci-templates/qemu-build-base:2022-08-22.0" works fine

08:23 * mupuf is checking it out on arch

08:24 <mupuf> any simple command I could run to check if kvm works?

08:24 <mupuf> or should I just check permissions in /dev?

08:24 <bentiss> mupuf: once in that container above, run /app/vmctl start

08:24 <bentiss> permissions of /dev/kvm are useless because they are mapped

08:25 <mupuf> yeah

08:25 <mupuf> seems to have worked

08:26 * bentiss did a chmod o+rw /dev/kvm on the equinix host

08:26 <mupuf> qemu did not complain about kvm, and ssh worked

08:27 <mupuf> bentiss: what's the current rights on /dev/kvm on the host?

08:27 <mupuf> crw-rw-rw- 1 root kvm 10, 232 Jan 17 10:26 /dev/kvm --> this is what I have on arch

08:27 <bentiss> before: crw-rw---- 1 root kvm 10, 232 Jan 17 07:54 /dev/kvm

08:27 <bentiss> and yes, changing the o+rw works out right

08:28 <bentiss> I guess I'll just add that permission to /dev/kvm, and move on

08:29 <bentiss> the benefits of using rootless are much better than trying to fix that issue on podman
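
The permission change discussed above can be illustrated with a minimal Python sketch. It only compares the two mode strings quoted in the log (crw-rw---- before, crw-rw-rw- after the chmod o+rw) and checks whether "other" users get read and write access. The helper name others_can_rw is hypothetical and not part of any tool mentioned in the channel.

```python
import os
import stat


def others_can_rw(mode: int) -> bool:
    """Return True if the 'other' permission bits include both read and write."""
    wanted = stat.S_IROTH | stat.S_IWOTH
    return (mode & wanted) == wanted


# The two modes quoted in the log, both for a character device.
before = stat.S_IFCHR | 0o660  # crw-rw---- (host before the chmod)
after = stat.S_IFCHR | 0o666   # crw-rw-rw- (after chmod o+rw, same as on arch)

print(stat.filemode(before), others_can_rw(before))
print(stat.filemode(after), others_can_rw(after))

# On a machine that actually has the device, the same check can be run live.
if os.path.exists("/dev/kvm"):
    print("/dev/kvm:", others_can_rw(os.stat("/dev/kvm").st_mode))
```

This is only a way to read the ls output above; it does not explain why group membership alone was not enough in the rootless case.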

08:29 <mupuf> for sure!

08:30 <mupuf> thanks for working on this :)

08:32 <bentiss> hopefully this will reduce the costs (because I won't have to wipe the registry cache every week)

08:36 <mupuf> why is wiping the registry costing?

08:36 <mupuf> costing anything*

08:36 <mupuf> Do you have to create a new volume, then remove the old one?

08:36 <bentiss> I am wiping the registry *cache*

08:37 <bentiss> so every Monday, we rebuild it and pull data from GCS

08:38 <bentiss> and I can not switch to harbor caching right now because docker doesn't allow me to rewrite the mirror. I was overwriting the IP address of the registry to force use the cache :)

08:38 <bentiss> so if I switch to podman, I can use rootless containers AND harbor as a cache, which has smart gc and which will delete blobs unused in the past 7 days

08:39 <bentiss> so in theory, no more wipe of the cache

08:39 * bentiss needs to check if network-manager is happy with rootless podman (and we probably need it to not be happy)

09:04 <bentiss> looks like https://gitlab.freedesktop.org/virgl/virglrenderer/-/jobs/34849418 is using the kvm feature but is not tagged as such... It shouldn't be an issue but still

09:10 <bentiss> alright, so networkmanager is happy with rootless podman.

09:10 <bentiss> maybe I should test virgl

09:32 <bentiss> ERROR - dEQP error: libEGL warning: failed to open /dev/dri/card0: Permission denied -> yep, it's not happy :)

09:32 <bentiss> and 0: failed to open virtual socket device /dev/vhost-vsock

09:32 <bentiss> that's weird, if we give root access to people, they tend to use it :(

09:45 <mupuf> :D

10:15 feto_bastardo has quit [Quit: quit]

10:33 AbleBacon has quit [Quit: I am like MacArthur; I shall return.]

10:40 feto_bastardo has joined #freedesktop

10:52 mvlad has joined #freedesktop

11:03 <bentiss> sigh, I can't make virgl pass in rottless environment

11:03 <bentiss> rootless

11:05 <bentiss> and it's weird... I get a much longer time in the jobs when using podman

11:05 <bentiss> (even as root)

11:06 <bentiss> for example: https://gitlab.freedesktop.org/bentiss/virglrenderer/-/jobs/34852950 -> 6467 in ~4 min, when https://gitlab.freedesktop.org/virgl/virglrenderer/-/jobs/34849407 does 61037 (~10x faster)

11:06 <bentiss> well, that was on the host

11:07 <bentiss> still 69470 for https://gitlab.freedesktop.org/virgl/virglrenderer/-/jobs/34849412

11:07 <bentiss> same magnitude

11:10 <mupuf> bentiss: there must be some acceleration missing :o

11:10 <bentiss> yeah, or I wonder if the container sees the real cpu

11:10 <mupuf> do you have load statistics on the host?

11:11 <mupuf> AFAIK, no namespaces allow you to hide that

11:11 <bentiss> mupuf: the machine is pretty idle

11:11 <bentiss> I mean outside of the 4 crosvm cores in use for the 2 jobs

11:12 <mupuf> can you tell if one of them is using more resources than the other?

11:12 <bentiss> htop shows that all cores are not used but the 4 ones I mentioned so they have plenty of space

11:13 <mupuf> weird...

11:13 <bentiss> Load average: 2.39 on a 64 core machine

11:13 <bentiss> anyway, I'll check on it after lunch

11:13 <mupuf> bentiss: have a good one!

11:14 <bentiss> thanks

11:18 <daniels> that's odd, virgl is just using llvmpipe in the host (or certainly should be ...)

11:18 <daniels> but yeah, sadly the vsock stuff is needed to properly communicate with the guest, unless you have a better idea of how to do it?

11:32 Leopold_ has quit [Remote host closed the connection]

11:33 Leopold has joined #freedesktop

12:23 ppascher has quit [Quit: Gateway shutdown]

12:37 <bentiss> alright, testing the same job with root docker... and the results are similar. I wonder why this machine is slower than m3l-9

12:40 kxkamil has joined #freedesktop

13:12 <mupuf> bentiss, daniels: gitlab is returning 500

13:14 <bentiss> mupuf: link?

13:14 <mupuf> https://gitlab.freedesktop.org/mupuf/dxvk-ci/-/pipelines/785406, https://gitlab.freedesktop.org/api/v4/projects/8765/packages/generic/release/v0.9.8/linux-x86_64

13:14 <mupuf> but indeed, some others are fine

13:16 <mupuf> and the registry also seems to be affected

13:16 <bentiss> so it might be a ceph issue

13:17 <bentiss> or... the db disk full

13:18 <bentiss> yeah, 98.26% full, let me bump it

13:19 <mupuf> well, seems to have instantly helped :)

13:19 ___nick___ has joined #freedesktop

13:19 <bentiss> indeed :)

13:21 <mupuf> thanks a lot!

13:23 <bentiss> no worries :)

13:32 <bentiss> so... I reverted all of the newly installed packages, rebooting m3l-10, and I'll see if there is any improvement

13:40 itoral has quit [Remote host closed the connection]

13:40 <bentiss> and if it doesn't work I guess I'll just throw away m3l-10 and create m3l-11

13:59 <bentiss> alright, deploying m3l-11 because 10 seems to take twice as long as the other

14:02 ___nick___ has quit []

14:02 agd5f_ has joined #freedesktop

14:04 ___nick___ has joined #freedesktop

14:04 ___nick___ has quit []

14:06 ___nick___ has joined #freedesktop

14:08 agd5f has quit [Ping timeout: 480 seconds]

14:08 agd5f has joined #freedesktop

14:12 agd5f_ has quit [Ping timeout: 480 seconds]

14:21 <bentiss> brand new runner, and not necessarily faster... the only difference now is the kernel and gitlab-runner

14:22 <mupuf> bentiss: :s

14:22 <mupuf> Not sure if I would prefer it being a gitlab runner or kernel issue...

14:22 <mupuf> well, I would prefer gitlab runner, easier to downgrade and maintain

14:23 <bentiss> well, I don't see how gitlab-runner could be involved, unless they changed how they spin up their containers

14:24 <bentiss> actually, one thing I haven't checked is the virglrenderer script, maybe they throttle it for non-MR pipelines

14:41 <daniels> bentiss: we don't throttle based on MRs, no

14:41 <daniels> bentiss: we do throttle on $FDO_CI_CONCURRENT tho

14:41 <daniels> well, not throttle, but

14:42 <bentiss> shit, how to spend a day: FDO_CI_CONCURRENT=1 on my test runner :(

14:42 <bentiss> how to *lose*

14:42 <daniels> oh no ...

14:42 <bentiss> daniels: do I keep m3l-10 or 11?

14:42 <daniels> no coffee this morning?

14:43 <daniels> bentiss: if 11 works then keep 11

14:43 * bentiss doesn't drink coffee

14:43 <bentiss> daniels: OK

14:44 <bentiss> FWIW, I took the gitlab-runner config from 10 (after stopping the gitlab-runner service) so all current runners are still there (placeholder and whatnot)

14:46 <daniels> ooh, nice

14:52 <bentiss> I don't understand why the gitlab-runner starts a cerbero job that is not in the list of running jobs and the global runner is disabled for this machine :(

14:53 <bentiss> anyway... let's try with podman (as root) and FDO_CI_CONCURRENT set to 8 now

14:56 <bentiss> and it does accelerate the whole thing, finally, thanks daniels for pointing this one out, I read the config file too many times thinking they are identical
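
As a rough illustration of why FDO_CI_CONCURRENT=1 made the test runner look about 10x slower, here is a minimal Python sketch of how a job script might derive its worker count from that variable. This is an assumption for illustration only, not the actual virglrenderer or mesa script: the helper name job_parallelism and the fallback of 1 when the variable is missing or invalid are both hypothetical.

```python
import os


def job_parallelism(env=None, default=1):
    """Pick a worker count from FDO_CI_CONCURRENT (hypothetical helper).

    Falls back to `default` when the variable is unset or not an integer,
    and never returns less than 1.
    """
    env = os.environ if env is None else env
    raw = env.get("FDO_CI_CONCURRENT", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return max(1, value)


# The two runner settings mentioned in the log.
print(job_parallelism({"FDO_CI_CONCURRENT": "1"}))
print(job_parallelism({"FDO_CI_CONCURRENT": "8"}))
```

With a script shaped like this, two otherwise identical runners would differ in throughput purely because of the runner-side environment, which matches the symptom described above.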

14:57 <bentiss> daniels: so current plan: use podman as root for all gitlab runners, and force the use of harbor.fd.o as the registry cache (it might also solve the need for registry-direct.fd.o)

14:57 <bentiss> daniels: then continue the tests for rootless podman

14:59 <daniels> bentiss: sounds like a great plan :)

14:59 <bentiss> daniels: and kill with fire registry-mirror too :)

15:00 <daniels> \o/

15:00 <bentiss> mupuf: ^^ we need an update on vm2c for that btw, we need s/registry-mirror.freedesktop.org/harbor.freedesktop.org\/cache//'
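
The substitution requested here can be sketched in Python as a prefix rewrite on image references. The trailing delimiters of the quoted sed expression are ambiguous, so this sketch assumes the intent is to replace the host registry-mirror.freedesktop.org with harbor.freedesktop.org/cache and leave the rest of the reference untouched. The function name and the example image path are hypothetical, and this is not the vm2c code.

```python
OLD_PREFIX = "registry-mirror.freedesktop.org"
NEW_PREFIX = "harbor.freedesktop.org/cache"


def rewrite_image(ref: str) -> str:
    """Point an image reference at the harbor cache instead of the old mirror.

    Only references whose registry host is exactly OLD_PREFIX are rewritten;
    anything else is returned unchanged.
    """
    if ref == OLD_PREFIX or ref.startswith(OLD_PREFIX + "/"):
        return NEW_PREFIX + ref[len(OLD_PREFIX):]
    return ref


# Hypothetical image path, used only to show the shape of the rewrite.
print(rewrite_image("registry-mirror.freedesktop.org/some/project/image:tag"))
```

Matching on the host followed by a slash avoids rewriting unrelated hosts that merely start with the same string.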

15:01 <mupuf> bentiss: ... I feel for you :s

15:02 <bentiss> well, that's only on me that one :(

15:02 <mupuf> I can work on the vm2c update!

15:02 <bentiss> thanks

15:20 <bentiss> daniels: can you explain what is the purpose of placeholder-equinix?

15:21 <bentiss> because I probably killed 2 gstreamer pipelines by killing jobs that were triggered on this runner ;)

15:22 <daniels> bentiss: we can probably start getting rid of that now - we used it when we didn't have the ability to trigger dependent pipelines, so when e.g. virglrenderer wants to trigger a mesa CI pipeline, it can do that without occupying a full slot - same for gst which kicks off dependent builds via cerbero, polls on the status, then reports back to the parent

15:22 <bentiss> oh, OK

15:22 <daniels> (there are also some scheduled jobs for mesa etc which just run some pretty trivial python processing to shove data into Grafana and shouldn't occupy a full slot)

15:23 <daniels> basically, stuff that can run in the background with zero effect

15:23 <bentiss> k, thanks

15:24 <__tim> ah ok, so that's what happened to those jobs

15:24 <__tim> we were a bit puzzled :)

15:24 <bentiss> __tim: sorry, I had to kill gitlab-runner, and this was blocking me to do it

15:26 <bentiss> __tim: I just changed the config on that machine and should not have to re-change it in the short term, so you might want to restart it

15:27 <__tim> ok, thanks

15:28 <bentiss> well, I haven't pruned the disk, but I guess I can live with the symbolic link

15:28 <__tim> on a side note, what's puzzling me is that this job seems to *always* download the image again, every time, e.g. https://gitlab.freedesktop.org/thiblahute/gstreamer/-/jobs/34868835

15:31 <bentiss> weird indeed

15:32 <bentiss> so I botched the podman config, placeholder-equinix might not be in a good shape

15:32 <__tim> no, doesn't look like it 😆

15:33 <bentiss> __tim: it does :)

15:46 <bentiss> rebooting m3l-11 because it ended up in a bad shape

15:51 ximion has joined #freedesktop

15:57 <robclark> hmm, CI doesn't seem to be super happy right now.. getting a bunch of 500's like https://gitlab.freedesktop.org/mesa/mesa/-/jobs/34869791

16:13 <bentiss> robclark: is it just for windows jobs?

16:21 <daniels> bentiss: hmmmm, something weird with -11 https://gitlab.freedesktop.org/pH5/weston/-/jobs/34874535

16:22 <bentiss> daniels: might be the registries.conf file

16:24 <alanc> I'm getting failures from Linux jobs too: https://gitlab.freedesktop.org/alanc/libxpm/-/jobs/34874938

16:28 <alanc> ERROR: Job failed (system failure): Error response from daemon: container create: unable to find 'docker.io/library/sha256:1ce0b2a729de804418dc7e2ac5f894e099b056c77ddd054a1ea0cf20cb685b00' in local storage: no such image (docker.go:534:0s)

16:28 <bentiss> alanc: yes, I have seen that. I have disabled m3l-11 for now

16:29 <bentiss> and checking on m3l-12 ATM

16:29 <alanc> ah, my pipeline just passed, thanks!

16:29 <bentiss> alanc: it went on a different runner :)

16:30 <alanc> nothing like a little unexpected failure to add to the stress of pushing out a security fix release/advisory 8-)

16:30 <bentiss> sorry :(

16:30 <alanc> you fixed it, nothing to apologize for there

16:35 <bentiss> daniels: I might as well nuke m3l-11 and start from scratch again like I am doing with the rest of the flock

16:35 <robclark> bentiss: it was a bunch of random jobs.. but I _think_ it has recovered

16:35 <daniels> bentiss: sounds good, thanks :)

16:37 <bentiss> daniels: more like I'm going to move the placeholder-equinix to m3l-12, and then nuke m3l-11 and we'll monitor how things are going before migrating 8 and 9

16:37 <daniels> ++

16:38 <bentiss> damn https://gitlab.freedesktop.org/wayland/weston/-/jobs/34877638 and https://gitlab.freedesktop.org/mesa/mesa/-/jobs/34876514

16:38 <bentiss> it came way faster than expected

16:39 <daniels> :(

16:39 <daniels> the pull succeeds, but then it can't find the thing it just pulled ...

16:39 <bentiss> I was trying to use podman from debian:stable, maybe I should use podman 4

16:40 <bentiss> and FTR I paused m3l-12

16:40 <daniels> thankyou

16:51 <bentiss> upgraded m3l-11 to podman 4.3 and resumed gitlab-runner

16:53 <bentiss> daniels: I have nuked m3l-12, I'll rebuild it tomorrow if m3l-11 is fine

16:56 <daniels> thanks! I'll try to keep an eye on jobs this evening

16:58 <bentiss> sigh, I wish we had a way to distinguish real code failure vs podman error in the gitlab UI

17:05 <mupuf> Yeah, it would be nice to have an infrafail status

17:05 <daniels> it does get recorded, just not shown in the UI
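
One way to approximate an "infrafail" status outside the UI is to look at the job log itself. The sketch below, in Python, flags a log as an infrastructure failure when it contains the "Job failed (system failure)" marker that appears in the error alanc pasted earlier. The function name is hypothetical, the second sample line is an assumed example of an ordinary script failure, and this does not use whatever field GitLab records internally.

```python
SYSTEM_FAILURE_MARKER = "Job failed (system failure)"


def is_infra_failure(log_text: str) -> bool:
    """Return True if any line of the job log carries the system-failure marker."""
    return any(SYSTEM_FAILURE_MARKER in line for line in log_text.splitlines())


# First sample is the error quoted in the channel at 16:28.
infra_log = (
    "ERROR: Job failed (system failure): Error response from daemon: "
    "container create: unable to find "
    "'docker.io/library/sha256:1ce0b2a729de804418dc7e2ac5f894e099b056c77ddd054a1ea0cf20cb685b00' "
    "in local storage: no such image (docker.go:534:0s)"
)
# Second sample is an assumed shape for a normal script failure.
code_log = "ERROR: Job failed: exit code 1"

print(is_infra_failure(infra_log))
print(is_infra_failure(code_log))
```

A substring check is used rather than a prefix check because job logs may carry terminal colour codes in front of the message.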

17:16 jarthur has joined #freedesktop

17:56 <bentiss> alright, seems to be fine for the past hour. Though https://gitlab.freedesktop.org/mwa/igt-gpu-tools/-/jobs/34883475 -> igt-gpu-tools can't rebuild new images

17:56 <bentiss> we need to try ci-templates

18:00 <daniels> hmmmmm, I think dbus also uses docker directly rather than ci-templates

18:00 <daniels> 'hey, we asked nicely two years ago, but now this is going to stop working soon' would be good motivation tho :)

18:02 <bentiss> heh

18:02 <bentiss> looks like ci-templates still works

18:02 <bentiss> https://gitlab.freedesktop.org/bentiss/ci-templates/-/jobs/34884098

18:04 <bentiss> daniels: actually it was 3 years ago IIRC, I had 2 xdc presentations about it and I wasn't there this year

18:12 <daniels> ci-templates \o/

18:33 ElectricJozin has joined #freedesktop

18:35 <ElectricJozin> How can I view the session bus with busctl?

18:35 ElectricJozin has quit []

18:36 ElectricJozin has joined #freedesktop

18:36 ElectricJozin has quit [Remote host closed the connection]

18:36 ElectricJozin has joined #freedesktop

18:37 ElectricJozin has quit []

18:38 ElectricJozin has joined #freedesktop

18:38 <ElectricJozin> How can I view the session bus with busctl?

18:46 AbleBacon has joined #freedesktop

18:48 ElectricJozin has quit [Quit: Leaving]

19:21 <mupuf> The problem with ci templates and IGT is that it isn't solving anything for it. IGT wants to build a container per push for simplicity and reproducibility. Now that harbor is in place, isn't it a perfectly acceptable solution?

19:22 <mupuf> Just to be clear, every push generates one new layer on top of the base container

19:22 <mupuf> And the base container is seldom regenerated

19:23 <mupuf> So each push generates something like 5MB stored in a registry rather than in gitlab's artifact

19:24 <mupuf> What I think IGT should do is use buildah rather than docker

19:25 <mupuf> As for the base container, they can be generated using ci templates, or buildah. It shouldn't matter

19:27 <mupuf> I can take a crack at it tomorrow, if you want

19:40 <DavidHeidelberg[m]> How is the LAVA tag passed as a tag to GitLab CI?

19:43 <DavidHeidelberg[m]> exactly, the "devicetype" is passed as a "tag" in GitlabCI

19:43 <daniels> DavidHeidelberg[m]: yeah, each LAVA device tag has a matching GitLab CI tag

19:44 <DavidHeidelberg[m]> but how does it get propagated? I don't seem to see the new device

19:44 <daniels> ah, yes

19:45 <daniels> I'll set that up in the morning - it's not automatic, you have to log in to brea and register a new runner type

19:46 <DavidHeidelberg[m]> thanks or if you want to, I can do it (if you can give me some rights for this task)

19:46 <DavidHeidelberg[m]> (ofc tomorrow)

19:56 Haaninjo has joined #freedesktop

19:57 <daniels> yeah, I can walk you through it tomorrow

20:15 ybogdano has joined #freedesktop

20:44 <jenatali> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20617 - looks like Marge picked the wrong pipeline?

20:44 <jenatali> When that happens, is there harm in hitting the manual merge button?

20:46 <anholt> she's still on that MR, and it looks good to me. ack.

20:47 <jenatali> Good enough for me

20:49 <jenatali> Which, harkening back to the convo about permissions for main, is a good reason to keep allowing devs access, until that bug is fixed

21:08 ___nick___ has quit [Ping timeout: 480 seconds]

21:08 mvlad has quit [Remote host closed the connection]

21:09 ybogdano has quit [Ping timeout: 480 seconds]

21:17 <DavidHeidelberg[m]> Marge bot looks dead, anyone seeing her working in any MR?

21:19 <jenatali> David Heidelberg: Apparently it still waited for the timeout on mine, even after I merged it because she didn't see the pipeline making progress

21:20 <DavidHeidelberg[m]> she seems to be stuck here: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20679 ?

21:20 <DavidHeidelberg[m]> ...and it's working. Magic.

21:20 <jenatali> David Heidelberg: I just got a notification that she "gave up" on !20617

21:21 <DavidHeidelberg[m]> I think Marge deserves some investments (in terms of fixing her code a bit)

21:22 <jenatali> Yeah that pipeline race is nasty

21:31 ybogdano has joined #freedesktop

21:40 danvet has quit [Ping timeout: 480 seconds]

21:59 ybogdano has quit [Ping timeout: 480 seconds]

22:07 ybogdano has joined #freedesktop

22:50 karolherbst has quit [Remote host closed the connection]

23:04 karolherbst has joined #freedesktop

23:15 ybogdano has quit [Ping timeout: 480 seconds]

23:31 ybogdano has joined #freedesktop