ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
<imirkin> the latter was just hung
<imirkin> same job, same machine, same everything
<imirkin> without knowing the setup, feels like there's a load-balanced firewall, and one of the firewalls just sucks at TCP.
<__tim> my current theory is that because gstreamer is being stupid at the moment (it downloads 600MB of media data via git-lfs for every gstreamer integration job, plus possibly also a full couple-of-hundred MB git repo pull), that can lead to multiple jobs downloading lots of stuff in parallel, which then totally kills uploads because there's some
<__tim> bandwidth constraint/packet loss in the middle somewhere or something *waves hands*
<__tim> but that may well be complete nonsense
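A minimal sketch of how a job could avoid re-downloading the full media set every run, assuming only a subset of LFS-tracked paths is actually needed (the repository URL and include pattern below are illustrative):

    # clone without smudging LFS objects, then pull only what the job needs
    GIT_LFS_SKIP_SMUDGE=1 git clone https://gitlab.freedesktop.org/gstreamer/gstreamer.git
    cd gstreamer
    git lfs pull --include="subprojects/gst-integration-testsuites/medias/*"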
<imirkin> like a job was stuck for 5 minutes "downloading artifacts". i restarted it, lands on the same runner, and download goes totally fine.
jstein has joined #freedesktop
ngcortes has quit [Remote host closed the connection]
ybogdano has quit [Ping timeout: 480 seconds]
bnieuwenhuizen has quit [Quit: Bye]
bnieuwenhuizen has joined #freedesktop
ximion has quit []
Seirdy has joined #freedesktop
danvet has joined #freedesktop
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
sunarch has quit []
thaller has joined #freedesktop
egbert has joined #freedesktop
egbert is now known as Guest7210
egbert has joined #freedesktop
egbert has quit [Quit: leaving]
egbert has joined #freedesktop
egbert has quit []
Guest7210 has quit []
egbert has joined #freedesktop
egbert has quit []
ximion has joined #freedesktop
<bentiss> __tim, daniels: FYI, I had a meeting with equinix a couple of weeks ago (a follow-up to my XDC presentation), and I told the guy that sometimes we were seeing packet loss between packet and hetzner
<bentiss> __tim, daniels: he told me that if/when this happens, we should give him the output of `mtr` and he'll be able to debug the things on the equinix side
<bentiss> I am running `mtr minio-packet.freedesktop.org` and it eventually manages to get all the hops in the middle
<__tim> I'm not sure it actually represents packet loss, I think it might just be routers in the middle dropping/throttling icmp packets, since it shows 0%-ish again for later hops
<bentiss> __tim: I think what matters is the route the packets are taking, and they can check if there is anything wrong (planned maintenance or something like that)
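A report-mode mtr run produces output that can be handed over as-is; TCP probes also sidestep the ICMP rate-limiting __tim mentions above (port 443 is an assumption here):

    # 100 cycles, wide report, TCP SYN probes to the HTTPS port
    mtr --report-wide --report-cycles 100 --tcp --port 443 minio-packet.freedesktop.org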
egbert has joined #freedesktop
pjakobsson has joined #freedesktop
<__tim> right
ximion has quit []
<__tim> bentiss, maybe we should run some iperf tests or such so we have a baseline?
<bentiss> __tim: sure
<bentiss> __tim: is minio-packet.fd.o the one having the most issues?
<bentiss> or the others from the k3s cluster
<__tim> I don't know tbh
<__tim> sec
<__tim> so we're having two problems afaict, one recent, one not so recent
jstein has quit []
<bentiss> __tim: which ip should I allow in the firewall for iperf3?
<__tim> the recent problem is that artefact upload stalls and times out. Not sure when that started happening, perhaps a week or so ago?
<__tim> 95.217.116.50 + 95.217.116.51
<bentiss> k, thanks
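A sketch of the corresponding firewall change, assuming plain iptables and the default iperf3 port (5201); adjust for whatever actually manages the rules:

    # allow the two Hetzner runner IPs to reach the iperf3 server
    iptables -A INPUT -p tcp -s 95.217.116.50 --dport 5201 -j ACCEPT
    iptables -A INPUT -p tcp -s 95.217.116.51 --dport 5201 -j ACCEPT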
<__tim> when that happens the upload seems to putter about at like 20-50 kB/sec, which is clearly ridiculous speed-wise
<__tim> (just from one or two random observations)
<bentiss> __tim: I have set up an iperf3 server on minio-packet.freedesktop.org and those 2 ips should be able to contact it (it's running in tmux, hopefully it'll stay up a bit)
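Roughly what the two ends look like; by default the iperf3 client sends, so the plain run measures the runner-to-minio (upload) path and -R reverses it. Adding -P 4 for parallel streams can show whether a single TCP connection is being throttled somewhere:

    # server side, kept alive in tmux
    iperf3 -s

    # client side (Hetzner runner): upload direction, then reverse for download
    iperf3 -c minio-packet.freedesktop.org -t 10
    iperf3 -c minio-packet.freedesktop.org -t 10 -R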
<__tim> [ 5] 0.00-10.00 sec 331 MBytes 277 Mbits/sec 0 sender
<__tim> [ 5] 0.00-10.09 sec 330 MBytes 274 Mbits/sec receiver
<__tim> I also have no explanation why e.g. a git clone tops out at 7-8MB/sec (avg 4.5MB/sec) then
<bentiss> __tim: maybe I should add an iperf3 server on the cluster itself
<__tim> I get much higher speeds elsewhere so the server seems plenty fast in general
<bentiss> __tim: try with 147.75.38.77 -> this is an in-cluster IP
<bentiss> seems fast enough too
<__tim> let me try from both machines at the same time
<__tim> seems just fine too, same values (one iperf to in-cluster, the other to minio)
<bentiss> maybe it's because we are using wireguard internally
<bentiss> for in cluster communication
<bentiss> but the CPU doesn't seem to be all that busy
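Two quick checks for the wireguard theory; the interface name wg0 is an assumption (with k3s' wireguard flannel backend it may be named differently):

    # per-peer byte counters on the tunnel
    wg show wg0 transfer

    # see whether any single core is pegged with softirq/crypto work
    mpstat -P ALL 1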
<__tim> at the same time, git clone from gnome.gitlab.org is at full speed (I presume that's gc hosted, not sure), so it's not that the client machine is incapable either
<__tim> it's all rather puzzling and I can't see any explanation for upload starvation to 10s of kB/sec
<__tim> my theory is still that there's some bottleneck "somewhere in the middle" and we're ending up with upload starvation when there's too much downloading going on
<bentiss> yep, there is a bottleneck :(
<bentiss> ideally I'd just drop wireguard in the middle, but that requires some changes in the infra and carries some risk...
<__tim> for giggles I set up a wireguard tunnel to the htz virginia DC (from the machine there I can git clone at 40MB/sec from fdo), and it's showing the same 280Mbit/sec-ish iperf throughput, but then actual git clone again is very pedestrian
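A minimal point-to-point test tunnel along those lines, with placeholder keys, addresses and endpoint:

    ip link add wg-test type wireguard
    wg set wg-test private-key /etc/wireguard/test.key \
        peer <peer-public-key> endpoint <remote-ip>:51820 allowed-ips 10.99.0.2/32
    ip address add 10.99.0.1/24 dev wg-test
    ip link set wg-test up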
<bentiss> the problem I see with wireguard is that all of our internal communication is encrypted with it, so when you access data from ceph, you end up encrypting it maybe 6 times
<bentiss> (access to 3 disks on nodes, from the gitaly pod, then forward to webservice, then nginx)
<bentiss> ok that's 5
<bentiss> and FWIW, on ceph, we are constantly reading at ~15-20 MB/s, with spikes at 80
<bentiss> from what I can see on the dashboard
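The same numbers are available from the CLI, assuming admin access to the ceph cluster:

    # cluster-wide and per-pool client I/O rates
    ceph -s
    ceph osd pool stats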
pendingchaos_ has joined #freedesktop
pendingchaos has quit [Ping timeout: 480 seconds]
pendingchaos_ is now known as pendingchaos
<alatiera> do we use a pre-clone script for minio?
<alatiera> and if so where do the scripts come from
<daniels> alatiera: a) yes, and b) the project
<daniels> projects can set $CI_PRE_CLONE_SCRIPT
<daniels> so the runners have pre_clone_script = "eval \"$CI_PRE_CLONE_SCRIPT\""
<alatiera> daniels: thanks! that seems like what I remember seeing in the past
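For illustration, a hypothetical shape for such a pre-clone script (the cache URL and path are made up and do not reflect the real fd.o setup): seed the build directory from a cached tarball so the runner's subsequent git fetch only has to transfer the delta.

    # hypothetical $CI_PRE_CLONE_SCRIPT body
    set -eu
    mkdir -p "$CI_PROJECT_DIR"
    curl -sfL "https://minio-packet.freedesktop.org/git-cache/$CI_PROJECT_PATH.tar" \
        | tar -C "$CI_PROJECT_DIR" -xf - || true   # fall back to a normal clone on failure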
Haaninjo has joined #freedesktop
jarthur has joined #freedesktop
jstein has joined #freedesktop
jstein has quit []
jarthur has quit [Quit: Textual IRC Client: www.textualapp.com]
jarthur has joined #freedesktop
ximion has joined #freedesktop
ybogdano has joined #freedesktop
___nick___ has joined #freedesktop
jstein has joined #freedesktop
pinkflames[m] has left #freedesktop [#freedesktop]
___nick___ has quit []
___nick___ has joined #freedesktop
___nick___ has quit []
___nick___ has joined #freedesktop
ngcortes has joined #freedesktop
___nick___ has quit [Ping timeout: 480 seconds]
Haaninjo has quit [Quit: Ex-Chat]
i-garrison has quit []
i-garrison has joined #freedesktop
danvet has quit [Ping timeout: 480 seconds]
iNKa has joined #freedesktop
Brocker has quit [Read error: Connection reset by peer]