ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
Thymo_ has quit [Ping timeout: 480 seconds]
Thymo has joined #freedesktop
ngcortes has quit [Ping timeout: 480 seconds]
ximion has quit [Ping timeout: 480 seconds]
ximion has joined #freedesktop
ximion has quit []
jstein has quit [Ping timeout: 480 seconds]
i-garrison has quit []
i-garrison has joined #freedesktop
distronfk has joined #freedesktop
gofast has joined #freedesktop
gofast has quit [Remote host closed the connection]
distronfk has quit []
danvet has joined #freedesktop
<jljusten> hmm, it seems https://gitlab.freedesktop.org/mesa/mesa.git fetch is failing again. :( known?
<jljusten> ci seems to have prevented marge-bot from merging anything for some hours. probably related...
<mceier> seems to be working now
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
<jljusten> mceier: hmm, it seems inconsistent. maybe it occasionally takes a different route?
<jljusten> maybe if it's inconsistent, that makes it nearly inevitable that CI will fail at some point.
<mceier> dunno; at first it didn't work for me either, but after a while I tried a 2nd time and the clone worked
<jljusten> I'm just trying `git fetch origin`. So, basically no traffic. I would say it is hanging more than 80% of the time.
<bentiss> daniels: thanks for fixing it. Which pod did you restart?
<bentiss> jenatali: re pushing to a fork: I vaguely remember either github or gitlab disabled that feature. I wonder if this is related
<bentiss> daniels: I wonder what is happening. I do see a lot of 504 errors on many projects (freedesktop/fdo-containers-usage.git mesa/drm.git wayland/wayland-protocols.git wayland/weston.git gfx-ci/mesa-performance-tracking.git /mesa/kmscube.git mesa/mesa.git...)
<bentiss> it's not like the spike at midnight when you looked at it (~220 per hour compared to ~25000 per hour), but still
<jljusten> https://gitlab.freedesktop.org/users/marge-bot/activity shows Mesa CI has prevented any change from merging (to Mesa) for 10 hours. It might be related to the network issues.
<bentiss> daniels: 2021-12-10T08:11:35.007479294Z stdout F {"correlation_id":"01FPHNS6RZ4A8BASHM2XJDFYKC","error":"handleGetInfoRefs: rpc error: code = Canceled desc = context canceled","level":"error","method":"GET","msg":"","time":"2021-12-10T08:11:35Z","uri":"/freedesktop/fdo-containers-usage.git/info/refs?service=git-upload-pack"}
<bentiss> the request started at 2021-12-10T08:08:47.569684687 and was cancelled at 2021-12-10T08:11:35.007479294Z
<bentiss> it seems gitaly is under too heavy load
<bentiss> ouch there is a request (01FPHPGBG2MJRY7P7HYB4NNC6T) that tried to clone mesa/drm for 10 min and got cancelled
<bentiss> 10 min seems a bit excessive
<bentiss> so... I think what is happening is that we have too many CI jobs running that don't leverage a git cache before starting. That adds too much load on gitaly and keeps it from answering in a reasonable time
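A minimal sketch of the kind of per-runner git cache being described here; the cache path and the runner having persistent local storage are assumptions for illustration, not how the fdo runners are actually set up:

    # Keep a bare mirror on the runner, refresh it first, then clone from it
    # so only the missing objects are fetched from gitlab.freedesktop.org.
    CACHE=/var/cache/git/mesa.git    # hypothetical persistent path on the runner
    if [ ! -d "$CACHE" ]; then
        git clone --mirror https://gitlab.freedesktop.org/mesa/mesa.git "$CACHE"
    else
        git -C "$CACHE" fetch --prune origin
    fi
    git clone --reference "$CACHE" https://gitlab.freedesktop.org/mesa/mesa.git mesa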
<jljusten> Is there some kind of access throttle? If I wait a while and run `git fetch origin`, it seems to often go through. But, if I run fetch again right away, it seems to get stuck. I saw it a couple times, but I'm not sure if it's a real pattern.
<bentiss> also, why are we getting a lot of pulls on mesa/drm.git from Google? I hope these are CI and not scrapers
<jljusten> bentiss: someone recently bumped the Mesa drm dependency beyond what debian has packaged, so maybe it started getting built from scratch in ci?
<bentiss> jljusten: could be
<bentiss> jljusten: regarding the access throttle, on the canceled requests, if I take the correlation ID, I do not see anything in the gitaly log. I do see the same correlation ID from successful requests
<jljusten> bentiss: it looks like daniels made the related gitlab-ci change in Mesa, but it appears that we were building libdrm before as well, so I don't think that is the change.
<bentiss> so my guess is that we do have requests that are not served by gitaly because it's too busy serving the ones it's currently working on
<bentiss> jljusten: well, it can also depend on how people started to work differently
<bentiss> actually.... mesa/drm is rather small, so probably not the culprit
<MrCooper> Mesa's CI only builds libdrm when building a docker image, which is rare
immibis has joined #freedesktop
<tomeu> bentiss: regarding pulls from google, it could be their git mirrors
<bentiss> well, mirroring should not start from scratch, so it should be quick. But again, mesa/drm.git is relatively small, so probably just a side effect
immibis has quit [Remote host closed the connection]
<bentiss> since yesterday 21:00 (European time) I see *a lot* of successful git-upload-pack requests that take more than 1000s... this correlates with some kind of load
<bentiss> this was definitely not the case in the past 10 days
<bentiss> currently trimming the disks, which gives them a lot more breathing room
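For reference, a one-off trim on a node looks roughly like the following; whether it is run per filesystem or via the periodic timer is not shown in this log:

    # Discard unused blocks on every mounted filesystem that supports it (verbose).
    fstrim -av
    # Or rely on the periodic variant instead of one-off runs:
    systemctl enable --now fstrim.timer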
aleksander has joined #freedesktop
immibis has joined #freedesktop
<daniels> bentiss: redis was the one which seemed really stuck, so I restarted large-5 and cleared out the redis PVC
<daniels> bentiss: but I also had to stick in a ufw allow, because I was seeing a bunch of inter-node dropped packets - the fact that a clone times out sometimes but can succeed if you start it again immediately afterwards makes me wonder if it's really gitaly load, or if there are particular webservice pods which can't actually speak to gitaly?
<bentiss> still we do see way too many requests that are taking more than 1000 seconds, I wonder where they come from
<daniels> bentiss: the fact that wayland/wayland-protocols.git is erroring is interesting, because that repo is _tiny_
<daniels> bentiss: we could maybe look at enabling gitaly's upload-pack cache?
<bentiss> daniels: you mean add some minio caching of those repos?
<daniels> that's really new
<bentiss> that seems appealing
<bentiss> we'll have to monitor the disks though -> "This is because in some cases, it can create an extreme increase in the number of bytes written to disk."
<daniels> yeah, it does seem useful, but we'd definitely need to make sure it landed in local-SSD storage rather than trying to replicate the cache across Ceph since that would just exacerbate the problem :P
<daniels> right
Chaz6 has joined #freedesktop
ppascher has quit [Ping timeout: 480 seconds]
<bentiss> doesn't seem to be available in the chart though -> https://docs.gitlab.com/charts/charts/gitlab/gitaly/index.html (from a very quick look)
agd5f has quit [Read error: Connection reset by peer]
aleksander0m has joined #freedesktop
aleksander has quit [Ping timeout: 480 seconds]
Haaninjo has joined #freedesktop
agd5f has joined #freedesktop
<pq> I think the retry of that is going to time out... has been stuck around git-clone for three minutes.
Haaninjo has quit [Ping timeout: 480 seconds]
<pq> me fetching wayland-protocols git seems stuck as well
<pq> sometimes works, mostly not...
ximion has joined #freedesktop
<daniels> postgres pinned at 100% CPU. that's not great.
Haaninjo has joined #freedesktop
xingwozhonghua has quit []
<bentiss> FWIW, it seems most stuck projects are on gitaly-0
<bentiss> I'm going to migrate this pod to another node and see if that helps
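Migrating a pod like that usually amounts to cordoning the busy node and deleting the pod so its StatefulSet reschedules it; the node and pod names below are taken from this log and may carry a release prefix in reality:

    kubectl cordon k3s-large-5                       # keep new pods off the overloaded node
    kubectl -n gitlab delete pod gitaly-0            # the StatefulSet recreates it elsewhere
    kubectl -n gitlab get pod -o wide | grep gitaly  # confirm which node it landed on
    kubectl uncordon k3s-large-5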
<daniels> bentiss: nod, just looking at server-2 and server-3 at least, there's a huge amount of CPU + mem use from vector+elastic
<bentiss> the elastic load must be because I have been running queries since this morning
<bentiss> vector seems a little bit off. This morning I restarted the aggregator, it was using way too much memory
<bentiss> daniels: but note that I migrated gitaly-0 from large-5 to server-2
* daniels nods
<daniels> maybe we should also look at how we partition services, e.g. gitaly vs. postgres/redis vs. webservice vs. sidekiq?
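One hedged way to do that partitioning is plain node labels combined with the per-component nodeSelector values in the Helm chart; the label key and role names here are made up for illustration:

    # Tag nodes by role, then point each chart component (gitaly, postgres/redis,
    # webservice, sidekiq) at its role via its nodeSelector in the chart values.
    kubectl label node k3s-server-2 fdo.example.org/role=db
    kubectl label node k3s-large-5  fdo.example.org/role=git
    kubectl label node k3s-server-4 fdo.example.org/role=web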
<bentiss> ack, but why did this start yesterday at 20:00 UTC???
<daniels> don't look at me, I was out at the time :P
<bentiss> same here
<bentiss> maybe restart of postgres?
<bentiss> it's been up for 96d
<daniels> sure, can try that, but kicking vector first might be an idea since sustained >50% CPU seems a bit much?
<bentiss> sure, let's go on a killing spree (of pods, of course)
<bentiss> actually, I'll kill the nginx ones first to clear the logs, otherwise there is a chance vector tries to re-read them from the start
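A rough sketch of that restart, assuming the ingress controllers live in the same namespace and use the label selector quoted further down; their Deployment brings them straight back:

    # Recreating the pods rotates the log files vector is tailing.
    kubectl -n gitlab delete pod --selector='app.kubernetes.io/name==ingress-nginx'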
<bentiss> maybe a little bit too violent... connection_refused :)
<bentiss> elevator theme running in background
<daniels> bentiss: god I really love jq
<bentiss> why?
<daniels> for i in $(metal device get --project-id $PACKET_PROJECT -o json | jq -r '.[] | select(.hostname | contains("k3s")) | .id'); do metal device get --id $i -o json | jq '{ "hostname": .hostname, "ip_addresses": [.ip_addresses[].address] }'; done
<daniels> I'm looking at k3s-server-2 (where postgres is running), and seeing a lot of ufw drops - both from server-4 and large-5, which seems ... weird?
<bentiss> indeed
<bentiss> should I kill postgres then?
<bentiss> daniels: FWIW, I have a script I am running to get all IPs: packet device get --project-id $PACKET_PROJECT_ID --json | jq 'sort_by(.hostname) | .[] | {(.hostname): {sos: ("ssh -o PubkeyAcceptedKeyTypes=ssh-rsa " + .id + "@sos." + .facility.code + ".platformequinix.com"), ip: (.ip_addresses[0].address)}}' | jq -s add
<bentiss> that also gives me the SOS console with the correct option to be able to log in to it
<bentiss> daniels: it seems that you fixed it, whatever you did
<daniels> really?
<daniels> that was just kicking vector
<bentiss> I now manage to clone a repo I couldn't before
<daniels> clones are still timing out, at least from here ...
<daniels> it's non-deterministic
<bentiss> I bet killing nginx was also part of the "fix"
<bentiss> yep, not fixed
<daniels> one thing I'm noticing is that on k3s-server-2, there are a _lot_ of reqs from k3s-server-3 to etcd getting dropped by ufw
<daniels> but there's an explicit ufw allow rule for k3s-server-3 -> k3s-server-2:etcd
<daniels> so wtf?
<bentiss> I was about to ask the same
<daniels> ooh
<bentiss> that is suspenseful
aleksander0m has quit []
<daniels> heh, not intentionally
<daniels> one of the ways we can land in a ufw block is if a packet arrives with an invalid conntrack state
<daniels> so I tried bumping nf_conntrack_max
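For context, checking whether the conntrack table is actually the bottleneck and raising its ceiling looks roughly like this; the new value is arbitrary:

    # How full is the table, and are inserts failing / packets being dropped?
    sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
    conntrack -S | grep -E 'insert_failed|drop'
    # Raise the limit at runtime (persist it via /etc/sysctl.d/ to survive reboots).
    sysctl -w net.netfilter.nf_conntrack_max=1048576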
go4godvin has joined #freedesktop
<bentiss> getting blocked once again...
<bentiss> I really thought it worked :(
<bentiss> daniels: though you know we are not using NFT?
<daniels> bentiss: yeah, I know it's iptables instead, but conntrack still falls under the core netfilter stuff
<bentiss> daniels: OK, I trust you on this one :)
<bentiss> looking at the prometheus stats, we are not under attack; the number of connections is similar to what we had in the previous week
<daniels> bentiss: here's something weird
<daniels> I've got both the nginx and webservice logs set to follow + grep for my IP
<daniels> doing git clone mesa/mesa over and over
<daniels> in all cases, I see nginx forwarding the req to the webservice svc
<daniels> when it fails, I don't see anything logged in any of the webservice pods
<daniels> `kubectl --context fdo -n gitlab logs --prefix=true --selector='app==webservice' -c webservice --max-log-requests=8 -f | grep $srcip` (or --selector='app.kubernetes.io/name==ingress-nginx' for nginx)
<daniels> (and obviously not `-c webservice` for nginx)
<daniels> ok, failing reqs are isolated to pod/gitlab-prod-webservice-default-fbb598946-chq8q;
<bentiss> daniels: following the errors on the internal grafana, I thought nginx->webservice was OK, but sometimes webservice->gitaly was not
<daniels> that's running on server-4
<bentiss> cordon that node and kick the service?
<daniels> it's been restarted, because ... Redis::CannotConnectError (Error connecting to Redis on gitlab-prod-redis-master.gitlab.svc:6379 (Redis::TimeoutError)):
<daniels> that was the point at which everything went to shit
<daniels> so back to redis being an issue
<daniels> yeah ... ouch
<daniels> 127.0.0.1:6379> info clients
<daniels> # Clients
<daniels> connected_clients:448
<daniels> client_recent_max_input_buffer:40960
<daniels> client_recent_max_output_buffer:0
<daniels> blocked_clients:149
<daniels> tracking_clients:0
<daniels> clients_in_timeout_table:149
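To see what those 149 blocked clients are actually waiting on, something like the following would work; the pod name is a guess derived from the service name above, and auth is omitted:

    # Group connected clients by the command they are currently executing or blocked on.
    kubectl -n gitlab exec gitlab-prod-redis-master-0 -- \
        redis-cli client list | grep -o 'cmd=[^ ]*' | sort | uniq -c | sort -rn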
ximion has quit []
<jekstrand> Good, it's not just me. :)
thaller has joined #freedesktop
thaller_ has quit [Ping timeout: 480 seconds]
<daniels> bentiss: starting to think that the problem lies in between webservice and workhorse ... ?
<daniels> bentiss: on a hung request, I see webservice returning status 200 duration_s: 0.02406, and for the same correlation_id workhorse returning status 500 duration_ms 10884 (which is when I got bored and ^C)
<daniels> bentiss: ok, so webservice just passes an 'ok, go fetch from this gitaly server' back to workhorse, and it's workhorse<->gitaly which never does anything
<daniels> but it's really weird, because webservice can talk to gitaly just fine ...
<daniels> ffs
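A sketch of chasing a single correlation_id across the hops, in the same style as the log-follow command above; the gitaly selector and the workhorse container name are assumptions about the chart's layout:

    cid=01FPHNS6RZ4A8BASHM2XJDFYKC   # example id from the earlier 504 log line
    # workhorse, rails, and gitaly all log the correlation_id for a given request,
    # so a request that dies between workhorse and gitaly shows up on one side only.
    kubectl --context fdo -n gitlab logs --prefix=true --selector='app==webservice' \
        -c gitlab-workhorse --max-log-requests=8 -f | grep "$cid"
    kubectl --context fdo -n gitlab logs --prefix=true --selector='app==gitaly' \
        --max-log-requests=8 -f | grep "$cid"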
moses has joined #freedesktop
jarthur has joined #freedesktop
ximion has joined #freedesktop
i-garrison has quit []
<emersion> filled with spam
<emersion> i'm surprised we don't get spam in issues and such
<daniels> yeah, ajax had a bot called 'hallmonitor' he used to clear it out, but that doesn't get run anymore it seems
<daniels> bentiss: so I'd bottomed it out to workhorse not being able to submit grpc reqs to gitaly when rails+shell can do it just fine, but not why, and in order to get verbose logging out of workhorse you need to install sentry
<daniels> bentiss: we now have a sentry instance installed, and workhorse is logging to it
<daniels> bentiss: however after doing all that ... it is working perfectly
<anholt> daniels: can I order you, like, several beers?
<daniels> anholt: a friend already sent me a bunch as a gift a few weeks ago, so I already have a truly life-threatening amount. but thanks :)
<bentiss> daniels: \o/ congrats! sorry I dropped the ball today, got some non-work-related duties to attend to
<daniels> bentiss: np at all
<daniels> bentiss: so now we have sentry if we want to use it for anything :P
<bentiss> daniels: cool :) (I'll have to google for what this is I must confess. The name rings a bell but that's about it)
* bentiss is going to bed
<daniels> heh, night!
ngcortes has joined #freedesktop