#freedesktop on 2023-03-30 — irc logs at oftc.irclog.whitequark.org

2022-12-21 00:45 ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org

00:10 co1umbarius has joined #freedesktop

00:12 columbarius has quit [Ping timeout: 480 seconds]

01:00 DragoonAethis has quit [Quit: hej-hej!]

01:00 DragoonAethis has joined #freedesktop

02:12 marcheu_ has joined #freedesktop

02:12 lkundrak has quit [Remote host closed the connection]

02:12 mslusarz has quit [Remote host closed the connection]

02:12 glisse_ has quit [Remote host closed the connection]

02:12 glisse has joined #freedesktop

02:14 marcheu has quit [Ping timeout: 480 seconds]

02:17 lkundrak has joined #freedesktop

02:22 mslusarz has joined #freedesktop

02:31 karolherbst_ has joined #freedesktop

02:38 karolherbst has quit [Ping timeout: 480 seconds]

02:47 lkundrak_ has joined #freedesktop

02:47 lkundrak has quit [Read error: Connection reset by peer]

02:48 mslusarz has quit [Read error: Connection reset by peer]

02:48 mslusarz has joined #freedesktop

02:48 glisse has quit [Read error: Connection reset by peer]

02:49 glisse has joined #freedesktop

02:52 marcheu_ has quit [Remote host closed the connection]

02:52 marcheu has joined #freedesktop

03:02 Leopold_ has joined #freedesktop

04:11 damian has quit [Read error: Connection reset by peer]

04:12 damian has joined #freedesktop

04:29 ximion1 has joined #freedesktop

04:32 ximion has quit [Ping timeout: 480 seconds]

05:01 ximion1 has quit [Quit: Detached from the Matrix]

05:13 danvet has joined #freedesktop

05:39 Haaninjo has joined #freedesktop

06:08 <DavidHeidelberg[m]> sanity job is pending 8 minutes, that doesn't look right: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21971

06:48 mvlad has joined #freedesktop

07:26 <MrCooper> robclark: ideally toggling the farms should be out of band, it's noise in the Git history

07:30 <bentiss> MrCooper: but given how the CI is articulated now, it should be just an include away to externalize those few variables, no?

07:33 <bentiss> daniels: Regarding the 5xx errors we get on S3, I think I've cornered what is happening: https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-osd/#flapping-osds -> I set noup,nodown last night, and the cluster seems in much better shape now

07:34 <bentiss> I also enabled per-gitaly backups, though this morning I realized the filename was incorrect, so I had to restart them

07:34 <MrCooper> bentiss: not sure, but yeah it might not be hard, just someone has to do it :)

07:35 alanc has quit [Remote host closed the connection]

07:36 alanc has joined #freedesktop

07:37 <bentiss> daniels: the bad news is that I suspect we need at least to switch to wireguard-native in k3s, because this is related to network issues (https://docs.k3s.io/installation/network-options#migrating-from-wireguard-or-ipsec-to-wireguard-native)

07:37 <bentiss> the switch to wireguard-native involves a short downtime while we reboot all the machines

07:39 <bentiss> I'd also like to upgrade rook to 1.11, it has some better metrics gathering, but I'd like to do this once the flapping has stalled

07:40 <ishitatsuyuki> looks like I'm a developer on mesa/mesa, but not actually a member of the group "mesa", so I don't have CI-OK. should I ask to get added to the group "mesa" instead?

07:41 <bentiss> ishitatsuyuki: either in mesa as guest at least, or in mesa/ci-ok, yes

07:43 <eric_engestrom> robclark, MrCooper, bentiss: I was talking about that with someone a couple of days ago because I thought the same, but actually the problem with having the farm toggle in another repo is that it's great for disabling, but for re-enabling if some test got broken in the meantime, when you re-enable the farm you end up breaking random unrelated MRs, so re-enabling needs to go through an MR

07:45 <eric_engestrom> our conclusion was that disabling a farm should be a direct push, which will make the MR currently being merged fail but it would have probably failed anyway since the farm is down, that way it doesn't have to wait for the whole queue to be either processed or manually flushed, and re-enabling a farm needs to be done in an MR so that anything that broke in the meantime gets fixed before it's turned back on

08:00 <MrCooper> hmm yeah, good point

08:01 <bentiss> eric_engestrom: Good point. Maybe I am over-engineering it, but I think it would be nice to have each farm in a child pipeline. If the separate repo says that the farm is offline -> the pipeline is not generated for this farm. If it's online but the last main pipeline was allowed-to-fail and failed -> mark as allowed-to-fail, and if the last was a success -> mark as required

08:02 <bentiss> this way, you can disable the farms in a separate repo, each farm gets its own pipeline (so clearer for users), and you can re-enable them but not make them block if the farm has been down for some time

08:03 woodwose has quit [Ping timeout: 480 seconds]

08:05 <eric_engestrom> bentiss: I've never used a child pipeline so I have no idea how it works and what its constraints are, but your solution sounds reasonable

08:05 <bentiss> eric_engestrom: we use them in ci-templates if you want to have a look at how they look from the UI point of view

08:05 <eric_engestrom> also, I just caught up with #dri-devel and I see that alyssa already said this there, so s/someone/alyssa/ in my earlier message :P

08:06 <bentiss> eric_engestrom: but the interesting bit is that you can also generate the child pipeline in a job, so it gets dynamic

08:08 <eric_engestrom> which is where "pull this external file and decide whether these jobs exist or not" comes in

08:09 <eric_engestrom> interesting yeah

08:09 <bentiss> exactly :)
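
The per-farm decision bentiss describes above could be sketched roughly as follows. This is only an illustration of the idea, not code from mesa or ci-templates: the function names, the shape of `last_result`, and the job layout are all hypothetical. Two cases are not covered in the discussion and are filled in here as labeled assumptions: a farm with no previous result is treated as allowed-to-fail, and a farm that failed while required stays required.

```python
import json
from typing import Optional


def farm_mode(online: bool, last_result: Optional[dict]) -> Optional[str]:
    """Return None (generate no jobs), "allowed-to-fail", or "required".

    `last_result` is a hypothetical summary of the farm's jobs in the last
    main pipeline: {"status": "success" | "failed", "allowed_to_fail": bool}.
    """
    if not online:
        # The separate repo says the farm is offline: no pipeline for it.
        return None
    if last_result is None:
        # Assumption: a farm with no recent result should not block merges.
        return "allowed-to-fail"
    if last_result["status"] == "success":
        return "required"
    if last_result["allowed_to_fail"]:
        # Last run was allowed-to-fail and failed: keep it non-blocking.
        return "allowed-to-fail"
    # Assumption: a farm that failed while required stays required.
    return "required"


def child_pipeline(farm: str, mode: Optional[str]) -> Optional[str]:
    """Render a child pipeline definition for one farm, or None if offline.

    The result is JSON, which YAML parsers generally accept, so a generator
    job could write it to a file and expose it as an artifact.
    """
    if mode is None:
        return None
    jobs = {
        f"{farm}-tests": {
            "script": ["echo placeholder for the farm's real test script"],
            "allow_failure": mode == "allowed-to-fail",
        }
    }
    return json.dumps(jobs, indent=2)
```

A generator job would call `farm_mode()` once per farm after fetching the external toggle file, then write the output of `child_pipeline()` to an artifact that a trigger job includes.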

09:19 thaller has quit [Ping timeout: 480 seconds]

09:28 AbleBacon has quit [Read error: Connection reset by peer]

09:28 vbenes has quit [Ping timeout: 480 seconds]

09:37 <hakzsam> radv-radven-vkcts job slow ? https://gitlab.freedesktop.org/mesa/mesa/-/jobs/39027704

09:37 <hakzsam> elapsed time is 30 minutes

09:49 <daniels> hakzsam: the machine died during the first run, so it started a new one

09:49 <hakzsam> ok

09:53 Leopold_ has quit [Remote host closed the connection]

09:53 Leopold_ has joined #freedesktop

09:57 <eric_engestrom> btw, I rewrote the pipeline trace jq script in python so that it fits in ci-fairy instead of making it a job script that each repo would have to copy (and made a few improvements, main ones being making the format compatible with modern perfetto, and adding queue time): https://gitlab.freedesktop.org/eric/ci-templates/-/commits/ci-fairy-pipeline-trace

09:57 <eric_engestrom> but I'm hitting an issue where `when: always` or `rules: [when: always]` doesn't work and the job gets skipped if any previous job is cancelled (I'm cancelling most jobs in my test pipelines to avoid burning resources for nothing): https://gitlab.freedesktop.org/eric/mesa/-/commits/ci-export-pipeline-trace

09:58 <eric_engestrom> bentiss, daniels: ^ since you were interested :)

10:02 woodwose has joined #freedesktop

10:02 <bentiss> eric_engestrom: do we care about canceled pipelines though?

10:04 <bentiss> actually.... maybe we can use hookiedookie for that: we register a pipeline event, and when the pipeline is terminated, hookiedookie calls ci-fairy pipeline_trace which then adds a comment on the MR

10:05 <eric_engestrom> beyond me testing my code, I was thinking about an MR where "things" take too long and Marge times out, and the user cancels it to avoid burning resources, but we still want to generate the trace to see why "things" took too long

10:05 <bentiss> yeah, so it should be out of the pipeline. A webhook with hookiedookie might do the job :)
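
The out-of-pipeline approach could start from a filter like the one below, which looks at a GitLab pipeline webhook payload and decides whether the pipeline has terminated. It is a minimal sketch and does not use hookiedookie's actual configuration or API, which are not shown in this log. The function name and the returned dictionary are hypothetical, and the set of statuses counted as "terminated" is an assumption. Canceled pipelines are included because the case raised above is a user cancelling a slow pipeline.

```python
from typing import Optional

# Assumption: the pipeline statuses that count as "terminated".
TERMINAL_STATUSES = {"success", "failed", "canceled", "skipped"}


def finished_pipeline(payload: dict) -> Optional[dict]:
    """Return identifiers for a terminated pipeline, or None to ignore the event."""
    if payload.get("object_kind") != "pipeline":
        return None
    attrs = payload.get("object_attributes", {})
    if attrs.get("status") not in TERMINAL_STATUSES:
        return None
    # "merge_request" is null for pipelines that do not belong to an MR.
    merge_request = payload.get("merge_request") or {}
    return {
        "project_id": payload.get("project", {}).get("id"),
        "pipeline_id": attrs.get("id"),
        "mr_iid": merge_request.get("iid"),
    }
```

The returned identifiers are what a follow-up step would need to generate the trace and post it as a comment on the MR.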

10:06 <eric_engestrom> sounds good :)

10:07 <eric_engestrom> I'll MR my ci-fairy change then, and let you convert my mesa ci commit into a hookiedookie thing :)

10:07 <bentiss> heh. no guarantees I can do that today or before next week

10:08 <bentiss> once we got the json output, what tool do you call to get a proper png?

10:09 <bentiss> actually... can't we use gitlab's graph facility instead of a plain json?

10:10 <eric_engestrom> https://gitlab.freedesktop.org/freedesktop/ci-templates/-/merge_requests/170

10:10 <eric_engestrom> > Can be opened in chrome://tracing or https://ui.perfetto.dev

10:11 <eric_engestrom> I included a screenshot of how it looks
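
For readers unfamiliar with the JSON that chrome://tracing and https://ui.perfetto.dev open, here is a rough sketch of turning job timestamps into trace events, with queue time shown as its own span. This is not eric_engestrom's ci-fairy code. It assumes each job is a dictionary with `name`, `created_at`, `started_at` and `finished_at` ISO 8601 timestamps, and the helper names are hypothetical.

```python
from datetime import datetime


def to_us(timestamp: str) -> int:
    """Convert an ISO 8601 timestamp to microseconds since the epoch."""
    parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    return int(parsed.timestamp() * 1_000_000)


def trace_events(jobs: list) -> dict:
    """Build a Trace Event Format document with two spans per job."""
    events = []
    for tid, job in enumerate(jobs):
        if not job.get("started_at") or not job.get("finished_at"):
            continue  # job never ran (for example skipped or canceled early)
        created = to_us(job["created_at"])
        started = to_us(job["started_at"])
        finished = to_us(job["finished_at"])
        # "X" is a complete event: a start timestamp plus a duration,
        # both expressed in microseconds.
        events.append({"name": f"{job['name']} (queued)", "ph": "X",
                       "ts": created, "dur": started - created,
                       "pid": 1, "tid": tid})
        events.append({"name": job["name"], "ph": "X",
                       "ts": started, "dur": finished - started,
                       "pid": 1, "tid": tid})
    return {"traceEvents": events}
```

Each job gets its own `tid`, so the viewer draws one row per job. Dumping the result with `json.dump()` gives a file that can be loaded in either viewer.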

10:12 <bentiss> with mermaid, we could integrate it directly as a comment in the MR...

10:13 <bentiss> https://gitlab.freedesktop.org/bentiss/test/-/merge_requests/4#note_1846574 for example

10:13 <eric_engestrom> sure why not, but I worry that it might be huge and/or unreadable

10:13 <eric_engestrom> (there's a *lot* of jobs)

10:14 <bentiss> heh, maybe :)

10:20 <eric_engestrom> bentiss: "heh. no guarantees I can do that today or before next week" we've lived without this for a long time, there's no rush :P

10:21 <eric_engestrom> in the meantime any of us can run `ci-fairy pipeline-trace` locally

10:29 <daniels> eric_engestrom: oh nice, thanks!

10:30 <daniels> bentiss: and awesome work finding the s3 issue - the OSDs were really just sniping each other? ha

10:30 vbenes has joined #freedesktop

10:31 <bentiss> daniels: I guess it's a combination of a lot of things. But as soon as they start flapping, the cluster enters a state where it detects slow ops everywhere.

10:31 <bentiss> I haven't seen a slow op since this morning

10:37 <daniels> \o/

10:39 <bentiss> yep :)

10:39 <bentiss> I'll turn off the noup/nodown flags once the backups have terminated

10:39 <bentiss> (and once I've removed the last 1TB backup from last week)

10:55 thaller has joined #freedesktop

11:13 vbenes has quit [Ping timeout: 480 seconds]

11:13 lack has quit [Ping timeout: 480 seconds]

11:25 vbenes has joined #freedesktop

11:25 vbenes has quit [Remote host closed the connection]

12:13 vkareh has joined #freedesktop

12:32 karolherbst_ is now known as karolherbst

13:12 fr0hike has joined #freedesktop

13:15 nehsou^ has quit [Remote host closed the connection]

13:37 <bentiss> sigh... I removed the noup/nodown flags half an hour ago, and slowops are popping up :/

13:45 woodwose has quit [Ping timeout: 480 seconds]

13:47 <bentiss> HEALTH_OK -> upgrading rook

13:59 woodwose has joined #freedesktop

14:22 woodwose has quit [Remote host closed the connection]

14:23 woodwose has joined #freedesktop

14:24 lack has joined #freedesktop

15:36 jramsay has joined #freedesktop

15:39 lack has quit [Ping timeout: 480 seconds]

15:54 karolherbst_ has joined #freedesktop

15:58 karolherbst has quit [Ping timeout: 480 seconds]

16:19 MrCooper has quit [Remote host closed the connection]

16:21 MrCooper has joined #freedesktop

16:25 MrCooper has quit [Remote host closed the connection]

16:26 MrCooper has joined #freedesktop

17:10 FileJanitor has joined #freedesktop

17:13 karolherbst_ is now known as karolherbst

17:17 FileJanitor has left #freedesktop [#freedesktop]

17:26 ximion has joined #freedesktop

17:55 <zmike> my merge just got btfo by a shader-db run on nouveau https://gitlab.freedesktop.org/mesa/mesa/-/jobs/39058857

17:55 <zmike> or something

18:02 a-l-e has joined #freedesktop

18:04 a-l-e has quit []

18:10 mvlad has quit [Remote host closed the connection]

18:48 a-l-e has joined #freedesktop

18:48 AbleBacon has joined #freedesktop

18:57 vkareh has quit [Quit: WeeChat 3.6]

19:10 fr0hike has quit [Remote host closed the connection]

19:11 fr0hike has joined #freedesktop

19:37 a-l-e has quit [Quit: Leaving]

19:41 fr0hike has quit []

20:39 danvet has quit [Ping timeout: 480 seconds]

22:01 Leopold_ has quit [Remote host closed the connection]

22:09 Leopold has joined #freedesktop

22:20 mohamexiety has joined #freedesktop

22:24 <daniels> zmike: that machine was desperately unhealthy, so I nuked it

22:24 <zmike> tremendous

22:34 Haaninjo has quit [Quit: Ex-Chat]

23:11 krushia has quit [Ping timeout: 480 seconds]

23:15 krushia has joined #freedesktop

23:21 infernix has quit [Ping timeout: 480 seconds]

23:22 infernixx has joined #freedesktop