daniels changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
lsd|2 has quit [Remote host closed the connection]
alpernebbi has quit [Ping timeout: 480 seconds]
alpernebbi has joined #freedesktop
lsd|2 has joined #freedesktop
cisco87 has quit [Remote host closed the connection]
cisco87 has joined #freedesktop
damian has quit []
bilboed has quit [Ping timeout: 480 seconds]
bilboed has joined #freedesktop
itaipu has quit [Ping timeout: 480 seconds]
mripard has quit [Ping timeout: 480 seconds]
privacy has joined #freedesktop
alatiera has quit [Quit: Connection closed for inactivity]
<jenatali>
eric_engestrom: :( unfortunately I'm still pretty sure this is a runner problem, not a Dozen problem, which really means it's out of my hands to address
<eric_engestrom>
daniels: thanks! `values/marge-bot/run_marge.sh` is what I was looking for
<eric_engestrom>
jenatali: yeah I think you're right that it's a runner problem; I didn't realize you didn't maintain the runner though, sorry for the pings :]
bmodem has joined #freedesktop
Inline has joined #freedesktop
<daniels>
eric_engestrom: alatiera (& the GSt project generally) maintain the Windows runner(s)
<daniels>
currently it's singular, hence the long queue, and gst has also been a very noisy neighbour until recently, hence the massively variable runtimes
alatiera has joined #freedesktop
<eric_engestrom>
ack, thanks!
<jenatali>
Microsoft does provide the licenses for Windows on the runners though :)
<eric_engestrom>
(I love how you summoned them into the channel :P)
<eric_engestrom>
haha jenatali
<alatiera>
it's magic
<jenatali>
Which I mean to say, through partnership, not sales lol
<eric_engestrom>
very off-topic, but I thought MS had dropped the idea of windows licenses now?
<alatiera>
I saw the matrix ping and realized my irc had dced
<jenatali>
Nah, it still has to be purchased once per machine, and then that machine can upgrade forever
<eric_engestrom>
ack
* eric_engestrom
hasn't installed windows anywhere in... oof, I'm getting old
<jenatali>
Which pays my salary so I can't really complain about this business model too much
<eric_engestrom>
hehe
<pinchartl>
the last time I bought a machine with windows, I had to battle for a year to get reimbursed for the OS license that I was forced to get
<pinchartl>
jenatali: hopefully that didn't affect your salary :-)
<jenatali>
Hah
<pinchartl>
it was a loooooong time ago, before I was old and grumpy
<pinchartl>
I was young and grumpy I suppose
<jenatali>
If it was that long ago then it was probably before I was even here
<jenatali>
Though I am coming up on 12 years here in a few months... Crazy how time flies
<pinchartl>
I'm trying to run a job locally, with a container image from the registry. podman can run the container fine (with a naive 'podman container run -it $registry_url bash'), but the qemu I start in the container complains that it can't initialize KVM. indeed, there's no /dev/kvm in the container. I have a /dev/kvm, and qemu can use it fine on the host. what am I missing to use kvm inside the container?
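For reference, a minimal sketch of the usual missing piece, i.e. passing the device into the container (assumes rootful podman and the same image as above; not verified against this particular job image):

    podman container run -it --device /dev/kvm $registry_url bash
    # inside the container, qemu should then be able to open /dev/kvm;
    # with rootless podman the invoking user also needs rw access to /dev/kvm on the host, e.g. via the kvm group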
<bentiss>
right now, on that runner the command I linked above gives me only 144 fds... so unlikely to be correct when journalctl -f still complains
<karolherbst>
maybe some config being wrong?
<karolherbst>
worst case, strace the bot and see what happens
<bentiss>
karolherbst: it's not the bot, it's something else on the machine
<bentiss>
the bot is just a side effect of not being able to get an fd
<karolherbst>
could be some funky python bug, but yeah...
<karolherbst>
bentiss: try `sysctl fs.file-nr`
<emersion>
i don't think you'd see this error if another process was responsible
<karolherbst>
bentiss: inotify is an API to track changes to filesystems, it's not related to any fd limits
<bentiss>
so not inotify related?
<karolherbst>
no
<karolherbst>
max_user_watches is just the limit on how many watches there can be in total
<karolherbst>
however
<bentiss>
k, and the tool above returns Total inotify Watches: 5559
<bentiss>
Total inotify Instances: 238
<karolherbst>
I don't know what python does which triggers the value: Error { kind: Io(Os { code: 24, kind: Uncategorized, message: "Too many open files" })
<bentiss>
it's rust, not python
<karolherbst>
ahhh
<karolherbst>
then it's rust
<karolherbst>
I'd check with strace and see what fails
<bentiss>
well, journalctl -f also fails
<karolherbst>
mhhh...
<karolherbst>
there is definitely something funky going on
<bentiss>
that was my point: the bot is gone, and I still have issues on the server
<karolherbst>
yeah...
<karolherbst>
check strace then
<karolherbst>
this error can mean anything at this point
<bentiss>
strace on what?
<karolherbst>
whatever fails
<karolherbst>
so maybe journalctl -f?
<bentiss>
inotify_init1(IN_NONBLOCK|IN_CLOEXEC) = -1 EMFILE (Too many open files)
<karolherbst>
uhhh... okay, so it is inotify related after all...
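A rough diagnostic sketch for this kind of EMFILE: each inotify instance is an fd that shows up as an anon_inode:inotify symlink under /proc/*/fd, so they can be counted per process and compared against fs.inotify.max_user_instances:

    find /proc/*/fd -lname 'anon_inode:*inotify*' 2>/dev/null \
        | cut -d/ -f3 | sort | uniq -c | sort -rn | head
    # left column: number of inotify instances held by each PID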
<bentiss>
don't know why hookiedookie would use inotify though
<karolherbst>
yeah.. me neither
<bentiss>
it's a webserver, so a socket would do
<karolherbst>
ohh
<karolherbst>
maybe just listen for files changing on disk
<karolherbst>
to update the RAM cache or something
<karolherbst>
or regenerate files or whatever
<bentiss>
my muscle memory started typing "ps -aef"... and ps -aef | grep gpg | grep defunct | wc -l
<karolherbst>
mhhh
<bentiss>
6117
<karolherbst>
I wonder which of the inotify limits you are hitting.. let's see...
<bentiss>
I wonder why I have so many gpg defunct processes
<karolherbst>
"The user limit on the total number of inotify instances has been reached. "
<karolherbst>
mhhh
<karolherbst>
that's EMFILE
<bentiss>
yeah, that matches the numbers I gave earlier
<karolherbst>
sysctl fs.inotify
<karolherbst>
`fs.inotify.max_user_instances = 128` I guess?
<bentiss>
oh, I know why I use inotify in hookiedookie: it watches the Settings.tmpl file so it can reload itself when there is a change
<karolherbst>
try bumping that
<bentiss>
karolherbst: that helped for journalctl -f > no more errors
<karolherbst>
cool
<bentiss>
what would be a reasonable value?
<bentiss>
256? 512? 1024?
<karolherbst>
maybe try 1024 and see if that's enough?
<bentiss>
k, I'll bump it on all of the nodes
<karolherbst>
I don't know what's reasonable here but 128 is apparently not enough :)
<bentiss>
I just need to remember how to make that setting persistent
<karolherbst>
/etc/sysctl/
<karolherbst>
uhm..
<karolherbst>
sysctl.d/
<bentiss>
/etc/sysctl.conf and sysctl.d
<bentiss>
yeah
<bentiss>
k, bumped on every node, we'll see if that fails once again
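For reference, a minimal sketch of the persistent version, using the 1024 value discussed above (the drop-in file name is arbitrary):

    cat > /etc/sysctl.d/90-inotify.conf <<'EOF'
    fs.inotify.max_user_instances = 1024
    EOF
    sysctl --system   # re-applies every sysctl config file, including the new drop-in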
<karolherbst>
I wish I had time this year to work on marge :')
<bentiss>
though we probably want to keep it close to upstream
<bentiss>
the marge we are using is 7 months old, not sure what happened upstream since
<karolherbst>
~~maybe once I'm on PTO~~
<karolherbst>
fair...
<bentiss>
PTO is for resting, not marging
<karolherbst>
I know
<karolherbst>
and I'm on PTO next week
<bentiss>
nice
<karolherbst>
sooo...
<karolherbst>
not sure I'll find time to work on marge. however, if anybody feels like working on marge, I think implementing what I mentioned would help tremendously in situations like this
<bentiss>
karolherbst: anyway, thanks heaps for the help with the max_user_thingy
<karolherbst>
no problem. remember to use strace for cases like this as it is a power tool :D
<eric_engestrom>
not tested yet though, and not really reviewed either
<karolherbst>
cool!
<eric_engestrom>
hopefully when I'm back in january I can test it and get upstream to merge it
<eric_engestrom>
I also have a branch based on that one that adds `--job-failure={warn,abort,ignore}` so that we can either cancel the whole thing when a job fails, or at least post a message in the MR so that users can retry asap
<eric_engestrom>
a future improvement will be to add `--job-failure-delay-abort N` to give some time between the job failure and the abort, e.g. 10min, so that we don't waste too many resources but still give a chance to retry
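A hypothetical invocation with those flags, assuming they land as described above; the existing options are elided and the delay value/unit is a guess:

    # appended to whatever marge-bot invocation run_marge.sh already uses:
    marge-bot ... \
        --job-failure=abort \
        --job-failure-delay-abort 10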
<karolherbst>
I think my idea was to only do that once marge actually moves to the next MR
<eric_engestrom>
yeah that's what the MR I posted does
<eric_engestrom>
but I was talking about other things we can also do
<karolherbst>
right..
<karolherbst>
yeah.. I don't really know, I think reducing the load on the CI system is more important than having jobs continue to run for a while which might pass anyway
<eric_engestrom>
yeah, "reducing ci load" has been my focus for the last 2-3 months
<eric_engestrom>
this morning I merged the MR that stops mesa from re-running all the tests right after merging (since we just ran them to get to the point where the MR is merged)
<karolherbst>
yeah.. that should help a lot
<karolherbst>
it would make sense to do that if we didn't rebase, but since we do...
<eric_engestrom>
hopefully with that 2x waste gone, the 50% reduction in ci load will result in a noticeable improvement in marge pipeline times
<karolherbst>
yeah
<karolherbst>
so the only jobs which run post MR are like gitlab pages stuff and such?
<eric_engestrom>
exactly
<karolherbst>
good
<eric_engestrom>
no "and such", that's it
<eric_engestrom>
just `pages`
<karolherbst>
yeah.. 2x in capacity should get us going for a while :)
<eric_engestrom>
well, it's only _actually_ 2x capacity increase in the case of back-to-back MRs like we've had the last couple of days
<eric_engestrom>
when there's time in between MRs the difference isn't that high
<jenatali>
Which have partially been my fault (kinda) :(
<karolherbst>
eric_engestrom: well.. isn't that what capacity means?
<karolherbst>
like how many MRs we could merge at most in a day :)
<eric_engestrom>
yeah I guess
<eric_engestrom>
^^
<eric_engestrom>
jenatali: not your fault since you're not the one maintaining that runner :P
privacy has joined #freedesktop
<jenatali>
Yeah but I maintain the test pass and the code under test
<eric_engestrom>
jenatali, alatiera: https://gitlab.freedesktop.org/mesa/mesa/-/jobs/52442924 -> the windows build job started almost half an hour into the pipeline (all the other test jobs have finished) and it hasn't started actually compiling yet, after 10+ minutes
<eric_engestrom>
should we consider the windows farm offline?
<eric_engestrom>
haha, it heard me I think, it just started compiling
immibis_ is now known as immibis
<alatiera>
"Queued: 8 minutes 16 seconds" - not that bad given that we are at half the runners atm
<eric_engestrom>
that's on top of the 7+10+8 pending of the previous windows jobs in the chain :/
<alatiera>
@eric_engestrom go for it if you think it's necessary
<eric_engestrom>
ack, I'll merge it now
<alatiera>
the runner is at constant 100% cpu utilization as you'd expect
<eric_engestrom>
:(
<jenatali>
eric_engestrom: I would assign it, but I can't from my phone. When I click edit on the assignee field and the keyboard pops up, it closes the sidebar
<eric_engestrom>
yeah I have the same web ui bug on my phone
<eric_engestrom>
I usually just post a comment with `/assign @marge-bot` instead
<eric_engestrom>
but I just merged it so no need to do anything
<jenatali>
:O That's a thing? That's cool
alyssa has joined #freedesktop
<alyssa>
eric_engestrom: do we need to cancel the pipeline of the currently running MR since it won't merge now?
<alyssa>
oh beat me to it
<alyssa>
:p
mvlad has quit [Remote host closed the connection]
<eric_engestrom>
:)
blatant has joined #freedesktop
tzimmermann has quit [Quit: Leaving]
Haaninjo has joined #freedesktop
<jenatali>
alatiera: Please let me know as soon as you expect any kind of improvement, since I'd like to re-enable Windows CI as soon as it won't cause issues for folks
<jenatali>
The longer it's off, the more people make changes that break my drivers or MSVC builds :)
<karolherbst>
I think the plan was to disable post-merge pipelines, and after that I'm sure it's fine to re-enable?
damian has joined #freedesktop
<karolherbst>
what's blocking that anyway?
<alatiera>
We have disabled post-merge in gst for a couple years now, and only do a nightly schedule just to make sure
<alatiera>
and the schedule is very recent; we were fine without it
<alatiera>
jenatali: ack, I just wiped the runner and started again from scratch because I couldn't make it work, so dunno
<MrCooper>
Mesa's post-merge pipeline is mostly empty now, only the pages job if needed
<jenatali>
karolherbst: Windows jobs have been turned off (except the build I think?) for post-merge anyway
<karolherbst>
ohh that already landed
<jenatali>
The problem is the Dozen job is super CPU-intensive, and it fights with other stuff that's running on the system, and if there's only one runner it gets too busy to handle that job appropriately
<karolherbst>
I see
<MrCooper>
hmm, that sounds like a gitlab-runner misconfiguration? The same number of instances of the job can end up running concurrently regardless of which pipelines the job does (not) exist in
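The generic gitlab-runner knobs behind that point, as a sketch rather than the actual config of that runner:

    # in the runner host's /etc/gitlab-runner/config.toml:
    #   concurrent = N          global cap on jobs this gitlab-runner process executes at once
    #   [[runners]] limit = N   per-runner cap; both apply regardless of which pipeline requested the job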
mvlad has joined #freedesktop
<alyssa>
maybe dozen needs dedicated runners?
blatant has quit [Ping timeout: 480 seconds]
<eric_engestrom>
oops, sorry about gitlab being a bit unresponsive, I think that's my fault for sending a bunch of requests at once from a script
<eric_engestrom>
I immediately killed the script so it will come back soon
<eric_engestrom>
yeah looks like it's back to normal :)
blatant has joined #freedesktop
<eric_engestrom>
karolherbst:
<eric_engestrom>
> I think the plan was to disable post merge pipelines and after that I'm sure it's fine to reenable?
<DavidHeidelberg>
eric_engestrom: are you sending more, because GitLab looks pretty dead :D
<eric_engestrom>
DavidHeidelberg: it works fine for me now
<DavidHeidelberg>
lucky u
<DavidHeidelberg>
ok, good, it recovered, but a minute ago it got stuck on a loading page
<eric_engestrom>
it was unresponsive for 2-3 minutes but it's been fine for a while now
<eric_engestrom>
well, "while" = 5+ minutes
<DavidHeidelberg>
bentiss: Can we help somehow? If it would be meaningful I could provide a server, or ask at conferences whether any corp wants to put some extra $ into FDO? :)
<bentiss>
DavidHeidelberg: for the gitlab instance itself, we would welcome any extra runners, but having more nodes for the cluster would require them to be hosted in Equinix datacenters
<bentiss>
though I think if we get extra runners, we could trade some of the runners for k8s nodes
blatant has quit [Quit: WeeChat 4.1.2]
ximion has joined #freedesktop
<jenatali>
alyssa: I wish I could make that happen...
<eric_engestrom>
weren't we talking about all the money microsoft has earlier? :P
<jenatali>
It's not strictly money, it's a skillset and time for managing a machine as well
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
<daniels>
^
<airlied>
just have 10 windows runner machines, with another machine that is cycling through and reinstalling them all from scratch :-P
<jenatali>
We actually have an internal tech that sets up machines to dual-boot. You join a minimal OS environment as a runner, and one of the things you can tell it to do is to reboot and install an OS and join that as a runner
<daniels>
yeah, having something like that would be great - surprisingly fd.o people are not natural Windows admins
<daniels>
the last time I did it was NT4
<pinchartl>
the last time I had to administer a windows machine, it required a keyboard and a mouse
<jenatali>
Let me ask around a little bit and see if we can contribute more than just licenses. It'd be nice to get some proper Azure compute time on beefy machines for Windows CI
<jenatali>
I wouldn't really expect an answer before the new year though, things shut down around here come December
<airlied>
unless we can somehow tie it to OpenAI :-P
i509vcb has joined #freedesktop
<alyssa>
jenatali: then maybe dozen fraction needs to be bumped
* alyssa
shrugs
<jenatali>
Wouldn't matter. That'd decrease the time, but when the machine is under heavy load, some test results start to go missing
<alyssa>
maybe dozen shouldn't be ci'd upstream then yet
<alyssa>
(i'm sympathetic to the challenges of running ci at mesa/mesa scale, this is a major reason why asahi ci upstream is not in the cards)
<jenatali>
I dunno. Seems like even just the build jobs were having problems
alyssa has quit [Quit: alyssa]
vkareh has quit [Quit: WeeChat 4.1.1]
sima has quit [Ping timeout: 480 seconds]
Haaninjo has quit [Quit: Ex-Chat]
thelounge14738 has quit []
thelounge14738 has joined #freedesktop
privacy has quit [Quit: Leaving]
itaipu has joined #freedesktop
<jenatali>
alatiera: I think I've asked this before, but what's the config of the machines that host the Windows CI? I'm going to make a run at asking for resources from our side for Mesa (at least) and would probably want something comparable
<alatiera>
I will find the details and send them to you tomorrow