daniels changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
lsd|2 has quit [Remote host closed the connection]
alpernebbi has quit [Ping timeout: 480 seconds]
alpernebbi has joined #freedesktop
lsd|2 has joined #freedesktop
cisco87 has quit [Remote host closed the connection]
cisco87 has joined #freedesktop
damian has quit []
bilboed has quit [Ping timeout: 480 seconds]
bilboed has joined #freedesktop
itaipu has quit [Ping timeout: 480 seconds]
mripard has quit [Ping timeout: 480 seconds]
privacy has joined #freedesktop
alatiera has quit [Quit: Connection closed for inactivity]
mripard has joined #freedesktop
lsd|2 has quit [Quit: KVIrc 5.0.0 Aria http://www.kvirc.net/]
ximion has quit [Quit: Detached from the Matrix]
todi1 has joined #freedesktop
todi has quit [Ping timeout: 480 seconds]
nektro has quit [Remote host closed the connection]
nektro has joined #freedesktop
bmodem has joined #freedesktop
sima has joined #freedesktop
AbleBacon has quit [Read error: Connection reset by peer]
nektro has quit [Remote host closed the connection]
nektro has joined #freedesktop
mvlad has joined #freedesktop
___nick___ has joined #freedesktop
i509vcb has quit [Quit: Connection closed for inactivity]
nektro has quit [Remote host closed the connection]
nektro has joined #freedesktop
___nick___ has quit []
___nick___ has joined #freedesktop
___nick___ has quit []
___nick___ has joined #freedesktop
blatant has joined #freedesktop
thaller is now known as Guest9395
thaller has joined #freedesktop
privacy has quit [Quit: Leaving]
Guest9395 has quit [Ping timeout: 480 seconds]
tzimmermann has joined #freedesktop
MrBonkers has quit [Quit: The Lounge - https://thelounge.chat]
MrBonkers has joined #freedesktop
ximion has joined #freedesktop
ximion has quit [Quit: Detached from the Matrix]
bmodem has quit [Ping timeout: 480 seconds]
<eric_engestrom> bentiss: is the marge configuration somewhere public?
<eric_engestrom> jenatali: another test-dozen-deqp where all the tests start going Missing: https://gitlab.freedesktop.org/mesa/mesa/-/jobs/52430475
<eric_engestrom> I killed it and retried it, hopefully it will have time to finish for that MR
<jenatali> eric_engestrom: :( unfortunately I'm still pretty sure this is a runner problem, not a dozen problem, which really means it's out of my hands to address
<eric_engestrom> daniels: thanks! `values/marge-bot/run_marge.sh` is what I was looking for
<eric_engestrom> jenatali: yeah I think you're right that it's a runner problem; I didn't realize you didn't maintain the runner though, sorry for the pings :]
bmodem has joined #freedesktop
Inline has joined #freedesktop
<daniels> eric_engestrom: alatiera (& the GSt project generally) maintain the Windows runner(s)
<daniels> currently it's singular, hence the long queue, and gst has also been a very noisy neighbour until recently, hence the massively variable runtimes
alatiera has joined #freedesktop
<eric_engestrom> ack, thanks!
<jenatali> Microsoft does provide the licenses for Windows on the runners though :)
<eric_engestrom> (I love how you summoned them into the channel :P)
<eric_engestrom> haha jenatali
<alatiera> it's magic
<jenatali> Which I mean to say, through partnership, not sales lol
<eric_engestrom> very off-topic, but I thought MS had dropped the idea of windows licenses now?
<alatiera> I saw the matrix ping and realized my irc had dced
<jenatali> Nah, it still has to be purchased once per machine, and then that machine can upgrade forever
<eric_engestrom> ack
* eric_engestrom hasn't installed windows anywhere in... oof, I'm getting old
<jenatali> Which pays my salary so I can't really complain about this business model too much
<eric_engestrom> hehe
<pinchartl> the last time I bought a machine with windows, I had to battle for a year to get reimbursed for the OS license that I was forced to get
<pinchartl> jenatali: hopefully that didn't affect your salary :-)
<jenatali> Hah
<pinchartl> it was a loooooong time ago, before I was old and grumpy
<pinchartl> I was young and grumpy I suppose
<jenatali> If it was that long ago then it was probably before I was even here
<jenatali> Though I am coming up on 12 years here in a few months... Crazy how time flies
<karolherbst> is the label maker dead? :'(
<bentiss> karolherbst: it shouldn't
<bentiss> karolherbst: which MR?
<karolherbst> maybe I didn't wait enough
<bentiss> indeed, CrashLoopBackOff
<karolherbst> ahh
<bentiss> value: Error { kind: Io(Os { code: 24, kind: Uncategorized, message: "Too many open files" })
<bentiss> sigh...
<bentiss> ok, fixed now
<bentiss> that's the second time in ~a week that this node has that issue
<karolherbst> pain
<karolherbst> just increase the fd limits then... 🙃
<karolherbst> or is that like the hypervisor complaining?
<bentiss> there is no hypervisor here
<bentiss> I just wonder which process is consuming all of the fds
<karolherbst> I see
<emersion> the limit is per-process usually
<emersion> maybe dmesg logs that?
<bentiss> that process shouldn't use a lot, and it was actually in a boot loop, so just trying to get 1 fd
<emersion> hm, probably not
<bentiss> https://github.com/timsueberkrueb/webber/issues/62 seems to give a command to get that info
<emersion> this error usually indicates a FD leak
<emersion> inside the process
<pinchartl> I'm trying to run a job locally, with a container image from the registry. podman can run the container fine (with a naive 'podman container run -it $registry_url bash'), but the qemu I start in the container complains that it can't initialize KVM. indeed, there's no /dev/kvm in the container. I have a /dev/kvm, and qemu can use it fine on the host. what am I missing to use kvm inside the container
<pinchartl> ?
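A likely answer to pinchartl's question (an assumption, not confirmed in the channel): podman does not forward host device nodes into containers by default, so /dev/kvm has to be mapped in explicitly. A hypothetical helper sketching this:

```shell
# Hypothetical helper: map the host's /dev/kvm into the container.
# Assumes rootful podman, or a rootless setup where the invoking user
# is in the kvm group (--group-add keep-groups preserves that group).
run_with_kvm() {
  image=$1   # e.g. the $registry_url from the question above
  podman run -it --device /dev/kvm --group-add keep-groups "$image" bash
}
```

With the device passed through, qemu inside the container should be able to open /dev/kvm, subject to the device's permissions on the host.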
<bentiss> right now, on that runner the command I linked above gives me only 144 fds... so unlikely to be correct when journalctl -f still complains
<karolherbst> maybe some config being wrong?
<karolherbst> worst case, strace the bot and see what happens
<bentiss> karolherbst: it's not the bot, it's something else on the machine
<bentiss> the bot is just a side effect of not being able to get an fd
<karolherbst> could be some funky python bug, but yeah...
<karolherbst> bentiss: try `sysctl fs.file-nr`
<emersion> i don't think you'd see this error if another process was responsible
<bentiss> karolherbst: fs.file-nr = 1334409223372036854775807
<karolherbst> `lsof | wc -l` might also be worth a try
<karolherbst> bentiss: what...
<karolherbst> bentiss: it output three numbers, right?
<karolherbst> ohh wait...
tzimmermann has quit [Quit: Leaving]
<karolherbst> IRC being IRC I guess
<bentiss> fs.file-nr = 13344 0 9223372036854775807
<bentiss> yeah
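For reference, the three space-separated fields of fs.file-nr (which the IRC client collapsed above) can be read apart like this:

```shell
# fs.file-nr: allocated file handles, free allocated handles, and the
# system-wide maximum, in that order.
read -r alloc free maxfiles < /proc/sys/fs/file-nr
echo "allocated=$alloc free=$free max=$maxfiles"
```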
<karolherbst> mhh
<karolherbst> check lsof then
<bentiss> on it
<bentiss> that's a lot of "lsof: no pwd entry for UID 65535"
<karolherbst> mhhh
<bentiss> well, not just 65535
<karolherbst> could be some container stuff
<bentiss> lsof -l -> 1403280
<karolherbst> that's a lot of entries
<emersion> ls -l /proc/$(pidof …)/fd
<emersion> coudl help
<karolherbst> though on my desktop I have like 721047
<karolherbst> ` ls -l /proc/*/fd | wc` :D
<karolherbst> but yeah...
<karolherbst> could help with figuring out if something uses a lot though
<bentiss> 37927 202864 1531104
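The one-liners above can be turned into a per-process tally, which is usually the quickest way to spot a leaker (a sketch; run as root to see every process, since /proc/PID/fd of other users is unreadable otherwise):

```shell
# Count open fds per process and print the top consumers.
top_fds=$(
  for d in /proc/[0-9]*/fd; do
    printf '%s %s\n' "$(ls "$d" 2>/dev/null | wc -l)" "${d%/fd}"
  done | sort -rn | head
)
echo "$top_fds"
```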
blatant has quit [Quit: WeeChat 4.1.2]
<karolherbst> mhhh
<karolherbst> what's the fd limit of the system?
<bentiss> cat /proc/sys/fs/inotify/max_user_watches -> 1048576
<karolherbst> that's inotify tho
<karolherbst> there is `/proc/sys/fs/file-max` for file handles, but not sure if that's the same as fds
<bentiss> cat /proc/sys/fs/file-max
<bentiss> 9223372036854775807
<karolherbst> yeah and `ulimit` probably says "unlimited"
<bentiss> which seems a lot more :)
<bentiss> yep
<karolherbst> yeah, so it's unlikely you hit a global fd limit
<bentiss> if I were, I couldn't ssh to the box, no?
<karolherbst> yeah
<karolherbst> probably
<karolherbst> so yeah.. either a python bug or a bug in the bot is most likely here
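To summarize the ceilings being ruled out here (a sketch): the system-wide handle cap, the per-process fd limit, and the per-user inotify caps are all separate knobs, and EMFILE can come from any of them depending on the syscall.

```shell
sys_max=$(cat /proc/sys/fs/file-max)                    # system-wide
proc_max=$(ulimit -n)                                   # per-process (soft)
ino_inst=$(cat /proc/sys/fs/inotify/max_user_instances) # per-user instances
ino_watch=$(cat /proc/sys/fs/inotify/max_user_watches)  # per-user watches
echo "file-max=$sys_max nofile=$proc_max inotify=$ino_inst/$ino_watch"
```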
tzimmermann has joined #freedesktop
<bentiss> well, there are a lot of processes with a lot of opened sockets
<bentiss> not sure if they count in the inotify
<bentiss> I wonder if I should not just bump the max_user_watches
<bentiss> https://github.com/mikesart/inotify-info seems interesting :)
<karolherbst> bentiss: inotify is an API to track changes to filesystems, it's not related to any fd limits
<bentiss> so not inotify related?
<karolherbst> no
<karolherbst> max_user_watches is just the limit of how many watchers on events there can be in total
<karolherbst> however
<bentiss> k, and the tool above returns Total inotify Watches: 5559
<bentiss> Total inotify Instances: 238
<karolherbst> I don't know what python does which triggers the value: Error { kind: Io(Os { code: 24, kind: Uncategorized, message: "Too many open files" })
<bentiss> it's rust, not python
<karolherbst> ahhh
<karolherbst> then it's rust
<karolherbst> I'd check with strace and see what fails
<bentiss> well, journalctl -f also fails
<karolherbst> mhhh...
<karolherbst> there is definitely something funky going on
<bentiss> that was my point: the bot is gone, and I still have issues on the server
<karolherbst> yeah...
<karolherbst> check strace then
<karolherbst> this error can mean anything at this point
<bentiss> strace on what?
<karolherbst> whatever fails
<karolherbst> so maybe journalctl -f?
<bentiss> inotify_init1(IN_NONBLOCK|IN_CLOEXEC) = -1 EMFILE (Too many open files)
<karolherbst> uhhh... okay, so it is inotify related after all...
<bentiss> don't know why hookiedookie would use inotify though
<karolherbst> yeah.. me neither
<bentiss> it's a webserver, so a socket would do
<karolherbst> ohh
<karolherbst> maybe just listen for files changing on disk
<karolherbst> to update the RAM cache or something
<karolherbst> or regenerate files or whatever
<bentiss> my muscle memory started typing "ps -aef"... and ps -aef | grep gpg | grep defunct | wc -l
<karolherbst> mhhh
<bentiss> 6117
<karolherbst> I wonder which of the inotify limits you are hitting.. let's see...
<bentiss> I wonder why I have so many gpg defunct processes
<karolherbst> "The user limit on the total number of inotify instances has been reached. "
<karolherbst> mhhh
<karolherbst> that's EMFILE
<bentiss> yeah, that matches the numbers I gave earlier
<karolherbst> sysctl fs.inotify
<karolherbst> `fs.inotify.max_user_instances = 128` I guess?
<bentiss> oh, I know why I use inotify in hookiedookie: it looks for changes in the Settings.tmpl file to reload itself in case there is a change
<karolherbst> try bumping that
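Each fd returned by inotify_init counts against fs.inotify.max_user_instances, so the holders can be tallied from /proc (a sketch; as non-root you only see your own processes):

```shell
# Every fd whose link target is anon_inode:inotify is one instance.
limit=$(cat /proc/sys/fs/inotify/max_user_instances)
instances=$(find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null \
  | cut -d/ -f3 | sort | uniq -c | sort -rn)
echo "limit=$limit"
echo "${instances:-no instances visible}"
```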
<bentiss> karolherbst: that helped for journalctl -f > no more errors
<karolherbst> cool
<bentiss> what would be a reasonable value?
<bentiss> 256? 512? 1024?
<karolherbst> maybe try 1024 and see if that's enough?
<bentiss> k, I'll bump it on all of the nodes
<karolherbst> I don't know what's reasonable here but 128 is apparently not enough :)
<bentiss> I just need to remember how to make that setting persistent
<karolherbst> /etc/sysctl/
<karolherbst> uhm..
<karolherbst> sysctl.d/
<bentiss> /etc/sysctl.conf and sysctl.d
<bentiss> yeah
<bentiss> k, bumped on every node, we'll see if that fails once again
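For the record, the persistent form is a sysctl drop-in, as worked out above. A sketch (written to a temp dir here so it runs unprivileged; on a real node it goes under /etc/sysctl.d/ and is applied with `sysctl --system` as root):

```shell
DEST=${DEST:-/tmp}   # /etc/sysctl.d on a real node
printf 'fs.inotify.max_user_instances = 1024\n' > "$DEST/90-inotify.conf"
# as root on the node:  sysctl --system   (re-applies every drop-in)
cat "$DEST/90-inotify.conf"
```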
<karolherbst> hopefully it won't :)
vkareh has joined #freedesktop
<bentiss> eric_engestrom: I don't understand your comment at https://gitlab.freedesktop.org/freedesktop/fdo-bots/-/merge_requests/15#note_2195867
<bentiss> well, I think I do
<bentiss> let me reply
bmodem has quit [Ping timeout: 480 seconds]
<DavidHeidelberg> jenatali: is this happening often or it's rare that dozen job takes ~ 45 min https://gitlab.freedesktop.org/mesa/mesa/-/jobs/52436612 ?
<jenatali> DavidHeidelberg: See discussion from 1hr ago, there's runner overload that's being addressed
<jenatali> When overload isn't happening, it's closer to 15min
<karolherbst> sadly the job continues to run if marge times out, so it would help to kill the jobs on the runner
<karolherbst> but I also kinda wished marge would do that when moving to the next MR
<karolherbst> like kill all still running jobs on the last MR if marge decides to move on
<DavidHeidelberg> kk thx
<bentiss> karolherbst: not sure if you saw, but now if you push changes to main of https://gitlab.freedesktop.org/freedesktop/marge-bot this will get automatically deployed, so it's just a MR away
<karolherbst> I wished I had time this year for working on marge :')
<bentiss> though we probably want to keep it close to upstream
<bentiss> the marge we are using is 7 months old, not sure what happened upstream since
<karolherbst> ~~maybe once I'm on PTO~~
<karolherbst> fair...
<bentiss> PTO is for resting, not marging
<karolherbst> I know
<karolherbst> and I'm on PTO next week
<bentiss> nice
<karolherbst> sooo...
<karolherbst> not sure I'll find time to work on marge. however if anybody feels like working on marge, I think implementing what I mentioned would help tremendously in situations like this
<bentiss> karolherbst: anyway, thanks heaps for the help with the max_user_thingy
<karolherbst> no problem. remember to use strace for cases like this as it is a power tool :D
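A minimal strace pattern for this kind of hunt (assuming strace is installed; substitute the failing command, e.g. `journalctl -f`, for the short-lived `ls` used here):

```shell
if command -v strace >/dev/null 2>&1; then
  # Trace only fd-producing syscalls; -f follows children.
  strace -f -e trace=openat,inotify_init1 -o /tmp/fd_trace.log ls /dev/null
  grep -m1 openat /tmp/fd_trace.log
else
  echo "strace not installed"
fi
```

The `-e trace=` filter keeps the log small enough to read the first failing call (EMFILE, here) instead of scrolling through every syscall.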
<bentiss> heh, I'll do :)
blatant has joined #freedesktop
<eric_engestrom> karolherbst, bentiss: I added `--cancel-pipeline-on-timeout` to Marge: https://gitlab.com/marge-org/marge-bot/-/merge_requests/411 :)
<eric_engestrom> not tested yet though, and not really reviewed either
<karolherbst> cool!
<eric_engestrom> hopefully when I'm back in january I can test it and get upstream to merge it
<eric_engestrom> I also have a branch based on that one that adds `--job-failure={warn,abort,ignore}` so that we can either cancel the whole thing when a job fails, or at least post a message in the MR so that users can retry asap
<eric_engestrom> a future improvement will be to add `--job-failure-delay-abort N` to give some time between the job failure and the abort, eg. 10min, so that we don't waste too many resources but also give a chance to retry
<karolherbst> I think my idea was to only do that once marge actually moves to the next MR
<eric_engestrom> yeah that's what the MR I posted does
<eric_engestrom> but I was talking about other things we can also do
<karolherbst> right..
<karolherbst> yeah.. I don't really know, I think reducing the load on the CI system is more important than having jobs continue to run for a while which might pass anyway
<eric_engestrom> yeah, "reducing ci load" has been my focus for the last 2-3 months
<eric_engestrom> this morning I merged the MR that stops mesa from re-running all the tests right after merging (since we just ran them to get to the point where the MR is merged)
<karolherbst> yeah.. that should help a lot
<karolherbst> it would make sense to do that if we didn't rebase, but since we do...
<eric_engestrom> hopefully with that 2x waste gone, the 50% reduction in ci load will result in a noticeable improvement in marge pipeline times
<karolherbst> yeah
<karolherbst> so the only jobs which run post MR are like gitlab pages stuff and such?
<eric_engestrom> exactly
<karolherbst> good
<eric_engestrom> no "and such", that's it
<eric_engestrom> just `pages`
<karolherbst> yeah.. 2x in capacity should get us going for a while :)
<eric_engestrom> well, it's only _actually_ 2x capacity increase in the case of back-to-back MRs like we've had the last couple of days
<eric_engestrom> when there's time in between MRs the difference isn't that high
<jenatali> Which have partially been my fault (kinda) :(
<karolherbst> eric_engestrom: well.. isn't that what capacity means?
<karolherbst> like how many MRs we could merge at most in a day :)
<eric_engestrom> yeah I guess
<eric_engestrom> ^^
<eric_engestrom> jenatali: not your fault since you're not the one maintaining that runner :P
privacy has joined #freedesktop
<jenatali> Yeah but I maintain the test pass and the code under test
lsd|2 has joined #freedesktop
AbleBacon has joined #freedesktop
blatant has quit [Ping timeout: 480 seconds]
bilboed has quit [Quit: The Lounge - https://thelounge.chat]
immibis_ is now known as immibis
immibis is now known as immibis_
<eric_engestrom> jenatali, alatiera: https://gitlab.freedesktop.org/mesa/mesa/-/jobs/52442924 -> the windows build job started almost half an hour into the pipeline (all the other test jobs have finished) and it hasn't started actually compiling yet, after 10+ minutes
<eric_engestrom> should we consider the windows farm offline?
<eric_engestrom> haha, it heard me I think, it just started compiling
immibis_ is now known as immibis
<alatiera> Queued: 8 minutes 16 seconds: Not that bad given that we are at half the runners atm
<eric_engestrom> that's on top of the 7+10+8 pending of the previous windows jobs in the chain :/
<eric_engestrom> (minutes)
jarthur has joined #freedesktop
<eric_engestrom> jenatali, alatiera: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26549 I know it's not ideal, but I think it's necessary :(
<eric_engestrom> I have to go for a bit, jenatali I'll let you assign it to Marge if you approve
<jenatali> I don't like it but yeah the single runner just can't handle gst and Mesa
<eric_engestrom> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26360 test-dozen-deqp hasn't even started yet and there's 12 minutes left :(
<alatiera> @eric_engestrom go for it if you think it's necessary
<eric_engestrom> ack, I'll merge it now
<alatiera> the runner is at constant 100% cpu utilization as you'd expect
<eric_engestrom> :(
<jenatali> eric_engestrom: I would assign it, but I can't from my phone. When I click edit on the assignee field and the keyboard pops up, it closes the sidebar
<eric_engestrom> yeah I have the same web ui bug on my phone
<eric_engestrom> I usually just post a comment with `/assign @marge-bot` instead
<eric_engestrom> but I just merged it so no need to do anything
<jenatali> :O That's a thing? That's cool
alyssa has joined #freedesktop
<alyssa> eric_engestrom: do we need to cancel pipeline of currently running MR since it won't merge now?
<alyssa> oh beat me to it
<alyssa> :p
mvlad has quit [Remote host closed the connection]
<eric_engestrom> :)
blatant has joined #freedesktop
tzimmermann has quit [Quit: Leaving]
Haaninjo has joined #freedesktop
<jenatali> alatiera: Please let me know as soon as you expect any kind of improvement, since I'd like to re-enable Windows CI as soon as it won't cause issues for folks
<jenatali> The longer it's off, the more people make changes that break my drivers or MSVC builds :)
<karolherbst> I think the plan was to disable post merge pipelines and after that I'm sure it's fine to reenable?
damian has joined #freedesktop
<karolherbst> what's blocking that anyway?
<alatiera> We have disabled post-merge in gst for a couple years now, and only do a nightly schedule just to make sure
<alatiera> and the schedule is very recent we were fine without it
<alatiera> jenatali: ack, currently wiped the runner and started again from scratch cause I couldn't make it work, so dunno
<MrCooper> Mesa's post-merge pipeline is mostly empty now, only the pages job if needed
<jenatali> karolherbst: Windows jobs have been turned off (except the build I think?) for post-merge anyway
<karolherbst> ohh that already landed
<jenatali> The problem is the Dozen job is super CPU-intensive, and it fights with other stuff that's running on the system, and if there's only one runner it gets too busy to handle that job appropriately
<karolherbst> I see
<MrCooper> hmm, that sounds like a gitlab-runner misconfiguration? The same number of instances of the job can end up run concurrently regardless of which pipelines the job does (not) exist in
mvlad has joined #freedesktop
<alyssa> maybe dozen needs dedicated runners?
blatant has quit [Ping timeout: 480 seconds]
<eric_engestrom> oops, sorry about gitlab being a bit unresponsive, I think that's my fault for sending a bunch of requests at once from a script
<eric_engestrom> I immediately killed the script so it will come back soon
<eric_engestrom> yeah looks like it's back to normal :)
blatant has joined #freedesktop
<eric_engestrom> karolherbst:
<eric_engestrom> > I think the plan was to disable post merge pipelines and after that I'm sure it's fine to reenable?
<eric_engestrom> > what's blocking that anyway?
<eric_engestrom> if you're talking about mesa, https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26451 was merged this morning :)
___nick___ has quit [Ping timeout: 480 seconds]
<DavidHeidelberg> eric_engestrom: are you sending more? because GL looks pretty dead :D
<eric_engestrom> DavidHeidelberg: it works fine for me now
<DavidHeidelberg> lucky u
<DavidHeidelberg> ok, good recovered, but minute ago it got stuck on loading page
<eric_engestrom> it was unresponsive for 2-3 minutes but it's been fine for a while now
<eric_engestrom> well, "while" = 5+ minutes
<DavidHeidelberg> bentiss: Can we help somehow? If it would be meaningful I could provide a server, or ask at conferences if any corp wants to put some extra $ into FDO? :)
<bentiss> DavidHeidelberg: for the gitlab instance itself, we would welcome any extra runner, but having more nodes for the cluster would require them to be hosted in Equinix datacenters
<bentiss> though I think if we get extra runners, we could trade some of the runners for k8s nodes
blatant has quit [Quit: WeeChat 4.1.2]
ximion has joined #freedesktop
<jenatali> alyssa: I wish I could make that happen...
<eric_engestrom> weren't we talking about all the money microsoft has earlier? :P
<jenatali> It's not strictly money, it's a skillset and time for managing a machine as well
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
<daniels> ^
<airlied> just have 10 windows runner machines, with another machine that is cycling through and reinstalling them all from scratch :-P
<jenatali> We actually have an internal tech that sets up machines to dual-boot. You join a minimal OS environment as a runner, and one of the things you can tell it to do is to reboot and install an OS and join that as a runner
<daniels> yeah, having something like that would be great - surprisingly fd.o people are not natural Windows admins
<daniels> the last time I did it was NT4
<pinchartl> the last time I had to administer a windows machine, it required a keyboard and a mouse
<jenatali> Let me ask around a little bit and see if we can contribute more than just licenses. It'd be nice to get some proper Azure compute time on beefy machines for Windows CI
<jenatali> I wouldn't really expect an answer before the new year though, things shut down around here come December
<airlied> unless we can somehow tie it to OpenAI :-P
i509vcb has joined #freedesktop
<alyssa> jenatali: then maybe dozen fraction needs to be bumped
* alyssa shrugs
<jenatali> Wouldn't matter. That'd decrease the time, but when the machine is under heavy load, some test results start to go missing
<alyssa> maybe dozen shouldn't be ci'd upstream then yet
<alyssa> (i'm sympathetic to the challenges of running ci at mesa/mesa scale, this is a major reason why asahi ci upstream is not in the cards)
<jenatali> I dunno. Seems like even just the build jobs were having problems
alyssa has quit [Quit: alyssa]
vkareh has quit [Quit: WeeChat 4.1.1]
sima has quit [Ping timeout: 480 seconds]
Haaninjo has quit [Quit: Ex-Chat]
thelounge14738 has quit []
thelounge14738 has joined #freedesktop
privacy has quit [Quit: Leaving]
itaipu has joined #freedesktop
<jenatali> alatiera: I think I've asked this before, but what's the config of the machines that host the Windows CI? I'm going to make a run at asking for resources from our side for Mesa (at least) and would probably want something comparable
<alatiera> I will find and send you tmr the details
<jenatali> Thanks!