ChanServ changed the topic of #freedesktop to: https://www.freedesktop.org infrastructure and online services || for questions about freedesktop.org projects, please see each project's contact || for discussions about specifications, please use https://gitlab.freedesktop.org/xdg or xdg@lists.freedesktop.org
anholt has joined #freedesktop
danvet has quit [Ping timeout: 480 seconds]
co1umbarius has joined #freedesktop
ximion has quit [Quit: Detached from the Matrix]
columbarius has quit [Ping timeout: 480 seconds]
ximion has joined #freedesktop
Leopold_ has quit [Remote host closed the connection]
Leopold_ has joined #freedesktop
systwi has joined #freedesktop
systwi_ has quit [Ping timeout: 480 seconds]
ximion has quit [Quit: Detached from the Matrix]
chip_x has joined #freedesktop
chipxxx has quit [Ping timeout: 480 seconds]
karolherbst has quit [Read error: Connection reset by peer]
karolherbst has joined #freedesktop
agd5f has joined #freedesktop
agd5f_ has quit [Ping timeout: 480 seconds]
agd5f_ has joined #freedesktop
agd5f has quit [Ping timeout: 480 seconds]
<bentiss>
sergi: mind if I stop all of your scheduled pipelines in https://gitlab.freedesktop.org/gfx-ci-bot/mesa? You are basically DoSing all the farms by running 1 mesa pipeline every 10 minutes when it takes ~50 min to run
<bentiss>
actually I'm not even asking
<bentiss>
sergi: and on top of that: ERROR: Job failed: failed to pull image "alpine:latest" with specified policies [always]: Error response from daemon: toomanyrequests: You have reached your pull rate limit.
thaller has joined #freedesktop
<bentiss>
so not content with DoSing the farms, you even locked us out of pulling from docker.io by adding a ton of pulls :(
thaller has quit [Quit: Leaving]
thaller has joined #freedesktop
danvet has joined #freedesktop
<sergi>
Hi bentiss, I'm not DoSing but diagnosing the farms. Or at least that was the purpose. Last week I prepared a pipeline that launches one job per runner tag, and each job only does an echo. This was made to see how the scheduler works, because the team has been seeing issues with how jobs are picked from the queues, or how they are enqueued
<bentiss>
sergi: yes, I understand the rationale behind it, but this is DoSing the farms
<bentiss>
we had ~700 pending jobs, ~650 were yours
<bentiss>
and so the regular mesa pipelines can't run anymore
<sergi>
I see. The initial idea was to schedule it every 2 minutes, but I scheduled it every 10 thinking that 2 might be too much. When even 10 was already too much
<bentiss>
your idea kind of works when the farms are alive and not used, but as soon as they are heavily used, you are creating more jobs than they can handle, and if a farm goes down, you still have hundreds of pending jobs on that farm that can't be executed
<sergi>
It seems the attempt to help with diagnosis has even contributed to the problem
<bentiss>
sergi: honestly, once per hour would be ok-ish, but not sure it'll help
<bentiss>
sergi: the solution would be to have either a dedicated runner per farm that can handle those jobs out of the capacity of the other runners, or have a dedicated machine in each farm that can ping/report values of the runners
<bentiss>
but it's school dropoff time here, I'll be back later
<sergi>
thanks for managing that and sorry for contributing to the issue.
<sergi>
see you later
<nirbheek_>
Hugs to all admins today, thank you for handling the runner issues over the weekend <3
<bentiss>
sigh... so we've got one user claiming he's going to work on mesa and camera, but he used his brand new privileges for android test builds... :(
<bentiss>
daniels: looks like JaswantTeja escaped the sandbox
<bentiss>
that guy asked for privileges 2 days ago, and created 3 projects he has since removed: ohing.git, prawn.git and test.git
* mupuf
is happy he asked users to specify their intent
<bentiss>
FWIW, JaswantTeja also requested access
<bentiss>
I wonder if we should force the issue to be public
<nirbheek_>
> I can't create new projects or fork an existing one. I am not a spammer and I want to contribute, so please add me to the list of internal users!
<nirbheek_>
What an ass
<bentiss>
nirbheek_: he is following a template we give them
<nirbheek_>
Ahh, I see, that text is from the template
<nirbheek_>
His description was `As a linux engineer, i need to work on it & contribute for the community of freedesktop. So, it will be a great opportunity if I get the access. Thanks in Advance!`
<bentiss>
nirbheek_: the "As a linux engineer, i need to work on it & contribute for the community of freedesktop. So, it will be a great opportunity if I get the access. Thanks in Advance!" is fully from him
<nirbheek_>
Should we require new users to be either vouched by an existing user or prove identification in some other way?
<bentiss>
nirbheek_: the initial idea was to discriminate bots/humans
<nirbheek_>
Ah, I see
<bentiss>
but I think I'm going to lock down the runners even if it means that virglrenderer pipelines will fail
<bentiss>
cause escaping the sandbox is *really* worrying
<bentiss>
the thing though is the arm runners don't have a recent enough podman version :(
<daniels>
MrCooper: the username does match the user who filed the GitHub PR, and it's an @qq.com which matches China rather than our current friends, so I'd be inclined to say it's OK
<daniels>
bentiss: so back to the scheduled pipeline, one thing we didn't realise is that 'auto-cancel redundant pipelines' seemingly doesn't apply to scheduled pipelines
<daniels>
there's definitely a bug in GitLab itself where it just doesn't schedule jobs on tagged runners sometimes, no matter how much free capacity there is
<daniels>
the only way to get the job scheduled is to schedule another job for the same runner, which bumps the check timestamp the runner passes, and forces it to see that there are more jobs available
<daniels>
you can see this when you have jobs that have been queued for 30+min, then you cancel it and retry it and it gets processed instantly
<daniels>
I've seen on Rails that if you query the job queue for that runner (i.e. executing exactly what the runner endpoint would've done), that there is a job there queued for it - but it never checks the queue because the runner passes a last-check token which is the most current one, so it skips checking the queue and just tells the runner there are no more jobs available
<daniels>
the jobs in sergi's pipeline are dummy jobs - they just echo and return immediately, taking no time at all - they're only there to pump the job queue
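A minimal sketch of what one of those queue-pumping dummy jobs could look like in .gitlab-ci.yml; the job name and runner tag are hypothetical examples, not the actual gfx-ci-bot configuration:

    # one job per runner tag, restricted to the scheduled pipeline,
    # doing nothing but waking up that runner's job queue
    poke-farm-example:
      tags:
        - example-farm-tag        # hypothetical tag; the real pipeline emits one job per tag
      rules:
        - if: '$CI_PIPELINE_SOURCE == "schedule"'
      script:
        - echo "pump the queue"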
<alatiera>
lol
<zmike>
is swrast ci operational?
<sergi>
bentiss and daniels, yes my jobs were dummies, but I hadn't thought about the effect on the "toomanyrequests" limit from docker.io
<bentiss>
daniels, sergi: sorry, I was worried this morning about the DoS, and took immediate action. But again, I'm not against such a thing, and if that's the only way of solving the issue, then yes, we should go for it
<daniels>
zmike: sadly no, waiting for anholt to get back so I can figure it out with them
<zmike>
alrighty
<daniels>
bentiss: yeah, no problem :) it is the dumbest possible solution for sure, but I can't see much else atm. we can rework it so it pulls one of the ci-templates from harbor at least, which should be a no-op on all the runners as it'll always be cached?
<bentiss>
It's just that it was one more thing to add to the pile of crap we've had since the beginning of the year, between spammers and hackers... And I would expect a minimum of monitoring for these jobs
<bentiss>
yeah, using harbor is fine
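A hedged sketch of that rework: the dummy job pulls a ci-templates-built image through the local harbor registry instead of docker.io, so the pull is served from cache and doesn't count against the Docker Hub rate limit. The registry host and image path below are placeholders, not the real addresses:

    poke-farm-example:
      # placeholder image reference; substitute the actual harbor host/project
      image: harbor.example.org/ci-templates/fedora:latest
      script:
        - echo "pump the queue"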
<daniels>
bentiss: yeah, I asked sergi to put it in place to solve one of the biggest fires we've had in Mesa, since we run enough jobs (on isolated hw-specific runners which don't see new jobs in the queue until the next MR ...) that it triggers fairly regularly - given that he has children at home and I don't, the monitoring was that I keep an eye on it, but I was somewhat distracted as you know :P
<daniels>
sorry about that
<bentiss>
no worries
<bentiss>
BTW, the one thing I also disliked was that gfx-ci-bot is actually a real account, linked to sergi's email
<daniels>
bentiss: you'd prefer it was a project bot account?
<bentiss>
that's not a good way of doing things. We should have used a proper bot
<daniels>
mmm, we originally did that, but gfx-ci-bot does quite a few things, including needing to be able to push to multiple repos
<bentiss>
yeah, because if anything happens to sergi or if he loses his email credentials we are screwed
<daniels>
tbf the credentials are in Collabora's BitWarden, but yeah, I get the point
<bentiss>
right...
<daniels>
should we just make the email gfx-ci-bot@no.invalid?
<bentiss>
maybe we can find a better middle ground then
<bentiss>
@collabora.com would be better, no?
<bentiss>
because I'm sure that if I see such an email on a user account that is DoSing us, I'll just nuke it :)
<bentiss>
gfx-ci-bot-collabora@no.invalid if you prefer
<daniels>
but yeah, for example one of the things gfx-ci-bot does is, if there have been changes in piglit/virglrenderer, then it'll push to some repos to build a testing MR of Mesa using updated versions of those two projects, then raise an MR on Mesa to bump the dependencies - we can't do that with a project token
<daniels>
bentiss: yeah, that's a good idea
<bentiss>
daniels: I just spent the past 2 hours discussing with whot about my gitlab-validate-users project I mentioned
<bentiss>
and this rust part is becoming a generic webhook facility for gitlab
<sergi>
I thought the use of my email address in the bot would help identify me in case something went wrong
<bentiss>
basically, you could register that webhook in piglit/virglrenderer for code push, and if it gets a change, it'll run any python script you want with the proper credentials
<sergi>
And, it's long ago now, but I preferred not to use my own user, because this uprev project is a team tool and shouldn't be a single-person project
<daniels>
sergi: yeah, and you can't register with an invalid email so that's fair enough - I've made it go to one which contains the contact details but also ends in .fdo.invalid, so we'll never try to route e.g. a password-reset email
<daniels>
bentiss: interesting - wouldn't that require something like Vault to get a token which could then create a MR on Mesa though?
<bentiss>
sergi: yes, and it's valuable, but from where I stood this morning it was: "oh, there is a user named gfx-ci-bot DoSing the farms, it must be legit. Well, wait a minute, that user is an actual user, so is it a hacker? let's just nuke it, talk later"
<sergi>
bentiss, it's a newbie XD
<daniels>
haha
<sergi>
I mean, I'm the newbie
<bentiss>
daniels: right now I intend to have the token in the yaml config, stored in an internal tree (or just in kubernetes), so one token per hookiedookie instance
<daniels>
well, one thing to do at least is to nuke all the invalid tags which will never get scheduled
<bentiss>
(we plan on renaming it hookiedookie)
<daniels>
bentiss: gotcha, nice
<sergi>
I understand how this morning it was impossible to distinguish whether this behaviour was for good or for bad. But for sure, it has been disturbing the infra
<bentiss>
and as we were talking this morning, I also realized I could watch for the pushes in the config repo, and automatically reload it through a webhook :)
<daniels>
bentiss: oh that's cute, and might resolve my current wondering of how to auto-update various bots and tasks we have running in k8s
<daniels>
(e.g. it would be nice if the marge-bot update process was something more nuanced than 'ask daniels to change the tag')
<bentiss>
sergi: again, don't blame yourself, we all make mistakes. The fact that I recognized the name prevented me from nuking the account, so don't worry.
<bentiss>
daniels: I know it would be useful to others :)
<bentiss>
whot: told you ^^ :)
* bentiss
needs to grab some lunch, bbl
<alatiera>
bentiss, daniels: another thing I've wanted to do for a while is to default images created with ci-templates to run as non-root users
<daniels>
alatiera: fr fr
<daniels>
I'd need to check how that works with nested KVM, since we depend on that heavily for both Mesa and Weston
<alatiera>
last I looked at it there was some buildah issue
<daniels>
but yeah, if we can make it work then I'd be completely on board with enforcing unprivileged across the board
<daniels>
(& will keep looking into how to make Kata fly today, had some v bad timing with some other work stuff which has dragged me away but trying to come back to it)
<alatiera>
np np
<alatiera>
my plan is to wipe one runner and set it up as a group runner, so at least MRs can start working again
<alatiera>
which should buy us some time to figure out how to lock down the rest
<daniels>
cool cool, I'll let you know how I get on with Kata - you should be able to use the cloud-init stuff to provision new htz runners as well
<alatiera>
yea
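A hedged cloud-init user-data sketch for bringing up a fresh Hetzner runner VM; the package names, token, and executor settings are placeholders and assume the gitlab-runner package repository is already configured in the image:

    #cloud-config
    packages:
      - docker.io
      - gitlab-runner
    runcmd:
      # register against fd.o; the token is a placeholder, keep the real one in a secret store
      - gitlab-runner register --non-interactive
        --url https://gitlab.freedesktop.org
        --registration-token "REDACTED"
        --executor docker
        --docker-image alpine:latest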
MajorBiscuit has joined #freedesktop
vkareh has joined #freedesktop
agd5f has joined #freedesktop
agd5f_ has quit [Ping timeout: 480 seconds]
agd5f_ has joined #freedesktop
vbenes has quit [Remote host closed the connection]
agd5f has quit [Ping timeout: 480 seconds]
vbenes has joined #freedesktop
Leopold_ has quit [Remote host closed the connection]
vbenes has quit [Quit: Leaving.]
vbenes has joined #freedesktop
Leopold has joined #freedesktop
<MrCooper>
ugh, now ccache is hanging in F36 containers as well
<daniels>
MrCooper: job?
<daniels>
if it's still running I can SSH in and stare at it
<MrCooper>
I was hitting this when I tried upgrading to F37 first, that's why I settled for F36 for now
<MrCooper>
looks like ccache has been upgraded from 4.5.1 to 4.7.4 in F36
<MrCooper>
which is the same version as in F37
mohamexiety has joined #freedesktop
<mohamexiety>
daniels: hey again! not sure if you remember but a few days ago I came with a weird issue where I was having a really really bad connection to the fdo gitlab. you mentioned that maybe there was some node in the middle that was having issues or so and suggested looking for alternative routing
<mohamexiety>
well I tried cloudflare WARP and now everything works well, so thanks a lot again!
<MrCooper>
can't reproduce locally in podman either
<MrCooper>
would dropping ccache from the fedora image be acceptable for now?
<daniels>
MrCooper: yeah, just do that for now, it's not on the critical path to hardware testing anyway
<daniels>
MrCooper: thanks for looking into it :)
<MrCooper>
no worries, thanks
<Wallbraker>
Seems like something happened to some of the Windows runners on the CI, got a job stuck because there are no runners for the tags: docker, windows, 2022.
<daniels>
Wallbraker: yeah they're gone for now, awaiting a rebuild
<Wallbraker>
Hours, days, weeks?
<Wallbraker>
Thanks for the info!
<daniels>
absolutely no clue I'm afraid
agd5f_ has quit []
<daniels>
but weeks would be surprising
agd5f has joined #freedesktop
<Wallbraker>
Heh, I was being pessimistic. :p
<Wallbraker>
Okay, I'll give it a few hours and disable the windows build if things don't improve by tomorrow.
<eric_engestrom>
thanks to DavidHeidelberg[m]'s `retry: 1`, anything that fails gets auto-retried once by gitlab
<MrCooper>
sigh, thanks
<daniels>
(that's backed up by a bunch of automated monitoring we have that e.g. tells you which specific dEQP cases flaked when you retried a job and it succeeded; we're not just blindly retrying for no reason)
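For reference, that retry knob is plain GitLab CI YAML; applied as a pipeline default it looks roughly like this:

    # retry every job once on failure before reporting it as failed
    default:
      retry: 1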
<alatiera>
Wallbraker unlikely they will be up by tomorrow
<Wallbraker>
alatiera: Okay, thanks for letting me know. Any reason for the outage?
<alatiera>
the runners are fine-ish from the look of it, unlike the linux ones, but I wanna wait for the idiot poking at our runners to lose interest before enabling them again
<alatiera>
I guess could enable one of them since I am gonna rebuild them eventually anyway
<alatiera>
or rather rollback the snapshot
<Wallbraker>
Somebody hacking/DoSs the runners?
<Wallbraker>
Also that would be great.
<daniels>
Wallbraker: yep
<alatiera>
yea indeed, lets enable one of them and see how it goes
<Wallbraker>
Sigh
<Wallbraker>
Thanks for that!
craftyguy has quit [Remote host closed the connection]
craftyguy has joined #freedesktop
craftyguy has quit []
<alatiera>
Wallbraker, daniels one of the windows runner is up
<alatiera>
will try to keep an eye on usage
<alatiera>
and will revert the vm to the snapshot if anything goes bad
craftyguy has joined #freedesktop
<Wallbraker>
Our job succeeded, thanks!
jarthur has joined #freedesktop
<anholt>
daniels: so, what's the summary from the weekend? I've still got mesa-swrast turned off.
Leopold has quit [Remote host closed the connection]
genpaku has quit [Read error: Connection reset by peer]
Leopold_ has joined #freedesktop
genpaku has joined #freedesktop
Leopold_ has quit [Remote host closed the connection]
Leopold_ has joined #freedesktop
rcampbell has joined #freedesktop
rcampbell has quit []
rcampbell has joined #freedesktop
rcampbell has quit [Remote host closed the connection]
rcampbell has joined #freedesktop
<bentiss>
anholt: basically we have been hacked, we kicked them out and we set everything back *with* a longer todo list to make runners more secure, which will be a pain to everybody
<anholt>
bentiss: so, containers escaped?
<bentiss>
yep
<anholt>
that's what I was seeing on mesa-swrast
<bentiss>
anholt: nirbheek managed to talk to them, and some are just kids (16), but "the brain" is like 20 and has no remorse
<bentiss>
so when you deal with that kind of ass, you don't have much choice
<daniels>
anholt: containers escaped, found out exactly who was doing it, banned them, know what to look for in future
<daniels>
anholt: I'm working on Kata but I very likely won't have that done today
<anholt>
ok, but we don't have a way to prevent the container escape?
<daniels>
^
* anholt
searches for kata, finds out
<daniels>
anholt: 'what if every container was also an ephemeral VM'
<anholt>
I mean, it seems like the obvious sensible thing
<daniels>
yeah, obvious, sensible, also comes with a number of hazards to do it properly
<anholt>
I'm shocked.
<daniels>
anyway, I'm going to figure that out and provide a cloud-init patch, hopefully tomorrow but might end up being more like Wednesday
<anholt>
well, now that I've figured out how to cloud-init, I can at least follow along I guess :)
<daniels>
itmt if you provision new ones, I'm staring at our current ones looking for any kind of escape and not seeing it, and we know which new users to look for
<daniels>
and then it's easy to either reprovision with kata, or if you have some way of sharing GCE access then I can do it
<anholt>
we still don't have any plans for how we get fd.o to be less 503-happy, do we?
mohamexiety has quit []
<bentiss>
anholt: we might be able to simply spin up more pods to handle the workloads, I just never done that with rados gateway
<daniels>
bentiss: could you please do that? it's pretty frequent, and we do seem to get long 502/503/504 blackouts - like 5min+ at a time
<bentiss>
daniels: honestly not today :(
<mupuf>
anyone having a clue what may have landed in mesa that would cause the valve vkcts jobs to spew "ERROR - dEQP error: error: XDG_RUNTIME_DIR not set in the environment." in a loop?
<bentiss>
I can try to work on that tomorrow
AbleBacon has joined #freedesktop
* mupuf
cancelled the marge job and will investigate that tomorrow
<daniels>
bentiss: yeah, no prob at all :)
<daniels>
mupuf: presumably having Wayland at least available as the window system, and dEQP trying to use that
<daniels>
mupuf: I suspect the answer is to set HWCI_START_WESTON in your job environment
<mupuf>
thx!
<daniels>
either that or just skip all the Wayland WSI tests
<daniels>
np
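A hedged sketch of the job-level fix for those XDG_RUNTIME_DIR errors; the job name and variable value are assumptions, and the alternative is to add the Wayland WSI cases to the job's skip list instead:

    vkcts-example-job:
      variables:
        # assumed value; asks the HW CI init scripts to start Weston so a
        # Wayland display and XDG_RUNTIME_DIR exist for the WSI tests
        HWCI_START_WESTON: "1"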
<daniels>
bentiss: thankyou very much <3
<bentiss>
daniels: thank you too for also handling these things :)
<daniels>
bentiss: team work makes the dream work
<bentiss>
heh
MajorBiscuit has quit [Ping timeout: 480 seconds]
<bentiss>
actually, this might be easy (multiple pods for radosgw)
<bentiss>
and I don't know if placeholder-job has been set up
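Assuming the radosgw is deployed through Rook (an assumption, not stated above), running multiple pods is roughly a one-field change on the CephObjectStore; names and counts below are placeholders, and pool settings are omitted:

    apiVersion: ceph.rook.io/v1
    kind: CephObjectStore
    metadata:
      name: example-store        # placeholder
      namespace: rook-ceph       # placeholder
    spec:
      gateway:
        port: 80
        instances: 3             # run more radosgw pods to absorb the 503-prone load spikes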
<alatiera>
oh I see, thanks
* alatiera
hasn't looked enough into cloud-init yet
Kayden has quit [Quit: _> office]
abrotman has quit [Remote host closed the connection]
abrotman has joined #freedesktop
ybogdano has joined #freedesktop
mvlad has quit [Remote host closed the connection]
vkareh has quit [Quit: WeeChat 3.6]
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
<alatiera>
hmm added a group runner in gst and have an mr pipeline but it doesn't seem to pick up jobs from the mr
<alatiera>
it does pick up jobs in gst/gst though
<alatiera>
any ideas?
abrotman has quit [Remote host closed the connection]
abrotman has joined #freedesktop
<daniels>
bentiss: this is awesome, thanks a lot
<daniels>
alatiera: I've never tried to use group runners, sorry - it depends on the MR though as to whether the pipeline executes in user or group context - check the path for the pipeline to see whether it says alatiera/gst or gst/gst
<alatiera>
supposedly the parent context pipeline is an EE feature
<alatiera>
but checking if it gets triggered if I have permissions
<daniels>
'Moved to GitLab Premium in 13.9.'
<daniels>
daaaaaaaaaaaaamn.
<alatiera>
ha! yeap only fork pipelines it is
<alatiera>
bentiss there goes the dream
<alatiera>
guess I will need a shared-runner token for now
<alatiera>
but will keep the runner unpriv at least
<alatiera>
still treat it as throwaway until we figure out something though
<bentiss>
honestly, we can probably have something like marge-bot that forces the pipeline to run in the context of the target project if we need. But I'm surprised it's premium only now
<daniels>
sent you the token
* bentiss
really wants to have hookiedookie ready now
<alatiera>
bentiss marge will trigger a fork pipeline too
<daniels>
bentiss: oh, that's a good point, and gst does have marge-bot - it would just need to be modified to push into a throwaway branch of the parent project rather than the downstream project
<alatiera>
we'd have to clone -> close -> create new
<bentiss>
because we could have a /run-pipeline command, and we could have a way to copy the MR into the target and run a pipeline from it
<alatiera>
and then I guess have the bot merge manually
<daniels>
alatiera: ugh yeah, of course
<bentiss>
but we could also use a parent-child pipeline with the git clone policy set to never, where we provide our own sha
<alatiera>
ah hmm
<alatiera>
so the child pipeline would be on the fork I am guessing
<alatiera>
I think I recall an issue in gitlab for sharing the runners
<bentiss>
I honestly don't have the brain today for thinking through all the quirks, but basically we probably want: a user submits an MR in their fork, a developer in the project approves it, this triggers a new pipeline in the target project with the given sha, and we wait for the results
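A hedged sketch of how the child side of that parent-child idea could look in the target project: the default checkout is disabled and the job fetches the approved sha from the fork itself. FORK_URL, FORK_REF and FORK_SHA are assumed variables passed down by whatever triggers the pipeline:

    test-fork-mr:
      variables:
        GIT_STRATEGY: none                    # "git clone policy never": no automatic checkout
      script:
        - git init src && cd src
        - git fetch "$FORK_URL" "$FORK_REF"   # fetch the MR source branch from the fork
        - git checkout "$FORK_SHA"            # check out exactly the sha that was approved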
<__tim>
we could add the runner to marge and active devs, then it would work for merge requests
<__tim>
tedious though, and not great for drive-by patches, but what can you do
<bentiss>
actually even easier: if we set the pre-clone hook like I have in the issue, the shared runner will only execute if the MR is from a project member or is approved
<__tim>
right, I thought that wasn't quite ready yet
<bentiss>
The missing bits are: a proper repo to store that hook, and the runners' configuration to be extended
<bentiss>
so should be easy enough to implement, except I'm terrible at giving names to projects :)
<bentiss>
as long as you stick with grep and wget, it works
<bentiss>
alatiera: the nice thing is that groups can't be created by normal users. So we have some control over them, and all groups are technically valid ones
<alatiera>
we say that all toplevel groups are autoallowed since only admins can make them
<bentiss>
yeah
<alatiera>
will take a stab at it
<bentiss>
and we can probably have some contact with the people involved if there is an issue in a group
<bentiss>
anyway, it's been a long day here. I'll be back tomorrow I think
<alatiera>
the stupid thing is that we can't even do "maintainers only trigger pipelines" like it is on github
<alatiera>
sure we will get pipelines running automatically, but also the things we have to do to get there..
wizard5623 has joined #freedesktop
abrotman has quit [Remote host closed the connection]
abrotman has joined #freedesktop
danvet has quit [Ping timeout: 480 seconds]
abrotman has quit [Remote host closed the connection]
abrotman has joined #freedesktop
DodoGTA has quit [Quit: DodoGTA]
DodoGTA has joined #freedesktop
Kayden has joined #freedesktop
Kayden has quit [Remote host closed the connection]
Kayden has joined #freedesktop
DodoGTA has quit [Remote host closed the connection]
___nick___ has quit [Ping timeout: 480 seconds]
DodoGTA has joined #freedesktop
ybogdano is now known as Guest7599
Guest7588 is now known as ybogdano
Leopold_ has quit []
Leopold_ has joined #freedesktop
Leopold_ has quit [Remote host closed the connection]
Leopold_ has joined #freedesktop
<alatiera>
so remeber when I said that parent pipelines from forks are EE? I lied
<alatiera>
it is Premium which is different apparently
<alatiera>
enabled on selfhosted but needs a subscription on gitlab.com
<alatiera>
so we could have group runners just for maintainers/developers I guess
<alatiera>
and throw away shared runners
<jenatali>
alatiera: Are the Windows runners in a good enough state to re-enable for Mesa? Or should we wait longer before doing that?
<alatiera>
jenatali I've turned one of them back on
<alatiera>
there isn't any sign they were tampered with
<jenatali>
Got it, so half capacity at the moment
<alatiera>
however I will roll back the snapshot at some point within the week
<jenatali>
Ok
<alatiera>
or if the kids come back
wizard5623 has quit []
Guest7599 has quit [Ping timeout: 480 seconds]
anholt has quit [Quit: Leaving]
anholt has joined #freedesktop
DodoGTA has quit [Quit: DodoGTA]
DodoGTA has joined #freedesktop
DodoGTA has quit [Remote host closed the connection]
DodoGTA has joined #freedesktop
Kayden has quit [Quit: go home before fire alarm tests]