alatiera4 has joined #freedesktop
alatiera has quit [Ping timeout: 480 seconds]
adjtm has quit [Ping timeout: 480 seconds]
adjtm has joined #freedesktop
thaytan_ has joined #freedesktop
thaytan has quit []
thaytan_ has quit []
thaytan has joined #freedesktop
jarthur has quit [Ping timeout: 480 seconds]
ximion has quit []
chomwitt has joined #freedesktop
sumits has joined #freedesktop
danvet has joined #freedesktop
chomwitt has quit [Quit: Leaving]
chomwitt has joined #freedesktop
ximion has joined #freedesktop
<bentiss> oh well, I really need to move away from the minio cluster: we are getting a new wave of 500 errors while uploading artifacts
<bentiss> daniels: FWIW, backup of the old minio-artifacts done, will kill the EBS attached to it
<daniels> ooh, just in time
<bentiss> the problem is: given that there are no more transfers, why are we still getting 500????
<bentiss> (besides minio-cluster not being something we should use)
blue__penquin has joined #freedesktop
<tomeu> I have noticed that when we are getting 500s on artifacts upload, jobs on x86 runners take a long time to get picked up
ximion1 has joined #freedesktop
ximion has quit [Read error: Connection reset by peer]
<daniels> bentiss: hmm yeah, I was seeing when things were bad yesterday that we'd get a 'Job succeeded' message in the trace from runners, without the job actually being marked as succeeded; about 5min later there'd be a second 'job succeeded' message and it would finally be marked as successful
<daniels> is there anything I can do to help with MinIO?
<bentiss> daniels: I am scratching my head on how to move the data without spinning up too many s1.large machines
<bentiss> cause right now I only have one node with hdds in ceph, and I need 3
<bentiss> Ideally, I'd like to convert large-4 and large-3 into ceph, but for that I either need to scale down the md array or migrate its data elsewhere
<bentiss> actually, for large 4, it should be doable to scale down the md array
<bentiss> only 109 GB used on a 22 TB array, scaling down the fs should be doable
<daniels> can xfs shrink online though?
<daniels> I thought we'd need to take it offline
<bentiss> damn... that's what I am looking for
<bentiss> daniels: and I am sure you are going to tell me ext4 is capable of online shrinking?
<bentiss> looks like it's the same problem
<daniels> heh
<daniels> yeah, I haven't seen anything so far ... :(
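(For reference: XFS cannot be shrunk at all, online or offline, while ext4 can only be shrunk offline. A rough sketch of what an offline shrink of an ext4-on-md setup would look like; device names and sizes here are hypothetical, not the actual large-4 layout.)
    # hypothetical layout: /dev/md0 with ext4, mounted at /srv
    umount /srv
    e2fsck -f /dev/md0                         # required before any resize2fs shrink
    resize2fs /dev/md0 200G                    # shrink the fs well below the target size
    mdadm --grow /dev/md0 --array-size=250G    # then reduce the md array, leaving headroom
    resize2fs /dev/md0                         # grow the fs back to fill the smaller array
    mount /dev/md0 /srv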
<bentiss> daniels: question: do we care about the data (artifacts) we have on large-1, or do we just consider that week lost?
<daniels> just the last week, right? if so, I think it's ok to burn them to be honest, if it makes it easier to move quicker
<bentiss> cause I would gladly kill that machine and replace it with a new one for ceph
<bentiss> ack, when migrating to ceph, we'll try to get all the job logs from the backup (so without last week), and sync with the data from the current minio-cluster
* daniels nods
<bentiss> daniels: how does that sound: 1. I spin up 2 s1.large to get the quorum for hdd, 2. I kill large-1, 3. create the ceph cluster with data backed by hdd, 4. create the bucket for pages, 5. migrate the pages 6. test
<bentiss> 7. kill large-4
<bentiss> arf, large-4 is also used for the artifacts
<daniels> heh
<daniels> I think a couple of days or so would be fine for the transition
<bentiss> ok
<bentiss> the one thing that might be a little bit annoying is that the bucket name has to be generated by rook, and has a uuid in it
<bentiss> and the credentials are also generated
<bentiss> well, *maybe* we can create an admin user that can create 'normal buckets'
<daniels> wait, rook takes over minio config ... ?
<bentiss> with rook, you can manage buckets as k8s objects
<daniels> oh!
<daniels> ceph buckets, not minio buckets, sorry
* daniels goes to make a coffee
<bentiss> you have an ObjectBucketClaim CRD and it creates the bucket and users for you
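(A hedged sketch of what such a claim looks like; the names below are illustrative, not what was actually deployed.)
    cat <<'EOF' | kubectl apply -f -
    apiVersion: objectbucket.io/v1alpha1
    kind: ObjectBucketClaim
    metadata:
      name: gitlab-pages-claim               # hypothetical claim name
      namespace: rook-ceph
    spec:
      generateBucketName: fdo-gitlab-pages   # rook appends a uuid, hence the generated names
      storageClassName: rook-ceph-bucket     # hypothetical object-store StorageClass
    EOF
    # rook answers with a ConfigMap + Secret of the same name carrying the
    # bucket host/name and the generated access keys for consumers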
* bentiss nukes large-1
aleksander has joined #freedesktop
blue__penquin has quit []
chomwitt has quit [Ping timeout: 480 seconds]
<bentiss> daniels: nice! large-7 has 12 disks of 3.7 TB instead of 12 x 2TB
<daniels> !
aleksander has quit []
adjtm has quit [Quit: Leaving]
adjtm has joined #freedesktop
<shadeslayer> hi, I'm trying to access some of my pipelines here https://gitlab.freedesktop.org/shadeslayer/mesa/-/pipelines/327435 but I keep getting a 500 status code. One of these jobs uses the new-ish gitlab feature for displaying screenshots in the test tab
adjtm is now known as Guest310
adjtm has joined #freedesktop
<daniels> shadeslayer: yeah, I'm looking into why that is; you should be able to see the JUnit results as a screenshot from the MRs in any case
<daniels> and you should also be able to go to the job (not pipeline) view directly
Guest310 has quit [Ping timeout: 480 seconds]
<shadeslayer> daniels: hm, I get a 500 when trying to access artifacts too https://gitlab.freedesktop.org/shadeslayer/mesa/-/jobs/10188106/artifacts/browse
<daniels> yep
<daniels> bentiss: ^ this is getting a 404 back from minio.minio-artifacts.svc, even though it thinks it has a valid URL for them?!
<shadeslayer> https://gitlab.freedesktop.org/shadeslayer/mesa/-/pipelines/327435/test_report < fails to fetch the test suite too, so I'm guessing it's a minio issue somewhere?
<daniels> oh, that was last week ...
<daniels> can you please try again with a pipeline from today?
<daniels> long story
<shadeslayer> sure
<shadeslayer> daniels: hm, still a nope, https://gitlab.freedesktop.org/shadeslayer/mesa/-/pipelines/327435/test_report , the screenshots are supposed to be shown in the details for each test case right?
<shadeslayer> https://shadeslayer.pages.freedesktop.org/-/mesa/-/jobs/10275957/artifacts/results/junit.xml < afaict I'm writing it out correctly in system-out
<daniels> shadeslayer: none of your tests are failing
<daniels> shadeslayer: it only stores screenshots on failure, not success, which is good because ... storing a new screenshot in artifacts for every single trace on every single pipeline would overwhelm our storage pretty quickly
<bentiss> it's not like it's in a perfect shape either :) (our storage I mean) ;)
<shadeslayer> ahhhhh .... The documentation doesn't mention any of that
<daniels> I mean, just that a630 job is 348MB of trace screenshots ...
<bentiss> ouch..
<daniels> srsly
<bentiss> can we disable that at the instance level????
<daniels> so yeah, please don't upload unless they're actually different to expected :P
<daniels> bentiss: shrug, screenshot-in-JUnit doesn't do anything in and of itself
<daniels> bentiss: all it does is inline an artifact when you view MR test reports
<daniels> and I don't think we can disable artifacts at the instance level :P
<bentiss> arf
<shadeslayer> Yeah, archiving the screenshots isn't new
<bentiss> anyway, so I got things progressing: large-1 is out, large-2..4 are not holding any ceph data anymore, and large-5..7 are in and ready
<bentiss> daniels: ^^
<daniels> bentiss: ooh, exciting - is there anything I can help with?
<bentiss> so far, I managed to get a fdo-s3 object storage pool up (with data on hdd only), and understood that by creating a user, we can name the buckets as we wish
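(A hedged sketch of the user CRD involved; presumably roughly what produces the secret referenced a few lines down.)
    cat <<'EOF' | kubectl apply -f -
    apiVersion: ceph.rook.io/v1
    kind: CephObjectStoreUser
    metadata:
      name: gitlab
      namespace: rook-ceph
    spec:
      store: fdo-s3
      displayName: "gitlab object storage user"
    EOF
    # rook then creates the secret rook-ceph-object-user-fdo-s3-gitlab with
    # AccessKey/SecretKey, and buckets created with those credentials can be
    # named freely (no rook-generated uuid suffix)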
<daniels> shadeslayer: but archiving them on pass is new, right? the last time I looked at it, we were only archiving on failure
<bentiss> daniels: I'll get my first covid shot in ~1h, so not sure I'll get much further
<daniels> bentiss: oooh exciting! hope it goes smoothly for you. I'll have a look into what's on the cluster and see if I can progress towards having something we can use
<bentiss> daniels: there are 2 things we need to add to the k3s charts (in helm-gitlab-config/gitlab-k3s-provision/deploy/storage) -> https://paste.centos.org/view/2fe301db for our object storage and https://raw.githubusercontent.com/rook/rook/master/cluster/examples/kubernetes/ceph/toolbox.yaml
<bentiss> daniels: so if you can build up the 2 helmify-kustomize, that would speed things up
<bentiss> daniels: the other option is you start migrating the data
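(Until the helmify-kustomize wrappers exist, the manual equivalent of the toolbox part is just applying the upstream manifest linked above; a sketch, assuming the default rook-ceph namespace.)
    kubectl apply -f https://raw.githubusercontent.com/rook/rook/master/cluster/examples/kubernetes/ceph/toolbox.yaml
    # ceph/radosgw-admin commands can then be run from the toolbox pod, e.g.:
    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status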
<bentiss> the accesskeys are in kubectl -n rook-ceph get secret rook-ceph-object-user-fdo-s3-gitlab
<shadeslayer> daniels: not as far as I can tell
<bentiss> and the service IP is at rook-ceph-rgw-fdo-s3.rook-ceph.svc
<bentiss> daniels: FWIW, on server-2, mc is configured and has an alias 'test-ceph'
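(A sketch of how that alias would be set up from the generated credentials; secret key names follow rook's object-user convention, endpoint as above.)
    ACCESS_KEY=$(kubectl -n rook-ceph get secret rook-ceph-object-user-fdo-s3-gitlab \
        -o jsonpath='{.data.AccessKey}' | base64 -d)
    SECRET_KEY=$(kubectl -n rook-ceph get secret rook-ceph-object-user-fdo-s3-gitlab \
        -o jsonpath='{.data.SecretKey}' | base64 -d)
    mc alias set test-ceph http://rook-ceph-rgw-fdo-s3.rook-ceph.svc "$ACCESS_KEY" "$SECRET_KEY"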
<daniels> shadeslayer: uhhhh ... can we please fix that urgently
<daniels> bentiss: heh right, I think I might start doing the data migration from minio-artifacts + minio-pages first so that's going in the background whilst I try to understand how to write my first kustomize :P
<shadeslayer> Sure, I can look into it
<bentiss> daniels: I just started `mc mirror --watch minio-pages test-ceph` here, should be finished quickly enough
<daniels> bentiss: awesome, I can do artifacts + old-minio then?
<daniels> shadeslayer: thankyou :)
<bentiss> daniels: for kustomize, I usually just grab vector-kustomize I think, that one has just the basic stuff in it
* daniels nods
<bentiss> the mirroring of pages was *quick*
<bentiss> like I thought it stalled, but nope, it was just done :)
<bentiss> daniels: the other thing to do is to change the pages secret, deploy it on packet-HA, and see if gitlab can get the data out of it
<bentiss> oh, and yes, please start artifacts (old minio is going to be tricky IMO)
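(Presumably the same pattern as the pages mirror above, using the aliases already configured on server-2.)
    # mirror every bucket under the minio-artifacts alias into the new ceph store
    mc mirror --watch minio-artifacts test-ceph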
<shadeslayer> daniels: uh, so the artifacts from the a630-traces job are only 35MB for me locally
<daniels> shadeslayer: huh weird, I was just extrapolating from the 0ad example being 5.99MB in and of itself
<shadeslayer> I guess it depends on the resolution the original trace was captured in?
<daniels> true true
<shadeslayer> the 0ad trace is captured in 1440p q.q
<shadeslayer> daniels: it's weird that there are captured images because the piglit-traces-test job only artifacts on failure, and the a630-traces job derives from that (indirectly)
* bentiss is afk fwiw
<daniels> bentiss: bonne chance!
<daniels> shadeslayer: hmm yeah that is odd ... but it would also be good if, even when the job fails, we only store the failed images per trace
<daniels> not store every single trace image if only one failed
<daniels> fwiw, :artifacts=>{:when=>"always", :name=>"mesa_${CI_JOB_NAME}", :paths=>["results/", "serial*.txt"], :reports=>{:junit=>["results/junit.xml"]}, :exclude=>["results/*.shader_cache"]}
<daniels> which is set by .baremetal-test
<daniels> so that makes sense, just need to prune the images to only the failed ones before we leave the job ctx
ximion1 has quit []
yk has joined #freedesktop
<shadeslayer> so the question is why the images aren't dropped on a successful replay
<shadeslayer> lovely
<daniels> hmm, no agomez on IRC
<daniels> oh wait, my mental map fails me
<daniels> tanty: ^ any idea why we unconditionally pass --keep-image to the Piglit replayer now, rather than only artifacting trace images on failure?
<tanty> let me check and refresh my mind ...
mynacol[m] has joined #freedesktop
<shadeslayer> I created https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11088 in case we want to only artifact on fail
<gitlab-bot> Mesa issue (Merge request) 11088 in mesa "ci: Do not keep images if trace replay is successful" [Ci, Opened]
alatiera4 is now known as alatiera
<bentiss> daniels: I'm back. no side effects for the moment :)
<bentiss> daniels: so have you been able to do things?
<vsyrjala> anongit down?
<tanty> diff previews have not been working for a while in gitlab today ...
<daniels> bentiss: wow, you're back quick - I've only just got out of a call after quickly having lunch, so starting now
<bentiss> daniels: OK, I am working on testing packet-HA with the new pages bucket FWIW
<bentiss> but I'm hitting the no matches for kind "Issuer" in version "certmanager.k8s.io/v1alpha1"
<bentiss> I'll just re-enable cert-manager I think
mynacol has joined #freedesktop
<bentiss> daniels: before starting the sync between minio-artifacts and ceph, we should delete the fdo-gitlab-pages/ bucket in it
<bentiss> OK, it's empty now
asimiklit has joined #freedesktop
<daniels> bentiss: cool, it's going now - and using 5G rather than cable since that's less awful
<bentiss> daniels: using ceph as a page source is working fine, I am applying the config to the old cluster and will deploy a pages site to check if everything is still fine
<bentiss> daniels: ssh to server-2 and do the mc mirror from there
<bentiss> so the data stays on packet all the time, no?
* bentiss is not sure why 5G would help here
<bentiss> 502 expected in the next few minutes
ttt has joined #freedesktop
<daniels> bentiss: ah yes, I forgot that we had kube creds on server-2
chomwitt has joined #freedesktop
<bentiss> daniels: mc is already configured for both normally
<daniels> ah, I'd not realised that
<daniels> wow is it ever slow though - like a minute-long stall after every new file?
<bentiss> daniels: well, it first processes all the files, then starts them as a batch
<bentiss> and given how many files there are... it can take some time at the beginning before it kicks in
<bentiss> ok, pages deployment validated \o/
<daniels> woohoo
ttt has quit []
<bentiss> daniels: should I delete *minio*-pages now?
<daniels> sorry, was just trying to figure out why mc mirror was taking much longer than before to spin up
<daniels> erm yeah, might as well nuke it if the pages daemon on the old cluster is already pointing at ceph?
<bentiss> yep
<bentiss> and ok!
<daniels> thanks :)
asimiklit has quit [Quit: Page closed]
mynacol has quit []
mynacol[m] has left #freedesktop [#freedesktop]
* bentiss starts taking care of the backups
Guest205 is now known as blue_penquin
asimiklit has joined #freedesktop
<bentiss> hmm, looks like the policy for removing files older than 7 days did not kick in...
<bentiss> anyway
<bentiss> there was one legal hold, but 1621574601_2021_05_21_13.11.2_gitlab_backup.tar should have been removed :(
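(A hedged way to check this from mc; the alias and bucket names below are illustrative, not the actual backup bucket.)
    mc ilm ls backup/fdo-gitlab-backups                  # list the lifecycle/expiry rules
    mc legalhold info backup/fdo-gitlab-backups/1621574601_2021_05_21_13.11.2_gitlab_backup.tar
    # a legal hold blocks expiry; clearing it lets the 7-day rule catch up
    mc legalhold clear backup/fdo-gitlab-backups/1621574601_2021_05_21_13.11.2_gitlab_backup.tar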
<asimiklit> daniels: Hi, I would like to ask a question regarding the account @asimiklit, which was removed from gitlab.freedesktop.org last weekend. Do you know anything about it?
<asimiklit> daniels: Forgot to mention, my colleague forwarded me emails indicating that you closed all my MRs last weekend, which is why I'm writing to you.
<asimiklit> daniels: I am just trying to find out why that account was removed ...
<daniels> asimiklit: oh my god, that wasn't deliberate but just a huge mistake. please accept my apologies. let me pull some backups and try to see what I can recreate
<bentiss> sigh, copying today's backup interrupted half-way through it :/
<asimiklit> daniels: Huh, at least it wasn't hacked as I feared) Don't worry too much, it's just an account, but if there is some possibility to recreate something it would be great)
<bentiss> daniels: oh, well, I'll deal with backups tomorrow I think
<daniels> asimiklit: not hacked, just dumb - sorry
<daniels> asimiklit: I'll see what I can get for you
<daniels> bentiss: np, I'll be around all night so I'll shift the backups
<bentiss> daniels: ok, thanks.
<bentiss> maybe using s3cmd will have a better chance of success
<daniels> bentiss: btw, any thoughts on what we should do about OPA? maybe a MinIO gateway just to do the policy? I couldn't see anything in Ceph objstore about policy callouts
<bentiss> daniels: a Minio gateway is actually a very nice idea
<bentiss> I was thinking of declaring a tenant for minio-packet, but the gateway is nice
<daniels> ok, cool :)
<bentiss> daniels: have you stopped mirroring artifacts?
<daniels> bentiss: no?
<bentiss> seems like everything stalls
<daniels> though it has stalled ... eyah
<bentiss> we are writing at 374 KiB/s, that's not good....
<bentiss> anyway, got family coming by this evening, got to go afk
<bentiss> good luck with the transfer!
<daniels> have a fun night!
chomwitt has quit [Ping timeout: 480 seconds]
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
node1 has joined #freedesktop
node1 has left #freedesktop [#freedesktop]
<MrCooper> daniels: doesn't https://gitlab.freedesktop.org/mesa/mesa/-/pipelines/new work for running a full pipeline (including the pages job) from the Mesa main branch?
<imirkin> daniels: cgit down again
<daniels> MrCooper: didn't think so
<daniels> imirkin: kicked
<imirkin> thanks
<daniels> don't worry, it's going to fail on artifacts :P
<daniels> but good to know, I didn't realise manually-triggered pipelines did that, thanks
<imirkin> daniels: dunno if it's expected, but cgit is now refusing to accept connections (as opposed to hanging). dunno how long it's supposed to take to come back up...
<daniels> imirkin: it can take a little while
<imirkin> kk
<daniels> like 5-10min sometimes
<imirkin> sounds good
<MrCooper> daniels: they behave as if all files had been modified (same as pipelines for a newly created branch)
<daniels> nice
<imirkin> daniels: still not responding (but now hanging rather than connection refused)
<daniels> progress!
<imirkin> daniels: well, it's like 15 mins after you kicked it, so...
* daniels applies a more stern hammer
<imirkin> daniels: much better
<daniels> \o/
chomwitt has joined #freedesktop
asimiklit has quit [Remote host closed the connection]
aaronp has joined #freedesktop
danvet has quit [Ping timeout: 480 seconds]
danvet has joined #freedesktop
<bentiss> daniels: so it seems the mc mirror is hanging because it's done. I mean, it also copied everything in /tmp, which makes me think it's done
<daniels> hmm, it's completed on the artifacts chart
<bentiss> we have all the job logs since 2021_05_28
<daniels> I did have to cordon + stop + restart + uncordon large-2, because it had gone into I/O death again
<bentiss> chart?
<daniels> *bucket
<bentiss> oh...
<bentiss> s3cmd seems way more efficient than mc.... 273 MiB/s to upload the old logs (well, it's using data from a regular dir too)
<daniels> heh ...
<daniels> mc mirror is doing 150MiB/sec here, but it's very very heavily biased by large vs. small files
<bentiss> true
<bentiss> OTOH, I hit ctrl-C to change the arguments (and use --progress), and it's hanging now :)
<bentiss> FWIW, s3cmd has an '--include' arg which allows grabbing only the job.log files
<bentiss> whereas mc only has --exclude
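(For reference: s3cmd's --include only re-includes paths that an --exclude rule dropped, so restricting a sync to the job logs would look roughly like the sketch below, while mc can only express the inverse; patterns and paths are illustrative.)
    s3cmd sync --exclude '*' --include '*job.log' fdo-gitlab-artifacts/ s3://fdo-gitlab-artifacts
    mc mirror --exclude '*.zip' minio-artifacts/fdo-gitlab-artifacts test-ceph/fdo-gitlab-artifacts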
<bentiss> actually, now that you mention it, maybe it hasn't started and the 273 MiB/s I saw was you :)
<daniels> haha
<daniels> this is weird though:
<daniels> again.: cause(too few shards given)
<bentiss> indeed
<bentiss> while copying from the old minio to the new cluster, I had a bunch of errors when the body content length was 0, maybe that's the followup
<bentiss> daniels: I think I'll go to bed. FWIW, I am running `s3cmd sync -v --progress --include job.log fdo-gitlab-artifacts/ s3://fdo-gitlab-artifacts` on large-2, which supposedly will upload all the job.log files from the backup to the new ceph data
<bentiss> daniels: also if you have finished mirroring the artifacts (that are currently on minio-artifacts), feel free to change the globals.yaml settings with the new ceph storage, you can reuse freedesktop-prod-ceph-s3-key for the secret with the connection parameters
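(For reference, a sketch of the shape such a connection secret typically has with the GitLab chart; endpoint and keys are illustrative, the real values come from the rook user secret.)
    cat > connection.yaml <<'EOF'
    provider: AWS
    region: us-east-1
    host: rook-ceph-rgw-fdo-s3.rook-ceph.svc
    endpoint: http://rook-ceph-rgw-fdo-s3.rook-ceph.svc
    path_style: true
    aws_access_key_id: <AccessKey from the rook user secret>
    aws_secret_access_key: <SecretKey from the rook user secret>
    EOF
    kubectl create secret generic freedesktop-prod-ceph-s3-key --from-file=connection=connection.yaml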
aaronp has quit [Ping timeout: 480 seconds]
danvet has quit [Ping timeout: 480 seconds]
jpnurmi has joined #freedesktop
jstein has joined #freedesktop
ximion has joined #freedesktop
chomwitt has quit [Ping timeout: 480 seconds]
jstein has quit [Ping timeout: 480 seconds]
SanchayanMaity has quit [Remote host closed the connection]
austriancoder has quit [Read error: Connection reset by peer]
austriancoder has joined #freedesktop
SanchayanMaity has joined #freedesktop
aaronp has joined #freedesktop
aaronp has quit [Remote host closed the connection]