alatiera4 has joined #freedesktop
alatiera has quit [Ping timeout: 480 seconds]
adjtm has quit [Ping timeout: 480 seconds]
adjtm has joined #freedesktop
thaytan_ has joined #freedesktop
thaytan has quit []
thaytan_ has quit []
thaytan has joined #freedesktop
jarthur has quit [Ping timeout: 480 seconds]
ximion has quit []
chomwitt has joined #freedesktop
sumits has joined #freedesktop
danvet has joined #freedesktop
chomwitt has quit [Quit: Leaving]
chomwitt has joined #freedesktop
ximion has joined #freedesktop
<bentiss> oh well, I really need to move away from the minio cluster: we are getting a new wave of 500 errors while uploading artifacts
<bentiss> daniels: FWIW, backup of the old minio-artifacts done, will kill the EBS attached to it
<daniels> ooh, just in time
<bentiss> the problem is: given that there are no more transfers, why are we still getting 500????
<bentiss> (besides minio-cluster not being something we should use)
blue__penquin has joined #freedesktop
<tomeu> I have noticed that when we are getting 500s on artifacts upload, jobs on x86 runners take a long time to get picked up
ximion1 has joined #freedesktop
ximion has quit [Read error: Connection reset by peer]
<daniels> bentiss: hmm yeah, I was seeing when things were bad yesterday that we'd get a 'Job succeeded' message in the trace from runners, without the job actually being marked as succeeded; about 5min later there'd be a second 'job succeeded' message and it would finally be marked as successful
<daniels> is there anything I can do to help with MinIO?
<bentiss> daniels: I am scratching my head on how to move the data without spinning up too many s1.large machines
<bentiss> cause right now I only have one node with hdds in ceph, and I need 3
<bentiss> Ideally, I'd like to convert large-4 and large-3 into ceph, but for that I either need to scale down the md array or migrate its data elsewhere
<bentiss> actually, for large 4, it should be doable to scale down the md array
<bentiss> only 109 GB used on a 22 TB array, scaling down the fs should be doable
<daniels> can xfs shrink online though?
<daniels> I thought we'd need to take it offline
<bentiss> damn... that's what I am looking for
<bentiss> daniels: and I am sure you are going to tell me ext4 is capable of online shrinking?
<bentiss> looks like it's the same problem
<daniels> heh
<daniels> yeah, I haven't seen anything so far ... :(
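(For reference: XFS cannot be shrunk at all, online or offline, while ext4 can only be shrunk offline. A rough sketch of what an offline shrink of an ext4-on-md setup would look like; device names and sizes here are hypothetical, not the actual large-4 layout.)
    # hypothetical layout: /dev/md0 with ext4, mounted at /srv
    umount /srv
    e2fsck -f /dev/md0                         # required before any resize2fs shrink
    resize2fs /dev/md0 200G                    # shrink the fs well below the target size
    mdadm --grow /dev/md0 --array-size=250G    # then reduce the md array, leaving headroom
    resize2fs /dev/md0                         # grow the fs back to fill the smaller array
    mount /dev/md0 /srv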
<bentiss> daniels: question: do we care about the data (artifacts) we have on large-1, or do we just consider that week lost?
<daniels> just the last week, right? if so, I think it's ok to burn them to be honest, if it makes it easier to move quicker
<bentiss> cause I would gladly kill that machine and replace it with a new one for ceph
<bentiss> ack, when migrating to ceph, we'll try to get all the job logs from the backup (so without last week), and sync with the data from the current minio-cluster
* daniels nods
<bentiss> daniels: how does that sound: 1. I spin up 2 s1.large to get the quorum for hdd, 2. I kill large-1, 3. create the ceph cluster with data backed by hdd, 4. create the bucket for pages, 5. migrate the pages 6. test
<bentiss> 7. kill large-4
<bentiss> arf, large-4 is also used for the artifacts
<daniels> heh
<daniels> I think a couple of days or so would be fine for the transition
<bentiss> ok
<bentiss> the one thing that might be a little bit annoying is that the bucket name has to be generated by rook, and has a uuid in it
<bentiss> and the credentials are also generated
<bentiss> well, *maybe* we can create an admin user that can create 'normal buckets'
<daniels> wait, rook takes over minio config ... ?
<bentiss> with rook, you can manage buckets as k8s objects
<daniels> oh!
<daniels> ceph buckets, not minio buckets, sorry
* daniels goes to make a coffee
<bentiss> you have an ObjectBucketClaim CRD and it creates the bucket and users for you
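(A hedged sketch of what such a claim looks like; the names below are illustrative, not what was actually deployed.)
    cat <<'EOF' | kubectl apply -f -
    apiVersion: objectbucket.io/v1alpha1
    kind: ObjectBucketClaim
    metadata:
      name: gitlab-pages-claim               # hypothetical claim name
      namespace: rook-ceph
    spec:
      generateBucketName: fdo-gitlab-pages   # rook appends a uuid, hence the generated names
      storageClassName: rook-ceph-bucket     # hypothetical object-store StorageClass
    EOF
    # rook answers with a ConfigMap + Secret of the same name carrying the
    # bucket host/name and the generated access keys for consumers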
* bentiss nukes large-1
aleksander has joined #freedesktop
blue__penquin has quit []
chomwitt has quit [Ping timeout: 480 seconds]
<bentiss> daniels: nice! large-7 has 12 disks of 3.7 TB instead of 12 x 2TB
<daniels> !
aleksander has quit []
adjtm has quit [Quit: Leaving]
adjtm has joined #freedesktop
<shadeslayer> hi, I'm trying to access some of my pipelines here https://gitlab.freedesktop.org/shadeslayer/mesa/-/pipelines/327435 but I keep getting a 500 status code. One of these jobs uses the new-ish gitlab feature for displaying screenshots in the test tab
adjtm is now known as Guest310
adjtm has joined #freedesktop
<daniels> shadeslayer: yeah, I'm looking into why that is; you should be able to see the JUnit results as a screenshot from the MRs in any case
<daniels> and you should also be able to go to the job (not pipeline) view directly
Guest310 has quit [Ping timeout: 480 seconds]
<shadeslayer> daniels: hm, I get a 500 when trying to access artifacts too https://gitlab.freedesktop.org/shadeslayer/mesa/-/jobs/10188106/artifacts/browse
<daniels> yep
<daniels> bentiss: ^ this is getting a 404 back from minio.minio-artifacts.svc, even though it thinks it has a valid URL for them?!
<shadeslayer> https://gitlab.freedesktop.org/shadeslayer/mesa/-/pipelines/327435/test_report < fails to fetch the test suite too, so I'm guessing it's a minio issue somewhere?
<daniels> oh, that was last week ...
<daniels> can you please try again with a pipeline from today?
<daniels> long story
<shadeslayer> sure
<shadeslayer> daniels: hm, still a nope, https://gitlab.freedesktop.org/shadeslayer/mesa/-/pipelines/327435/test_report , the screenshots are supposed to be shown in the details for each test case right?
<shadeslayer> https://shadeslayer.pages.freedesktop.org/-/mesa/-/jobs/10275957/artifacts/results/junit.xml < afaict I'm writing it out correctly in system-out
<daniels> shadeslayer: none of your tests are failing
<daniels> shadeslayer: it only stores screenshots on failure, not success, which is good because ... storing a new screenshot in artifacts for every single trace on every single pipeline would overwhelm our storage pretty quickly
<bentiss> it's not like it's in a perfect shape either :) (our storage I mean) ;)
<shadeslayer> ahhhhh .... The documentation doesn't mention any of that
<daniels> I mean, just that a630 job is 348MB of trace screenshots ...
<bentiss> ouch..
<daniels> srsly
<bentiss> can we disable that at the instance level????
<daniels> so yeah, please don't upload unless they're actually different to expected :P
<daniels> bentiss: shrug, screenshot-in-JUnit doesn't do anything in and of itself
<daniels> bentiss: all it does is inline an artifact when you view MR test reports
<daniels> and I don't think we can disable artifacts at the instance level :P
<bentiss> arf
<shadeslayer> Yeah, archiving the screenshots isn't new
<bentiss> anyway, so I got things progressing: large-1 is out, large-2..4 are not holding any ceph data anymore, and large-5..7 are in and ready
<bentiss> daniels: ^^
<daniels> bentiss: ooh, exciting - is there anything I can help with?
<bentiss> so far, I managed to get a fdo-s3 object storage pool up (with data on hdd only), and understood that by creating a user, we can name the buckets as we wish
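(A hedged sketch of the user CRD involved; presumably roughly what produces the secret referenced a few lines down.)
    cat <<'EOF' | kubectl apply -f -
    apiVersion: ceph.rook.io/v1
    kind: CephObjectStoreUser
    metadata:
      name: gitlab
      namespace: rook-ceph
    spec:
      store: fdo-s3
      displayName: "gitlab object storage user"
    EOF
    # rook then creates the secret rook-ceph-object-user-fdo-s3-gitlab with
    # AccessKey/SecretKey, and buckets created with those credentials can be
    # named freely (no rook-generated uuid suffix)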
<daniels> shadeslayer: but archiving them on pass is new, right? the last time I looked at it, we were only archiving on failure
<bentiss> daniels: I'll get my first covid shot in ~1h, so not sure I'll get much further
<daniels> bentiss: oooh exciting! hope it goes smoothly for you. I'll have a look into what's on the cluster and see if I can progress towards having something we can use
<bentiss> daniels: there are 2 things we need to add to the k3s charts (in helm-gitlab-config/gitlab-k3s-provision/deploy/storage) -> https://paste.centos.org/view/2fe301db for our object storage and https://raw.githubusercontent.com/rook/rook/master/cluster/examples/kubernetes/ceph/toolbox.yaml
<bentiss> daniels: so if you can build up the 2 helmify-kustomize, that would speed things up
<bentiss> daniels: the other option is you start migrating the data
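(Until the helmify-kustomize wrappers exist, the manual equivalent of the toolbox part is just applying the upstream manifest linked above; a sketch, assuming the default rook-ceph namespace.)
    kubectl apply -f https://raw.githubusercontent.com/rook/rook/master/cluster/examples/kubernetes/ceph/toolbox.yaml
    # ceph/radosgw-admin commands can then be run from the toolbox pod, e.g.:
    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status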
<bentiss> the accesskeys are in kubectl -n rook-ceph get secret rook-ceph-object-user-fdo-s3-gitlab
<shadeslayer> daniels: not as far as I can tell
<bentiss> and the service IP is at rook-ceph-rgw-fdo-s3.rook-ceph.svc
<bentiss> daniels: FWIW, on server-2, mc is configured and has an alias 'test-ceph'
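(A sketch of how that alias would be set up from the generated credentials; secret key names follow rook's object-user convention, endpoint as above.)
    ACCESS_KEY=$(kubectl -n rook-ceph get secret rook-ceph-object-user-fdo-s3-gitlab \
        -o jsonpath='{.data.AccessKey}' | base64 -d)
    SECRET_KEY=$(kubectl -n rook-ceph get secret rook-ceph-object-user-fdo-s3-gitlab \
        -o jsonpath='{.data.SecretKey}' | base64 -d)
    mc alias set test-ceph http://rook-ceph-rgw-fdo-s3.rook-ceph.svc "$ACCESS_KEY" "$SECRET_KEY"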
<daniels> shadeslayer: uhhhh ... can we please fix that urgently
<daniels> bentiss: heh right, I think I might start doing the data migration from minio-artifacts + minio-pages first so that's going in the background whilst I try to understand how to write my first kustomize :P
<shadeslayer> Sure, I can look into it
<bentiss> daniels: I just started `mc mirror --watch minio-pages test-ceph` here, should be finished quickly enough
<daniels> bentiss: awesome, I can do artifacts + old-minio then?
<daniels> shadeslayer: thankyou :)
<bentiss> daniels: for kustomize, I usually just grab vector-kustomize I think, that one has just the basic stuff in it
* daniels nods
<bentiss> the mirroring of pages was *quick*
<bentiss> like I thought it stalled, but nope, it was just done :)
<bentiss> daniels: the other thing to do is to change the pages secret, deploy it on packet-HA, and see if gitlab can get the data out of it
<bentiss> oh, and yes, please start artifacts (old minio is going to be tricky IMO)
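(Presumably the same pattern as the pages mirror above, using the aliases already configured on server-2.)
    # mirror every bucket under the minio-artifacts alias into the new ceph store
    mc mirror --watch minio-artifacts test-ceph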
<shadeslayer> daniels: uh, so the artifacts from the a630-traces job are only 35MB for me locally
<daniels> shadeslayer: huh weird, I was just extrapolating from the 0ad example being 5.99MB in and of itself
<shadeslayer> I guess it depends on the resolution the original trace was captured in?
<daniels> true true
<shadeslayer> the 0ad trace is captured in 1440p q.q
<shadeslayer> daniels: it's weird that there are captured images because the piglit-traces-test job only artifacts on failure, and the a630-traces job derives from that (indirectly)
* bentiss is afk fwiw
<daniels> bentiss: bonne chance!
<daniels> shadeslayer: hmm yeah that is odd ... but it would also be good if, even when the job fails, we only store the failed images per trace
<daniels> not store every single trace image if only one failed
<daniels> fwiw, :artifacts=>{:when=>"always", :name=>"mesa_${CI_JOB_NAME}", :paths=>["results/", "serial*.txt"], :reports=>{:junit=>["results/junit.xml"]}, :exclude=>["results/*.shader_cache"]}
<daniels> which is set by .baremetal-test
<daniels> so that makes sense, just need to prune the images to only the failed ones before we leave the job ctx
ximion1 has quit []
yk has joined #freedesktop
<shadeslayer> so the question is why the images aren't dropped on a successful replay
<shadeslayer> lovely
<daniels> hmm, no agomez on IRC
<daniels> oh wait, my mental map fails me
<daniels> tanty: ^ any idea why we unconditionally pass --keep-image to the Piglit replayer now, rather than only artifacting trace images on failure?
<tanty> let me check and refresh my mind ...
mynacol[m] has joined #freedesktop
<shadeslayer> I created https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11088 in case we want to only artifact on fail
<gitlab-bot> Mesa issue (Merge request) 11088 in mesa "ci: Do not keep images if trace replay is successful" [Ci, Opened]
alatiera4 is now known as alatiera
<bentiss> daniels: I'm back. no side effects for the moment :)
<bentiss> daniels: so have you been able to do things?
<vsyrjala> anongit down?
<tanty> diff previews have not been working for a while in gitlab today ...
<daniels> bentiss: wow, you're back quick - I've only just got out of a call after quickly having lunch, so starting now
<bentiss> daniels: OK, I am working on testing packet-HA with the new pages bucket FWIW
<bentiss> but I'm hitting the no matches for kind "Issuer" in version "certmanager.k8s.io/v1alpha1"
<bentiss> I'll just re-enable cert-manager I think
mynacol has joined #freedesktop
<bentiss> daniels: before starting the sync between minio-artifacts and ceph, we should delete the fdo-gitlab-pages/ bucket in it
<bentiss> OK, it's empty now
asimiklit has joined #freedesktop
<daniels> bentiss: cool, it's going now - and using 5G rather than cable since that's less awful
<bentiss> daniels: using ceph as a page source is working fine, I am applying the config to the old cluster and will deploy a pages site to check if everything is still fine
<bentiss> daniels: ssh to server-2 and do the mc mirror from there
<bentiss> so the data stays on packet all the time, no?
* bentiss is not sure why 5G would help here
<bentiss> 502 expected in the next few minutes
ttt has joined #freedesktop
<daniels> bentiss: ah yes, I forgot that we had kube creds on server-2
chomwitt has joined #freedesktop
<bentiss> daniels: mc is already configured for both normally
<daniels> ah, I'd not realised that
<daniels> wow is it ever slow though - like a minute-long stall after every new file?
<bentiss> daniels: well, it first processes all the files, then starts them as a batch
<bentiss> and given how many files there are... it can take some time at the beginning before it kicks in
<bentiss> ok, pages deployment validated \o/
<daniels> woohoo
ttt has quit []
<bentiss> daniels: should I delete *minio*-pages now?
<daniels> sorry, was just trying to figure out why mc mirror was taking much longer than before to spin up
<daniels> erm yeah, might as well nuke it if the pages daemon on the old cluster is already pointing at ceph?
<bentiss> yep
<bentiss> and ok!
<daniels> thanks :)
asimiklit has quit [Quit: Page closed]
mynacol has quit []
mynacol[m] has left #freedesktop [#freedesktop]
* bentiss starts taking care of the backups
Guest205 is now known as blue_penquin
asimiklit has joined #freedesktop
<bentiss> hmm, looks like the policy for removing files older than 7 days did not kick in...
<bentiss> anyway
<bentiss> there was one legal hold, but 1621574601_2021_05_21_13.11.2_gitlab_backup.tar should have been removed :(
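(A hedged way to check this from mc; the alias and bucket names below are illustrative, not the actual backup bucket.)
    mc ilm ls backup/fdo-gitlab-backups                  # list the lifecycle/expiry rules
    mc legalhold info backup/fdo-gitlab-backups/1621574601_2021_05_21_13.11.2_gitlab_backup.tar
    # a legal hold blocks expiry; clearing it lets the 7-day rule catch up
    mc legalhold clear backup/fdo-gitlab-backups/1621574601_2021_05_21_13.11.2_gitlab_backup.tar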
<asimiklit> daniels: Hi, I would like to ask a question regarding the account @asimiklit, which was removed from gitlab.freedesktop.org last weekend. Do you know anything about it?
<asimiklit> daniels: Forgot to mention, my colleague forwarded me emails indicating that you closed all my MRs last weekend, which is why I'm writing to you.
<asimiklit> daniels: I am just trying to find out why that account was removed ...
<daniels> asimiklit: oh my god, that wasn't deliberate but just a huge mistake. please accept my apologies. let me pull some backups and try to see what I can recreate
<bentiss> sigh, copying today's backup interrupted half-way through it :/
<asimiklit> daniels: Huh, at least it wasn't hacked as I feared) Don't worry too much, it's just an account, but if there is some possibility to recreate something it would be great)
<bentiss> daniels: oh, well, I'll deal with backups tomorrow I think
<daniels> asimiklit: not hacked, just dumb - sorry
<daniels> asimiklit: I'll see what I can get for you
<daniels> bentiss: np, I'll be around all night so I'll shift the backups
<bentiss> daniels: ok, thanks.
<bentiss> maybe using s3cmd will have a better chance of success
<daniels> bentiss: btw, any thoughts on what we should do about OPA? maybe a MinIO gateway just to do the policy? I couldn't see anything in Ceph objstore about policy callouts
<bentiss> daniels: a Minio gateway is actually a very nice idea
<bentiss> I was thinking of declaring a tenant for minio-packet, but the gateway is nice
<daniels> ok, cool :)
<bentiss> daniels: have you stopped mirroring artifacts?
<daniels> bentiss: no?
<bentiss> seems like everything stalls
<daniels> though it has stalled ... eyah
<bentiss> we are writing at 374 KiB/s, that's not good....
<bentiss> anyway, got family coming by this evening, got to go afk
<bentiss> good luck with the transfer!
<daniels> have a fun night!
chomwitt has quit [Ping timeout: 480 seconds]
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
node1 has joined #freedesktop
node1 has left #freedesktop [#freedesktop]
<MrCooper> daniels: doesn't https://gitlab.freedesktop.org/mesa/mesa/-/pipelines/new work for running a full pipeline (including the pages job) from the Mesa main branch?
<imirkin> daniels: cgit down again
<daniels> MrCooper: didn't think so
<daniels> imirkin: kicked
<imirkin> thanks
<daniels> don't worry, it's going to fail on artifacts :P
<daniels> but good to know, I didn't realise manually-triggered pipelines did that, thanks
<imirkin> daniels: dunno if it's expected, but cgit is now refusing to accept connections (as opposed to hanging). dunno how long it's supposed to take to come back up...
<daniels> imirkin: it can take a little while
<imirkin> kk
<daniels> like 5-10min sometimes
<imirkin> sounds good
<MrCooper> daniels: they behave as if all files had been modified (same as pipelines for a newly created branch)
<daniels> nice
<imirkin> daniels: still not responding (but now hanging rather than connection refused)
<daniels> progress!
<imirkin> daniels: well, it's like 15 mins after you kicked it, so...
* daniels applies a more stern hammer
<imirkin> daniels: much better
<daniels> \o/
chomwitt has joined #freedesktop
asimiklit has quit [Remote host closed the connection]
aaronp has joined #freedesktop
danvet has quit [Ping timeout: 480 seconds]
danvet has joined #freedesktop
<bentiss> daniels: so it seems the mc mirror is hanging because it's done. I mean, it also copied everything in /tmp, which makes me think it's done
<daniels> hmm, it's completed on the artifacts chart
<bentiss> we have all the job logs since 2021_05_28
<daniels> I did have to cordon + stop + restart + uncordon large-2, because it had gone into I/O death again
<bentiss> chart?
<daniels> *bucket
<bentiss> oh...
<bentiss> s3cmd seems way more efficient than mc.... 273 MiB/s to upload the old logs (well, it's using data from a regular dir too)
<daniels> heh ...
<daniels> mc mirror is doing 150MiB/sec here, but it's very very heavily biased by large vs. small files
<bentiss> true
<bentiss> OTOH, I hit ctrl-C to change the arguments (and use --progress), and it's hanging now :)
<bentiss> FWIW, s3cmd has an '--include' arg which allows grabbing only the job.log files
<bentiss> whereas mc only has --exclude
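(For reference: s3cmd's --include only re-includes paths that an --exclude rule dropped, so restricting a sync to the job logs would look roughly like the sketch below, while mc can only express the inverse; patterns and paths are illustrative.)
    s3cmd sync --exclude '*' --include '*job.log' fdo-gitlab-artifacts/ s3://fdo-gitlab-artifacts
    mc mirror --exclude '*.zip' minio-artifacts/fdo-gitlab-artifacts test-ceph/fdo-gitlab-artifacts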
<bentiss> actually, now that you mention it, maybe it hasn't started and the 273 MiB/s I saw was you :)
<daniels> haha
<daniels> this is weird though:
<daniels> again.: cause(too few shards given)
<bentiss> indeed
<bentiss> while copying from the old minio to the new cluster, I had a bunch of errors when the body content length was 0, maybe that's the followup
<bentiss> daniels: I think I'll go to bed. FWIW, I am running `s3cmd sync -v --progress --include job.log fdo-gitlab-artifacts/ s3://fdo-gitlab-artifacts` on large-2, which supposedly will upload all the job.log files from the backup to the new ceph data
<bentiss> daniels: also if you have finished mirroring the artifacts (that are currently on minio-artifacts), feel free to change the globals.yaml settings with the new ceph storage, you can reuse freedesktop-prod-ceph-s3-key for the secret with the connection parameters
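(For reference, a sketch of the shape such a connection secret typically has with the GitLab chart; endpoint and keys are illustrative, the real values come from the rook user secret.)
    cat > connection.yaml <<'EOF'
    provider: AWS
    region: us-east-1
    host: rook-ceph-rgw-fdo-s3.rook-ceph.svc
    endpoint: http://rook-ceph-rgw-fdo-s3.rook-ceph.svc
    path_style: true
    aws_access_key_id: <AccessKey from the rook user secret>
    aws_secret_access_key: <SecretKey from the rook user secret>
    EOF
    kubectl create secret generic freedesktop-prod-ceph-s3-key --from-file=connection=connection.yaml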
aaronp has quit [Ping timeout: 480 seconds]
danvet has quit [Ping timeout: 480 seconds]
jpnurmi has joined #freedesktop
jstein has joined #freedesktop
ximion has joined #freedesktop
chomwitt has quit [Ping timeout: 480 seconds]
jstein has quit [Ping timeout: 480 seconds]
SanchayanMaity has quit [Remote host closed the connection]
austriancoder has quit [Read error: Connection reset by peer]
austriancoder has joined #freedesktop
SanchayanMaity has joined #freedesktop
aaronp has joined #freedesktop
aaronp has quit [Remote host closed the connection]