<bentiss>
oh well, I really need to move away from minio cluster: we are getting a new wave of 500 errors while uploading artifacts
<bentiss>
daniels: FWIW, backup of the old minio-artifacts done, will kill the EBS attached to it
<daniels>
ooh, just in time
<bentiss>
the problem is: given that there are no more transfers, why are we still getting 500????
<bentiss>
(besides minio-cluster not being something we should use)
blue__penquin has joined #freedesktop
<tomeu>
I have noticed that when we are getting 500s on artifacts upload, jobs on x86 runners take a long time to get picked up
ximion1 has joined #freedesktop
ximion has quit [Read error: Connection reset by peer]
<daniels>
bentiss: hmm yeah, I was seeing when things were bad yesterday that we'd get a 'Job succeeded' message in the trace from runners, without the job actually being marked as succeeded; about 5min later there'd be a second 'job succeeded' message and it would finally be marked as successful
<daniels>
is there anything I can do to help with MinIO?
<bentiss>
daniels: I am scratching my head on how to move the data without spinning up too many s1.large machines
<bentiss>
cause right now I only have one node with hdds in ceph, and I need 3
<bentiss>
Ideally, I'd like to convert large-4 and large-3 into ceph, but for that I either need to scale down the md array, or migrate its data elsewhere
<bentiss>
actually, for large-4, it should be doable to scale down the md array
<bentiss>
only 109 GB used on a 22 TB array, scaling down the fs should be doable
<daniels>
can xfs shrink online though?
<daniels>
I thought we'd need to take it offline
<bentiss>
damn... that's what I am looking for
<bentiss>
daniels: and I am sure you are going to tell me ext4 is capable of online shrinking?
<bentiss>
looks like it's the same problem
<daniels>
heh
<daniels>
yeah, I haven't seen anything so far ... :(
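For reference, a rough sketch of what the offline path would look like, assuming the md array carries an ext4 filesystem (device names, mount points and sizes below are placeholders): XFS cannot be shrunk at all, and ext4 can only be shrunk while unmounted.

```
# offline shrink sketch -- placeholder device/mount names and sizes
umount /mnt/artifacts
e2fsck -f /dev/md0                        # fsck is mandatory before an ext4 shrink
resize2fs /dev/md0 500G                   # shrink the filesystem well below the target array size
mdadm --grow /dev/md0 --array-size=1024G  # clamp the usable array size above the fs size
# note: --array-size alone is not persistent; a real shrink would also need the
# component device sizes reduced and a reshape
mount /dev/md0 /mnt/artifacts
```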
<bentiss>
daniels: question: do we care about the data (artifacts) we have on large-1, or do we just consider that week lost?
<daniels>
just the last week, right? if so, I think it's ok to burn them to be honest, if it makes it easier to move quicker
<bentiss>
cause I would gladly kill that machine and replace it with a new one for ceph
<bentiss>
ack, when migrating to ceph, we'll try to get all the job logs from the backup (so without last week), and sync with the data from the current minio-cluster
* daniels
nods
<bentiss>
daniels: how does that sound: 1. I spin up 2 s1.large to get the quorum for hdd, 2. I kill large-1, 3. create the ceph cluster with data backed by hdd, 4. create the bucket for pages, 5. migrate the pages, 6. test
<bentiss>
7. kill large-4
<bentiss>
arf, large-4 is also used for the artifacts
<daniels>
heh
<daniels>
I think a couple of days or so would be fine for the transition
<bentiss>
ok
<bentiss>
the one thing that might be a little bit annoying is that the bucket name has to be generated by rook, and has a UUID in it
<bentiss>
and the credentials are also generated
<bentiss>
well, *maybe* we can create an admin user that can create 'normal buckets'
<daniels>
wait, rook takes over minio config ... ?
<bentiss>
with rook, you can manage buckets as k8s objects
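To illustrate, a minimal sketch of an ObjectBucketClaim as rook provisions it (the claim name, namespace and storage class below are made up); the generated bucket name is where the UUID mentioned above comes from:

```
# sketch: a bucket managed as a Kubernetes object via rook's bucket provisioner
cat <<EOF | kubectl apply -f -
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: gitlab-artifacts
  namespace: rook-ceph
spec:
  generateBucketName: fdo-gitlab-artifacts   # rook appends a generated UUID suffix
  storageClassName: rook-ceph-bucket
EOF
# rook then creates a ConfigMap and Secret of the same name holding the
# resulting bucket name, endpoint and credentials
```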
<daniels>
shadeslayer: none of your tests are failing
<daniels>
shadeslayer: it only stores screenshots on failure, not success, which is good because ... storing a new screenshot in artifacts for every single trace on every single pipeline would overwhelm our storage pretty quickly
<bentiss>
it's not like it's in a perfect shape either :) (our storage I mean) ;)
<shadeslayer>
ahhhhh .... The documentation doesn't mention any of that
<daniels>
I mean, just that a630 job is 348MB of trace screenshots ...
<bentiss>
ouch..
<daniels>
srsly
<bentiss>
can we disable that at the instance level????
<daniels>
so yeah, please don't upload unless they're actually different to expected :P
<daniels>
bentiss: shrug, screenshot-in-JUnit doesn't do anything in and of itself
<daniels>
bentiss: all it does is inline an artifact when you view MR test reports
<daniels>
and I don't think we can disable artifacts at the instance level :P
<bentiss>
arf
<shadeslayer>
Yeah, archiving the screenshots isn't new
<bentiss>
anyway, so I got things progressing: large-1 is out, large-2..4 are not holding any ceph data anymore, and large-5..7 are in and ready
<bentiss>
daniels: ^^
<daniels>
bentiss: ooh, exciting - is there anything I can help with?
<bentiss>
so far, I managed to get a fdo-s3 object storage pool up (with data on hdd only), and understood that by creating a user, we can have the bucket names we wish
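A sketch of that user-based approach, assuming rook's CephObjectStoreUser CRD (the store name fdo-s3 and the resulting secret name come from this discussion; the user name is inferred from that secret name):

```
# sketch: a dedicated object store user, so buckets can be created with chosen names
cat <<EOF | kubectl apply -f -
apiVersion: ceph.rook.io/v1
kind: CephObjectStoreUser
metadata:
  name: gitlab
  namespace: rook-ceph
spec:
  store: fdo-s3
  displayName: "GitLab object storage user"
EOF
# rook writes the user's S3 keys into a secret named
# rook-ceph-object-user-<store>-<user>, i.e. rook-ceph-object-user-fdo-s3-gitlab
```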
<daniels>
shadeslayer: but archiving them on pass is new, right? the last time I looked at it, we were only archiving on failure
<bentiss>
daniels: I'll get my first covid shot in ~1h, so not sure I'll get much further
<daniels>
bentiss: oooh exciting! hope it goes smoothly for you. I'll have a look into what's on the cluster and see if I can progress towards having something we can use
<bentiss>
daniels: so if you can build up the 2 helmify-kustomize, that would speed things up
<bentiss>
daniels: the other option is you start migrating the data
<bentiss>
the accesskeys are in kubectl -n rook-ceph get secret rook-ceph-object-user-fdo-s3-gitlab
<shadeslayer>
daniels: not as far as I can tell
<bentiss>
and the service IP is at rook-ceph-rgw-fdo-s3.rook-ceph.svc
<bentiss>
daniels: FWIW, on server-2, mc is configured and has an alias 'test-ceph'
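Roughly how that alias can be wired up from the secret and service mentioned above (a sketch; the RGW port depends on the CephObjectStore gateway config, port 80 is assumed here):

```
# pull the S3 keys out of the rook user secret and register the mc alias
ACCESS_KEY=$(kubectl -n rook-ceph get secret rook-ceph-object-user-fdo-s3-gitlab \
  -o jsonpath='{.data.AccessKey}' | base64 -d)
SECRET_KEY=$(kubectl -n rook-ceph get secret rook-ceph-object-user-fdo-s3-gitlab \
  -o jsonpath='{.data.SecretKey}' | base64 -d)
mc alias set test-ceph http://rook-ceph-rgw-fdo-s3.rook-ceph.svc "$ACCESS_KEY" "$SECRET_KEY"
```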
<daniels>
shadeslayer: uhhhh ... can we please fix that urgently
<daniels>
bentiss: heh right, I think I might start doing the data migration from minio-artifacts + minio-pages first so that's going in the background whilst I try to understand how to write my first kustomize :P
<shadeslayer>
Sure, I can look into it
<bentiss>
daniels: I just started `mc mirror --watch minio-pages test-ceph` here, should be finished quickly enough
<daniels>
bentiss: awesome, I can do artifacts + old-minio then?
<daniels>
shadeslayer: thankyou :)
<bentiss>
daniels: for kustomize, I usually just grab vector-kustomize I think, that one has just the basic stuff in it
* daniels
nods
<bentiss>
the mirroring of pages was *quick*
<bentiss>
like I thought it stalled, but nope, it was just done :)
<bentiss>
daniels: the other thing to do is to change the pages secret, deploy it on packet-HA, and see if gitlab can get the data out of it
<bentiss>
oh, and yes, please start artifacts (old minio is going to be tricky IMO)
<shadeslayer>
daniels: uh, so the artifacts from the a630-traces job is only 35MB for me locally
<daniels>
shadeslayer: huh weird, I was just extrapolating from the 0ad example being 5.99MB in and of itself
<shadeslayer>
I guess it depends on the resolution the original trace was captured in?
<daniels>
true true
<shadeslayer>
the 0ad trace is captured in 1440p q.q
<shadeslayer>
daniels: it's weird that there are captured images because the piglit-traces-test job only artifacts on failure, and the a630-traces job derives from that (indirectly)
* bentiss
is afk fwiw
<daniels>
bentiss: good luck!
<daniels>
shadeslayer: hmm yeah that is odd ... but it would also be good if, even when the job fails, we only stored the failed images per trace
<daniels>
not store every single trace image if only one failed
<tanty>
diff previews have not been working for a while in gitlab today ...
<daniels>
bentiss: wow, you're back quick - I've only just got out of a call after quickly having lunch, so starting now
<bentiss>
daniels: OK, I am working on testing packet-HA with the new pages bucket FWIW
<bentiss>
but I'm hitting the `no matches for kind "Issuer" in version "certmanager.k8s.io/v1alpha1"` error
<bentiss>
I'll just re-enable cert-manager I think
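A quick way to check which cert-manager API group the cluster actually serves (a sketch; `certmanager.k8s.io/v1alpha1` is the pre-0.11 group, current releases serve `cert-manager.io`):

```
kubectl api-versions | grep -i cert      # which API groups are registered
kubectl get crd | grep -i issuer         # which Issuer CRDs exist, and under which group
```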
mynacol has joined #freedesktop
<bentiss>
daniels: before starting the sync between minio-artifacts and ceph, we should delete the fdo-gitlab-pages/ bucket in it
<bentiss>
OK, it's empty now
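For the record, the emptying step can be done with mc (the alias name for the old minio-artifacts deployment is assumed here):

```
# empty the stray pages bucket before mirroring artifacts
mc rm --recursive --force minio-artifacts/fdo-gitlab-pages
# or drop the bucket entirely
mc rb --force minio-artifacts/fdo-gitlab-pages
```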
asimiklit has joined #freedesktop
<daniels>
bentiss: cool, it's going now - and using 5G rather than cable since that's less awful
<bentiss>
daniels: using ceph as a page source is working fine, I am applying the config to the old cluster and will deploy a pages site to check if everything is still fine
<bentiss>
daniels: ssh to server-2 and do the mc mirror from there
<bentiss>
so the data stays on packet all the time, no?
* bentiss
is not sure why 5G would help here
<bentiss>
502 expected in the next few minutes
ttt has joined #freedesktop
<daniels>
bentiss: ah yes, I forgot that we had kube creds on server-2
chomwitt has joined #freedesktop
<bentiss>
daniels: mc is already configured for both normally
<daniels>
ah, I'd not realised that
<daniels>
wow is it ever slow though - like a minute-long stall after every new file?
<bentiss>
daniels: well, it first processes all the files, then starts them as a batch
<bentiss>
and given how many files there should be... it can take some time at the beginning before it kicks in
<bentiss>
ok, pages deployment validated \o/
<daniels>
woohoo
ttt has quit []
<bentiss>
daniels: should I delete *minio*-pages now?
<daniels>
sorry, was just trying to figure out why mc mirror was taking much longer than before to spin up
<daniels>
erm yeah, might as well nuke it if the pages daemon on the old cluster is already pointing at ceph?
<bentiss>
yep
<bentiss>
and ok!
<daniels>
thanks :)
asimiklit has quit [Quit: Page closed]
mynacol has quit []
mynacol[m] has left #freedesktop [#freedesktop]
* bentiss
starts taking care of the backups
Guest205 is now known as blue_penquin
asimiklit has joined #freedesktop
<bentiss>
hmm, looks like the policy for removing files older than 7 days did not kick in...
<bentiss>
anyway
<bentiss>
there was one legalhold, but 1621574601_2021_05_21_13.11.2_gitlab_backup.tar should have been removed :(
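A sketch of how to check why the expiry rule didn't fire (the bucket name is a placeholder and the exact flags depend on the mc version; the tarball name is the one above):

```
# list the lifecycle rules and (re)add the 7-day expiry
mc ilm ls test-ceph/fdo-gitlab-backups
mc ilm add --expiry-days "7" test-ceph/fdo-gitlab-backups
# objects under legal hold are skipped by expiry; inspect and clear if needed
mc legalhold info test-ceph/fdo-gitlab-backups/1621574601_2021_05_21_13.11.2_gitlab_backup.tar
mc legalhold clear test-ceph/fdo-gitlab-backups/1621574601_2021_05_21_13.11.2_gitlab_backup.tar
```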
<asimiklit>
daniels: Hi, I would like to ask a question regarding account @asimiklit which was removed from gitlab.freedesktop.org last weekend. Are you aware of something regarding it?
<asimiklit>
daniels: Forgot to mention, my colleague forwarded me emails that indicate that you closed all my MRs last weekend so that is why I wrote you.
<asimiklit>
daniels: I am just trying to find out why that account was removed ...
<daniels>
asimiklit: oh my god, that wasn't deliberate but just a huge mistake. please accept my apologies. let me pull some backups and try to see what I can recreate
<bentiss>
sigh, copying today's backup got interrupted halfway through :/
<asimiklit>
daniels: Huh, at least it wasn't hacked as I expected) Don't worry too much, it's just an account, but if there is some possibility to recreate something, that would be great)
<bentiss>
daniels: oh, well, I'll deal with backups tomorrow I think
<daniels>
asimiklit: not hacked, just dumb - sorry
<daniels>
asimiklit: I'll see what I can get for you
<daniels>
bentiss: np, I'll be around all night so I'll shift the backups
<bentiss>
daniels: ok, thanks.
<bentiss>
maybe using s3cmd will have a better chance of success
<daniels>
bentiss: btw, any thoughts on what we should do about OPA? maybe a MinIO gateway just to do the policy? I couldn't see anything in Ceph objstore about policy callouts
<bentiss>
daniels: a Minio gateway is actually a very nice idea
<bentiss>
I was thinking of declaring a tenant for minio-packet, but the gateway is nice
<daniels>
ok, cool :)
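A sketch of that gateway idea, assuming MinIO's S3 gateway mode and its OPA policy hook (the credentials and the OPA endpoint below are placeholders; the RGW service name is the one from earlier):

```
# run MinIO as an S3 gateway in front of the ceph RGW so the existing
# OPA policy callout keeps working (all values here are placeholders)
export MINIO_ROOT_USER="<rgw-access-key>"
export MINIO_ROOT_PASSWORD="<rgw-secret-key>"
export MINIO_POLICY_OPA_URL="http://opa.opa.svc:8181/v1/data/httpapi/authz"
minio gateway s3 http://rook-ceph-rgw-fdo-s3.rook-ceph.svc
```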
<bentiss>
daniels: have you stopped mirroring artifacts?
<daniels>
bentiss: no?
<bentiss>
seems like everything stalls
<daniels>
though it has stalled ... yeah
<bentiss>
we are writing at 374 KiB/s, that's not good....
<bentiss>
anyway, got family coming by this evening, got to go afk
<bentiss>
good luck with the transfer!
<daniels>
have a fun night!
chomwitt has quit [Ping timeout: 480 seconds]
alanc has quit [Remote host closed the connection]
<daniels>
don't worry, it's going to fail on artifacts :P
<daniels>
but good to know, I didn't realise manually-triggered pipelines did that, thanks
<imirkin>
daniels: dunno if it's expected, but cgit is now refusing to accept connections (as opposed to hanging). dunno how long it's supposed to take to boot...
<daniels>
imirkin: it can take a little while
<imirkin>
kk
<daniels>
like 5-10min sometimes
<imirkin>
sounds good
<MrCooper>
daniels: they behave as if all files had been modified (same as pipelines for a newly created branch)
<daniels>
nice
<imirkin>
daniels: still not responding (but now hanging rather than connection refused)
<daniels>
progress!
<imirkin>
daniels: well, it's like 15 mins after you kicked it, so...
* daniels
applies a more stern hammer
<imirkin>
daniels: much better
<daniels>
\o/
chomwitt has joined #freedesktop
asimiklit has quit [Remote host closed the connection]
aaronp has joined #freedesktop
danvet has quit [Ping timeout: 480 seconds]
danvet has joined #freedesktop
<bentiss>
daniels: so it seems the mc mirror is hanging because it's done. I mean, it also copied everything in /tmp, which makes me think it's done
<daniels>
hmm, it's completed on the artifacts chart
<bentiss>
we have all the job logs since 2021_05_28
<daniels>
I did have to cordon + stop + restart + uncordon large-2, because it had gone into I/O death again
<bentiss>
chart?
<daniels>
*bucket
<bentiss>
oh...
<bentiss>
s3cmd seems way more efficient than mc.... 273 MiB/s to upload the old logs (well, it's using data from a regular dir too)
<daniels>
heh ...
<daniels>
mc mirror is doing 150MiB/sec here, but it's very very heavily biased by large vs. small files
<bentiss>
true
<bentiss>
OTOH, I hit ctrl-C to change the arguments (and use --progress), and it's hanging now :)
<bentiss>
FWIW, s3cmd has an '--include' arg which allows grabbing only the job.log files
<bentiss>
whereas mc only has --exclude
<bentiss>
actually, now that you mention it, maybe it hasn't started and the 273 MiB/s I saw was you :)
<bentiss>
while copying from the old minio to the new cluster, I had a bunch of errors when the body content length was 0, maybe that's the follow-up
<bentiss>
daniels: I think I'll go to bed. FWIW, I am running `s3cmd sync -v --progress --include job.log fdo-gitlab-artifacts/ s3://fdo-gitlab-artifacts` on large-2 which supposedly will upload all the job.log from the backup to the new ceph data
<bentiss>
daniels: also if you have finished mirroring the artifacts (that are currently on minio-artifacts), feel free to change the globals.yaml settings with the new ceph storage, you can reuse freedesktop-prod-ceph-s3-key for the secret with the connection parameters
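A sketch of what that connection secret could look like, following the GitLab chart's fog-style object storage format (the namespace and region are placeholders; the keys would come from the rook user secret, and the secret name is the one reused above):

```
# write the connection document and store it under the key the chart expects
cat > ceph-s3.yaml <<EOF
provider: AWS
region: us-east-1
aws_access_key_id: <AccessKey from rook-ceph-object-user-fdo-s3-gitlab>
aws_secret_access_key: <SecretKey from rook-ceph-object-user-fdo-s3-gitlab>
endpoint: "http://rook-ceph-rgw-fdo-s3.rook-ceph.svc"
path_style: true
EOF
kubectl -n gitlab create secret generic freedesktop-prod-ceph-s3-key \
  --from-file=connection=ceph-s3.yaml
```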
aaronp has quit [Ping timeout: 480 seconds]
danvet has quit [Ping timeout: 480 seconds]
jpnurmi has joined #freedesktop
jstein has joined #freedesktop
ximion has joined #freedesktop
chomwitt has quit [Ping timeout: 480 seconds]
jstein has quit [Ping timeout: 480 seconds]
SanchayanMaity has quit [Remote host closed the connection]
austriancoder has quit [Read error: Connection reset by peer]
austriancoder has joined #freedesktop
SanchayanMaity has joined #freedesktop
aaronp has joined #freedesktop
aaronp has quit [Remote host closed the connection]