KF5YFD has quit [Remote host closed the connection]
jarthur has quit [Ping timeout: 480 seconds]
gawen has joined #freedesktop
gawen has quit [Remote host closed the connection]
baldur has joined #freedesktop
baldur has quit [Remote host closed the connection]
noahhsmith[m|gr] has joined #freedesktop
noahhsmith[m|gr] has quit [Remote host closed the connection]
Quimby has joined #freedesktop
Quimby has quit [Remote host closed the connection]
belmoussaoui has joined #freedesktop
belmoussaoui has quit [Remote host closed the connection]
patrick4370 has joined #freedesktop
blue__penquin has joined #freedesktop
blue__penquin has quit []
dk657 has joined #freedesktop
dk657 has quit [Remote host closed the connection]
pendingchaos_ has joined #freedesktop
pendingchaos has quit [Ping timeout: 480 seconds]
thaytan has quit [Ping timeout: 480 seconds]
thaytan has joined #freedesktop
xedniv has joined #freedesktop
xedniv has quit [Remote host closed the connection]
<bentiss>
daniels: thanks for that! (sorry I was out most of yesterday)
ximion has quit []
CrtxReavr has joined #freedesktop
CrtxReavr has quit [Remote host closed the connection]
<bentiss>
daniels: I *just* realized that I may simply have busted the routing between the 2 clusters before: basically all the large agents could not talk to the old cluster, which might explain why we had timeouts :(
* bentiss
forgot to add the routes to the nodes not managed by kilo
jarthur has joined #freedesktop
haveo has joined #freedesktop
haveo has quit [Remote host closed the connection]
Cyrinux[m]1 has joined #freedesktop
Cyrinux[m]1 has quit [Remote host closed the connection]
perryflynn has joined #freedesktop
perryflynn has quit [Remote host closed the connection]
chomwitt has joined #freedesktop
chomwitt has quit [Remote host closed the connection]
sanehatter has quit [autokilled: Possible spambot. Mail support@oftc.net if you think this is in error. (2021-05-30 09:06:15)]
dcat has joined #freedesktop
dcat has quit [Remote host closed the connection]
nroberts has quit [Quit: Gxis!]
Travis17 has joined #freedesktop
Travis17 has quit [Remote host closed the connection]
nroberts has joined #freedesktop
danvet has joined #freedesktop
patrick4370 has quit [Quit: Quit: Powering down]
chomwitt has quit [Ping timeout: 480 seconds]
gholms26 has joined #freedesktop
gholms26 has quit [Remote host closed the connection]
caubert17 has joined #freedesktop
caubert17 has quit [Remote host closed the connection]
yuesbeez has joined #freedesktop
yuesbeez has quit [Remote host closed the connection]
lazywalker has joined #freedesktop
lazywalker has quit [Read error: Connection reset by peer]
fub has joined #freedesktop
fub has quit [autokilled: Possible spambot. Mail support@oftc.net if you think this is in error. (2021-05-30 10:30:19)]
mayab76 has joined #freedesktop
mayab76 has quit [autokilled: Possible spambot. Mail support@oftc.net if you think this is in error. (2021-05-30 10:49:14)]
pendingchaos_ has left #freedesktop [#freedesktop]
Mike[m]6 has joined #freedesktop
Mike[m]6 has quit [autokilled: Possible spambot. Mail support@oftc.net if you think this is in error. (2021-05-30 11:03:55)]
pendingchaos has joined #freedesktop
EvilSpork has joined #freedesktop
EvilSpork has quit [autokilled: Possible spambot. Mail support@oftc.net if you think this is in error. (2021-05-30 11:15:54)]
RaTTuS|BIG has joined #freedesktop
RaTTuS|BIG has quit [autokilled: Possible spambot. Mail support@oftc.net if you think this is in error. (2021-05-30 11:16:45)]
<daniels>
ooooh
Wug has joined #freedesktop
Wug has quit [autokilled: Possible spambot. Mail support@oftc.net if you think this is in error. (2021-05-30 11:32:24)]
docontherocks has joined #freedesktop
docontherocks has quit [Remote host closed the connection]
flesk_ has joined #freedesktop
flesk_ has quit [Remote host closed the connection]
<bentiss>
yes... ooooh (facepalm)
<bentiss>
daniels: I've given a lot of thought to minio and the alternatives, and I think we'd better use the object storage from ceph for artifacts/pages and backups
<bentiss>
because this one has the advantage of being much more resilient to reboots
<bentiss>
the minio instance I created for pages is only on fdo-large-4, which makes it useless to have a "minio cluster"
<bentiss>
daniels: unless you have a big no-no, I'll start the conversion (but I need to spin up some more large machines to get the quorum of HDD disks available)
<nirbheek>
You only need to set the +R via chanserv mlock if you've previously set -R using mlock
<daniels>
bentiss: yeah, I think that given we're using Ceph anyway, it makes sense to use it for this, and it lets us get storage resilience by Ceph's own sharding rather than having to also manage MinIO sharding
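For reference, pointing GitLab artifacts at a Ceph RGW bucket boils down to handing the chart an S3-compatible connection; a minimal sketch against the upstream gitlab chart, in which the bucket name, secret name and RGW endpoint are illustrative rather than the actual fdo values:

    # values excerpt (gitlab chart)
    global:
      appConfig:
        artifacts:
          bucket: gitlab-artifacts        # bucket pre-created on the RGW
          connection:
            secret: ceph-object-store     # hypothetical Secret holding the blob below
            key: connection

    # payload stored under the "connection" key of that Secret:
    provider: AWS                          # S3-compatible API, not actual AWS
    region: us-east-1                      # RGW accepts an arbitrary region
    endpoint: http://rgw.ceph.svc:7480     # hypothetical in-cluster RGW endpoint
    path_style: true                       # RGW buckets are typically addressed path-style
    aws_access_key_id: <rgw-access-key>
    aws_secret_access_key: <rgw-secret-key>
    aws_signature_version: 4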
* bentiss
nods
<bentiss>
daniels: I am currently trying to back up the old artifacts data on large-2, in case we need it. Not sure I'll get it backed up entirely before June 1st though
<bentiss>
daniels: and this morning I managed to kick all the elastic storages from the old cluster, except the artifact one (which is being backed up as mentioned above)
* daniels
nods
prompt has joined #freedesktop
prompt has quit [autokilled: Possible spambot. Mail support@oftc.net if you think this is in error. (2021-05-30 12:14:57)]
<daniels>
how much artifact storage do we have?
<bentiss>
~ 2TB on the old cluster
<bentiss>
at ~30 MiB/s, it takes quite some time
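For scale (assuming the quoted figures and that the full ~2 TB has to move at that rate): 2 TiB / 30 MiB/s = 2,097,152 MiB / 30 MiB/s ≈ 70,000 s ≈ 19 hours, i.e. the better part of a day even if the speed holds.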
<daniels>
heh!!
<bentiss>
(last time the backup ended up at ~15 MiB/s)
Yardanico23 has joined #freedesktop
Yardanico23 has quit [Remote host closed the connection]
<daniels>
yeah, when the XFS recovery was happening (mount /var/openebs/local) I was seeing a constant 16MiB/s read from the disks
<daniels>
which seems ... oddly slow
chomwitt has joined #freedesktop
russruss40 has joined #freedesktop
russruss40 has quit [Remote host closed the connection]
blue_penquin is now known as Guest205
<karolherbst>
ahh.. those are the admins I like: wanted to report an issue, but it's already being worked on :3
<daniels>
karolherbst: which issue?
<karolherbst>
the 404 one
<daniels>
which repo please?
<daniels>
or which URL, even
<daniels>
(I'm going through sorting all of them out but would be nice to double-check)
<daniels>
WARNING: Failed to pull image with policy "always": Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit (manager.go:205:0s)
<karolherbst>
:(
Vladi has joined #freedesktop
Vladi has quit [Remote host closed the connection]
<daniels>
blegh, Alpine doesn't have any mirrors outside the Docker Hub
moffa has quit [Remote host closed the connection]
<daniels>
bentiss: are you doing the move atm? Errno::ETIMEDOUT (Failed to open TCP connection to minio.minio-artifacts.svc:80 (Connection timed out - connect(2) for "minio.minio-artifacts.svc" port 80)):
<daniels>
getting lots of those from the server-1 cluster (i.e. where elastic currently points)
ximion has joined #freedesktop
Nafallo has joined #freedesktop
Nafallo has quit [Remote host closed the connection]
<bentiss>
daniels: yep, still copying at 30MiB/s, on large-2, where part of the artifacts are
<bentiss>
only 47 errors in the past 24h, I think we can bear some more days of recovery, until we switch to ceph object storage
* daniels
nods
Repentinus has joined #freedesktop
<daniels>
just wondering whether webservice -> minio connections timing out after 30sec (like, can't even establish a TCP conn) is expected or not
Repentinus has quit [Remote host closed the connection]
aleblanc has joined #freedesktop
aleblanc has quit [Remote host closed the connection]
blue__penquin has quit []
jrtc273 has joined #freedesktop
jrtc273 has quit [Remote host closed the connection]
Radon has joined #freedesktop
Radon has quit [Remote host closed the connection]
jeanluc has joined #freedesktop
jeanluc has quit [Remote host closed the connection]
laguneucl has joined #freedesktop
chomwitt has quit [Read error: No route to host]
laguneucl has quit [Ping timeout: 480 seconds]
alanc has quit [Remote host closed the connection]
alanc has joined #freedesktop
sdrodge has joined #freedesktop
sdrodge is now known as Guest224
Guest224 has quit [Remote host closed the connection]
ximion has quit []
ximion has joined #freedesktop
<bentiss>
unless flannel changed the peer secret key, we should not have timeouts
<bentiss>
well, that is assuming the sync operation to flush the data on the disk doesn't block the rest of the server
LambdaComplex has joined #freedesktop
LambdaComplex has quit [Remote host closed the connection]
<daniels>
heh
<daniels>
ok, I can try to take a look further
<daniels>
bentiss: was just going to upgrade to 13.12.1 whilst I'm here - the only real change is Ingress changing from extensions/v1beta1 to networking.k8s.io/v1, which is supported by our version of k3s - do you see any problem with that upgrade, e.g. any custom resources or patches?
<daniels>
(have helm-gitlab-config on the packet-ha branch, will deploy to both k3s-1 & k3s-ha envs)
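For context, the shape of that Ingress migration (a generic before/after sketch, not the actual chart templates; host, service name and port are illustrative):

    # before: extensions/v1beta1
    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: gitlab-webservice
    spec:
      rules:
        - host: gitlab.example.org
          http:
            paths:
              - path: /
                backend:
                  serviceName: gitlab-webservice
                  servicePort: 8181
    ---
    # after: networking.k8s.io/v1 (pathType now required, backend nested under service)
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: gitlab-webservice
    spec:
      rules:
        - host: gitlab.example.org
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: gitlab-webservice
                    port:
                      number: 8181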
<bentiss>
daniels: ack, but after the upgrade, you have to also manually deploy the services.yaml and services_ha.yaml with the correct IPs (gitaly-0..2)
fbt25 has joined #freedesktop
<daniels>
bentiss: so for e.g. new-cluster/gitlab-prod-gitaly-0, I need to get the svc IP of gitlab/gitlab-prod-gitaly-0 on the HA cluster, and update services.yaml with that IP, then deploy services.yaml to the old cluster, right?
<bentiss>
actually, praefect in ha will probably change too (you need to pick the pod IP in the gitaly/praefect case)
fbt25 has quit [Remote host closed the connection]
<daniels>
(and the reverse for services_ha.yaml, update it with the svc IPs from the _old_ cluster?)
<bentiss>
daniels: correct, with the exception that you need to take the *pod* IP; the service is headless (without an IP)
<daniels>
... for gitaly
<daniels>
but for minio it should point to the svc?
chomwitt has joined #freedesktop
<bentiss>
... and praefect in the old cluster...
<bentiss>
the svc IPs are static and should not change, so we are safe here
<bentiss>
actually, in HA, minio-artifacts-prod could be removed, we don't use it anymore
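In other words, the cross-cluster entries in services[_ha].yaml are presumably the classic "Service without a selector plus hand-maintained Endpoints" pattern: a headless Service whose Endpoints carry the pod IP from the other cluster for gitaly/praefect, versus the stable svc IP for minio. A minimal sketch of one gitaly entry, with the namespace, port and IP purely illustrative:

    apiVersion: v1
    kind: Service
    metadata:
      name: gitlab-prod-gitaly-0
      namespace: gitlab
    spec:
      clusterIP: None              # headless: DNS resolves straight to the addresses below
      ports:
        - port: 8075               # gitaly's usual port; illustrative
    ---
    apiVersion: v1
    kind: Endpoints
    metadata:
      name: gitlab-prod-gitaly-0   # must match the Service name exactly
      namespace: gitlab
    subsets:
      - addresses:
          - ip: 10.42.1.23         # illustrative pod IP in the other cluster; needs refreshing whenever that pod moves
        ports:
          - port: 8075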
<daniels>
hmmmm, praefect in the currently-committed services_ha.yaml points to an IP which isn't claimed by either of the praefect pods or the praefect svc on the old cluster?
<bentiss>
really? that might explain the issue with timeouts then
<daniels>
heh ...
<daniels>
HA Endpoint gitlab/gitlab-prod-praefect points to 10.42.2.120, on the old cluster the praefect svc has 10.42.1.144 and 10.42.2.52
<bentiss>
correct...
<bentiss>
so, praefect is not used in the ha config, we use only gitlab-prod-gitaly-no-replicas.old-cluster.svc.cluster.local which is supposed to point at a praefect pod
<bentiss>
(that's convoluted, I know)
<bentiss>
and indeed, I applied a new conf today while removing the "old" default praefect pod
<bentiss>
that should explain the issues with the timeouts
<daniels>
so sorry for the really dumb question, but just to be super sure I'm not going to do anything idiotic, I should update gitlab-prod-praefect.gitlab.svc in services_ha.yaml to 10.42.2.52 which is the pod IP of the old cluster's gitlab/gitlab-prod-gitaly-0 pod?
<bentiss>
the other way around
<daniels>
hm?
<bentiss>
see configs/packet-HA/globals.gotmpl -> we use gitlab-prod-gitaly-no-replicas.old-cluster.svc.cluster.local only for no-replicas
<bentiss>
so that's the one that needs update
<bentiss>
the problem is, if you talk to the gitaly pod, it tells you 'no-replica' doesn't exist, because praefect uses some internal naming
<bentiss>
so you need to update gitlab-prod-gitaly-no-replicas.old-cluster.svc.cluster.local to one of the praefect pods
chomwitt has quit [Ping timeout: 480 seconds]
<daniels>
so services_ha.yaml is getting applied to the _new_ cluster, and in that I'm pointing gitlab-prod-gitaly-no-replicas.old-cluster.svc to one of the gitlab-prod-praefect pod IPs from the _old_ cluster, right?
<bentiss>
correct :)
<bentiss>
it should have been just a temp thing... sorry
<daniels>
hehe, no prob, so I think what I'm doing is:
<daniels>
* upgrade on packet-ha
<daniels>
* upgrade on packet-old
<daniels>
* change packet-ha gitlab-prod-gitaly-no-replicas.old-cluster.svc endpoint to point to correct IP for gitlab-prod-praefect (on the old cluster)
<daniels>
* change services_ha.yaml gitlab-prod-gitaly-[012].new-cluster.svc endpoint to point to correct IP for each pod in the new cluster, and apply the same changes to services.yaml for the old cluster
<daniels>
* remove gitlab-prod-{praefect,gitaly-default-0}.old-cluster.svc in services_ha.yaml as they're unused?
karolherbst_ has joined #freedesktop
karolherbst is now known as Guest233
karolherbst_ is now known as karolherbst
Guest233 has quit [Ping timeout: 480 seconds]
karolherbst is now known as karolherbst_
karolherbst_ has quit []
karolherbst_ has joined #freedesktop
karolherbst_ has quit []
karolherbst has joined #freedesktop
<karolherbst>
okay.. can we move away from oftc? it seems to be a shitty network after all... *sigh*... so: if somebody has network issues and the connection drops, and the client automatically reconnects under a different nick at some point, you are stuck, as channels don't allow you to change your nick while moderated... *sigh* :( :( at least on freenode you could claim multiple nicknames, but on oftc you apparently can't
<karolherbst>
without creating multiple accounts
<karolherbst>
this is very frustrating
<karolherbst>
and the ghost command requires the pw as well
<karolherbst>
and you can't login into your account unless you changed your nick
<daniels>
karolherbst: /m nickserv help link
<daniels>
you can also use CertFP to automatically identify without a password
<karolherbst>
ehhhh
<karolherbst>
why didn't I see that before? :(
karolherbst is now known as karolherbst_
karolherbst_ is now known as karolherbst
<karolherbst>
ahhhh :/
<karolherbst>
that doesn't work as nicely :(
<karolherbst>
daniels: so apparently both nicks have to be registered before you can even use link :(
<karolherbst>
this doesn't make any sense..
karolherbst is now known as karolherbst_
<daniels>
it's ... not ideal
<daniels>
they're working on moving to another ircd/services platform which is much more similar to Freenode
<daniels>
(this was one of the things which precipitated all the current drama I believe, as FN+OFTC staff were collaborating on the same ircd)
karolherbst_ is now known as karolherbst
<karolherbst>
ahh
karolherbst is now known as kherbst
kherbst is now known as karolherbst
<karolherbst>
I am not sure if you can even reuse the same email address for all nicknames :D oh well...
<daniels>
bentiss: sorry for the barrage of novice questions btw - I think it's partly because I've missed a lot of stuff recently, partly because the transitional stuff is somewhat unclear, but also in quite large part because my brain still isn't anywhere near full capacity (and am pretty sure I'll be asleep on the sofa again by about 8pm ...)
<karolherbst>
this is just super annoying :/
<karolherbst>
I don't even understand why the network won't allow you to rename yourself just because a channel requires registered users to speak
<daniels>
because you can use it to spam channels you're moderated in, constantly renaming yourself to 'speak' through nick changes
<karolherbst>
ohh sure, but I am not moderated, everybody unregistered is
<karolherbst>
so I couldn't change my nick, because random channels don't allow unregistered ones to speak :)
<karolherbst>
but it seems like an identify command helps here
<karolherbst>
just normally you have this all automated and you don't have your pw at hand, so this is all very annoying
<daniels>
I'm not disagreeing with you btw
<daniels>
just trying to explain some of the context behind it
<daniels>
not claiming that the current semantics are perfect :)
<karolherbst>
ohh, I know, I am mostly venting I think.. it just makes perfect sense why freenode was the biggest network honestly
<bentiss>
daniels: I think you are correct in your understanding
<bentiss>
daniels: I am not so sure it's novice questions, but more that I did things in my head and probably did not share enough with you while dealing with it
<daniels>
bentiss: no prob, I don't blame you, and doing it with haste was definitely better than trying to get me to understand at every point! thanks again :)
* daniels
embarks on the voyage
<daniels>
please prepare for a bumpy half-hour or so on GitLab; fasten your seatbelts and do not move about the cabin
<bentiss>
daniels: maybe upgrade kubectl, helm and helmfile?
<daniels>
bentiss: oh no, it was my fault, just not doing the v1beta1 -> v1 Ingress upgrade properly for the custom resources
chomwitt has joined #freedesktop
<bentiss>
oh, ok
<daniels>
bentiss: eek, failing to actually apply the chart to packet-HA, because even though certmgr has been disabled in the config, it hasn't actually been disabled on the cluster?
<daniels>
e.g. helm diff shows us removing all the existing certmgr objects
<daniels>
is something half-applied somewhere?
<bentiss>
daniels: oops, yes
<bentiss>
but forget about cert-manager, it fails at renewing the certs on HA, because it doesn't have the correct IP
<bentiss>
just remove it, it's fine
<bentiss>
we'll re-add it once we change the load balancer IP
<daniels>
mmm, but removing it gives Error: Failed to render chart: exit status 1: Error: unable to build kubernetes objects from release manifest: unable to recognize "": no matches for kind "Issuer" in version "certmanager.k8s.io/v1alpha1"
<daniels>
when building charts/freedesktop
<daniels>
ah, I think I need to make it conditional from pages ingress
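i.e. roughly wrapping the Issuer in a chart-level switch so it only renders when cert-manager is actually present; a sketch with a hypothetical values flag, the existing resource body left untouched:

    {{- if .Values.certmanager.enabled }}   # hypothetical flag, off while cert-manager is removed
    apiVersion: certmanager.k8s.io/v1alpha1
    kind: Issuer
    metadata:
      name: pages-issuer                    # illustrative name
    spec:
      # ... existing issuer body, unchanged ...
    {{- end }}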
chomwitt has quit [Ping timeout: 480 seconds]
<bentiss>
daniels: worst case, just locally remove the issuer, apply the chart, and we'll deal with it when we re-enable cert-manager
<bentiss>
daniels: do you want me to set up the various services[-ha].yaml?
danvet has quit [Ping timeout: 480 seconds]
<bentiss>
daniels: I've locally edited the 2 services files and fired them off so that the old cluster is able to talk to the gitaly-0..2 pods
<daniels>
yep, just need to apply to the old cluster now so the migrations take effect
<bentiss>
BTW, to check if the gitaly config is OK, I go to https://gitlab.freedesktop.org/admin/gitaly_servers -> no loading time, all check boxes green -> we are good; if it takes a while to load, at least one gitaly pod doesn't have a correct IP
<daniels>
oooh nice, thanks
<bentiss>
and in case you wonder, I kept 'default' as an indirection to gitaly-0, because we "have" to have a default
<bentiss>
but I think I prefer just having gitaly-N
<bentiss>
makes things easier to debug
<bentiss>
I was thinking of having something like gitaly-0 -> mesa and forks; gitaly-1 -> gstreamer and forks; gitaly-2 -> drm and forks
<bentiss>
and the rest as it wants :)
<daniels>
yeah, that makes sense indeed!
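If that sharding goes ahead, the storage list would presumably be expressed via the chart's internal gitaly names (the key path below is the upstream gitlab chart's; the names and the project-to-storage mapping, done later through the admin UI/API, are just the proposal above):

    global:
      gitaly:
        internal:
          names:
            - gitaly-0    # e.g. mesa and forks
            - gitaly-1    # e.g. gstreamer and forks
            - gitaly-2    # e.g. drm and forks
            # plus whatever entry keeps backing the required 'default' storage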
* daniels
crosses fingers, applies to old cluster
* bentiss
takes the popcorn
* bentiss
is glad the backup of the old artifact storage is still at the nominal 30 MiB/s
<bentiss>
when migrating to the minio cluster, the speed slowly dropped to a tedious 10 MiB/s in the end
<daniels>
bentiss: did you apply the IP changes for the new cluster?
<daniels>
we have no gitaly online atm :P
<bentiss>
daniels: they are online, but with a version webservice is not capable of understanding
<bentiss>
we need to wait for the migration to finish, so the webservice pods can kick up
<daniels>
yeah, just watching it push through now
<daniels>
works \o/
<bentiss>
congrats! <3
<daniels>
to you, rather
<daniels>
I was just a very latent remote keyboard proxy :P
<bentiss>
heh
<bentiss>
have you updated the services_ha.yaml?
<daniels>
artifact upload for mesa now works perfectly \o/
<daniels>
before it took a few minutes to time out ...
<daniels>
could you please push the changes you already made to the services yaml?
<daniels>
then I can apply on top and not revert anything
<bentiss>
sure
<bentiss>
daniels: done
<daniels>
bentiss: oh, should services_ha:gitlab/gitlab-prod-praefect be pointing to gitlab-prod-praefect-0 on the old cluster, or on the new cluster?
<daniels>
I have no updates required on services.yaml, and services_ha.yaml needs gitlab-prod-gitaly-{no-replicas,default-0} updated to point to the old cluster's gitlab-prod-praefect-0 pod IP
<bentiss>
daniels: neither, we don't use it, so meh... :)
<daniels>
heh
<daniels>
so if that's right I'll apply that
<bentiss>
yep to the second part (updating services_ha.yaml)
<bentiss>
daniels: it's important to get that part right too because: 1. we might have delays if the node tries to access no-replicas, and 2. backups are done on the new cluster, so if it's not there, we won't have backups of the repos there
* daniels
nods
<bentiss>
OK, so tomorrow I'll remove the EBS and minio-artifacts on the _old_ cluster, then start moving around the git repos to the new cluster
<bentiss>
once that is done, we will have the db to move, redis, and all the other applications :/
* bentiss
waves and goes to bed
<daniels>
wheeee
<daniels>
'night! thanks a lot for all the patience and help :)
<bentiss>
well, thanks for the hard work. I was able to put some hardwood floor in my office meanwhile :)
<bentiss>
almost finished!
<daniels>
nicely done! I got a bit more done on the garden today \o/
<daniels>
it started off the day looking like our multi-cluster configuration :P
<daniels>
(TIL: pages:deploy will refuse to deploy if the SHA is behind the ref HEAD)
<daniels>
karolherbst: woohoo, docs.mesa3d.org is back