2025-07-01 06:27:51 armhf and s390x gitlab shared-runners are failing to pull registry.alpinelinux.org/alpine/infra/docker/alpine-gitlab-ci:latest https://gitlab.alpinelinux.org/lotheac/aports/-/jobs/1915804 2025-07-01 10:42:43 seems to have recovered now. 2025-07-01 10:43:51 Yeah, I have just fixed it, thanks 2025-07-01 10:47:19 good timing :) 2025-07-01 10:47:31 thanks for fixing 2025-07-01 13:45:26 hello, could someone check on the edge loongarch64 builder? it may be stuck? 2025-07-01 14:13:53 im on it 2025-07-01 15:07:39 thanks! 2025-07-01 18:48:46 meow how does security.a.o get its branches? 3.22 is not yet fetched by it 2025-07-01 18:50:10 https://gitlab.alpinelinux.org/alpine/infra/docker/secfixes-tracker/-/blob/master/config/prod.settings.py 2025-07-01 19:01:27 ah 2025-07-01 19:01:28 https://gitlab.alpinelinux.org/alpine/infra/docker/secfixes-tracker/-/merge_requests/13 2025-07-01 19:21:38 achill: thanks! I deployed it now 2025-07-01 19:35:33 thank you! 2025-07-02 08:08:12 any reason it does not use releases.json? 2025-07-02 15:38:53 does somebody know why the log of the build failure on build-edge-aarch64 is 404? https://build.alpinelinux.org/buildlogs/build-edge-aarch64/community/py3-pytest-qt/py3-pytest-qt-4.5.0-r0.log 2025-07-02 19:58:39 lotheac: Testing your MR. It creates a runner, but the pod fails to run because (I guess the operator) set runAsNonRoot true 2025-07-02 19:58:41 Error: container has runAsNonRoot and image will run as root 2025-07-03 00:57:31 ikke: you can add pod.spec.containers.securityContext.runAsUser & runAsGroup to specify the uid/gid to run as 2025-07-03 00:58:01 (or remove the runAsNonRoot if you want to run as root) 2025-07-03 02:00:45 i suppose by default the operator expects an image that sets default uid to non-zero 2025-07-03 02:00:57 let me add a patch to that 2025-07-03 02:08:00 ikke: added patch to the runner.spec.podSpec to set runAsNonRoot=false. 
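[editor's note] The two fixes discussed here map onto the standard Kubernetes securityContext fields; a minimal sketch (container name and image are illustrative, not taken from the actual Runner object):

```yaml
# Option A: keep the runAsNonRoot check, but force a non-zero uid/gid
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: runner                    # illustrative name
      image: registry.gitlab.com/gitlab-org/gitlab-runner:alpine-bleeding
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000               # uid to run the entrypoint as
        runAsGroup: 1000
# Option B: drop the check and let the image keep running as root
#      securityContext:
#        runAsNonRoot: false
```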
(not sure if the builds can handle running as non-root in the container -- if so we could do it the other way) 2025-07-03 03:57:16 also added a commit to reduce the privilege level of the controller-manager to be namespace-scoped 2025-07-03 13:53:52 could someone maybe check on the edge builders' logs? they're accessible, but logs for new builds don't seem to be updated at the moment? 2025-07-03 13:58:39 e.g. there is a package rebuilding today, but the log timestamp is from 01 Jul 2025 2025-07-03 15:06:27 distfiles disk was full, cleaning up 2025-07-03 15:34:08 thanks :) 2025-07-03 15:39:11 achill: ^ 2025-07-03 15:39:19 ohhh nice thanks! 2025-07-03 15:40:25 in other unrelated news, chromium source archives are huge 2025-07-03 15:45:20 surprise.... 2025-07-03 15:46:32 chromium everything is huge. source, build times, final packages, memory usage, build time memory usage, debug builds, 2025-07-03 21:35:54 uhhh now it looks like the websocket server of https://build.alpinelinux.org/ is down 2025-07-04 05:19:44 achill: fixed 2025-07-04 05:31:52 lotheac: thanks, note that it's currently the runner itself that apparently tries to run as root, not the build / helper image yet. 2025-07-04 05:33:28 docker run -it --rm --entrypoint /bin/sh registry.gitlab.com/gitlab-org/gitlab-runner:alpine-bleeding -c 'id -u' 2025-07-04 05:33:30 -> 0 2025-07-04 05:34:09 So strange that it sets runAsNonRoot to true while the image runs as root 2025-07-04 06:02:55 ikke I'm consistently getting HTTP 500 from GitLab when creating MR. is it just me? 2025-07-04 06:13:17 rnalrd: do you have up to date master branch in your fork? 
2025-07-04 06:21:07 rebase from master branch 2025-07-04 06:32:31 ikke: oh huh, that is strange indeed 2025-07-04 06:34:06 lotheac: the fix is still the same 2025-07-04 06:42:31 yep pj 2025-07-04 06:45:16 ikke: for sure it is, but if their default image uses root as well then how is it working for anybody else either :thinking: 2025-07-04 06:51:15 Yes, same question here 2025-07-04 06:54:05 https://gitlab.com/gitlab-org/gl-openshift/gitlab-runner-operator/-/issues/238 2025-07-04 06:54:49 i guess the answer is "it's not" :) 2025-07-04 06:56:52 anyway good point about this not being the actual build image. maybe we could just run it as some other uid 2025-07-04 06:58:13 as in... patch runAsUser+runAsGroup instead of runAsNonRoot 2025-07-04 07:00:30 Note that the default helper image does assume root and hence our helper image as well 2025-07-04 07:13:52 sure. that’s probably fine (but i haven’t been able to verify what kind of pods the runner will actually create) 2025-07-04 07:15:44 I'll be testing that next 2025-07-04 10:26:39 lotheac: https://gitlab.alpinelinux.org/sertonix/aports/-/jobs/1919841 2025-07-04 14:34:48 ikke: can’t look very closely today, but when i looked at the operator i think it defaulted to assuming the builder pods would be in the same namespace as the runner pods 2025-07-04 14:36:06 (”gitlab-runner-system” by default iirc). 
if that is different, we need to modify rbac so it can do stuff in gitlab-ci ns as well 2025-07-04 14:36:44 s/we need/we probably need/ 2025-07-05 03:33:39 right, looks like i specified namespace: gitlab-ci in the runner objs 2025-07-05 03:42:31 i think the quickest fix is running the builders in gitlab-runner-system 2025-07-05 03:43:33 from a quick glance, i think maybe the operator tries to create a role in the target namespace for this purpose -- i think that's why it wanted the wide-open ClusterRole you noticed 2025-07-05 03:44:26 don't know how it would behave with a precreated role in the target ns -- or with a role in the target ns that allows it to create more roles there -- so i figure it's easier to get it running at all by just shoving it into gitlab-runner-system for now 2025-07-05 03:45:11 (ie. the same ns as the Runner objects) 2025-07-05 03:46:16 made that change in my MR 2025-07-05 20:19:01 seems we got some spammers on gitlab 2025-07-05 20:19:10 https://gitlab.alpinelinux.org/-/snippets/265#note_522606 2025-07-05 20:19:36 ptrc: constantly, I'm cleaning most of it up 2025-07-09 08:00:20 aarch64 CI runner che-ci-1 shared-runner gitlab-runner-aarch64.che-ci-1 seems to not have disk space left 2025-07-09 08:00:22 https://gitlab.alpinelinux.org/alpine/aports/-/jobs/1926827 2025-07-10 20:40:40 gitlab runner appears to checkout git repo in a weird/broken way? https://gitlab.alpinelinux.org/alpine/ca-certificates/-/merge_requests/17 2025-07-10 20:44:23 https://gitlab.alpinelinux.org/alpine/ca-certificates/-/jobs/1929998 2025-07-10 20:52:07 ncopa: the issue has been fixed, but now it's actually failing 2025-07-10 20:53:06 why is the directory permissions wrong? 2025-07-10 20:53:12 why is it checked out as root? 2025-07-10 20:53:25 or can we change the permissions of the dir? 2025-07-10 20:55:00 ncopa: that's how gitlab ci works in docker. 
They use a helper image to do all the git work, and that runs as root 2025-07-10 20:56:06 The alternative is assuming everything runs as uid 1000 2025-07-11 14:36:44 do you guys find gitlab to be becoming increasingly bloated with recent updates? 2025-07-11 14:37:04 specifically w.r.t. memory usage, but also cpu 2025-07-11 14:39:53 Has been rather stable for us (aside from load generated by scrapers) 2025-07-11 14:41:21 We still use the same server as we started with (same specs) 2025-07-11 14:42:39 what's your approximate memory footprint for gitlab? 2025-07-11 14:46:41 ~10-12GB with spikes. That's with all components on a single server 2025-07-11 14:48:38 strange. we have two independent gitlab instances that are using about 30GB at startup, and usually up to 55GB each. obviously not all of it is actively used but I don't have a clear picture of why. 2025-07-11 14:58:21 What processes are taking up the memory? 2025-07-11 15:10:07 Cool 2025-07-11 15:10:22 ncopa: clandmeter it's back :) 2025-07-11 15:12:18 nice \o/ 2025-07-11 15:14:44 Nice 2025-07-11 15:15:23 server was rebooted yesterday 2025-07-11 15:15:43 The date was behind 2025-07-11 15:15:50 Sat Apr 12 16:09:37 UTC 2025 2025-07-11 16:03:38 are we back? 2025-07-11 17:02:03 ikke: puma cluster worker / gitlab-puma-worker 2025-07-11 17:02:08 there are a lot of them 2025-07-11 17:38:38 You should tune that based on how much memory you have / want to use 2025-07-11 17:38:47 We only have 6 2025-07-11 18:40:49 ikke: how is the arm builder doing? 2025-07-11 18:41:50 nu_: doing fine again. I did have to enable chrony to fix the time 2025-07-11 18:45:41 nu_: the server was rebooted apparently 2025-07-13 10:27:03 hi 2025-07-13 10:28:05 hello 2025-07-13 10:28:23 wsap 2025-07-13 16:42:17 ikke: yea, there was a power outage, and apparently an ups was faulty 2025-07-13 16:42:36 aha 2025-07-13 16:44:23 ikke: would small interruptions in the coming hours be fine? 
2025-07-13 16:45:12 yeah, should be fine 2025-07-13 16:45:20 just the routing. im reworking them to handle outages better 2025-07-13 16:45:39 yeah, that's no problem 2025-07-13 16:45:43 ty! 2025-07-13 21:03:19 Are the armhf/armv7 builders down? 2025-07-13 21:05:14 builders appear to be up 2025-07-13 21:05:41 mqtt-exec wasn't running 2025-07-13 21:07:43 Thanks! 2025-07-14 07:39:21 arm builder is back, sorry! o.o 2025-07-14 07:40:13 Thanks! 2025-07-14 07:40:36 ive done many improvements to it yesterday, but forgot to plug the ups in.. very silly 2025-07-14 07:42:21 the plug is mandatory 2025-07-14 07:42:25 ;-) 2025-07-14 07:42:45 i just wanted to be sure of it :p 2025-07-14 07:43:40 how long does the ups last? 2025-07-14 07:43:53 your area is prone to network disconnects? 2025-07-14 07:52:07 lately yes, but its only because the building management is quirky 2025-07-14 07:52:42 ups lasts about 8 minutes, and the router possibly 8 hours :p 2025-07-14 13:15:01 ikke: any news on ci-cplane-1? or anything else you need me to work on? 2025-07-14 13:25:51 No news yet, was a bit busy and needed to upgrade gitlab 2025-07-14 13:26:21 alright. well done on that upgrade (: 2025-07-14 13:32:22 lotheac: if it interests you, I'm looking for a way to build docker / container images rootlessly in Kubernetes. I've looked at Buildah + Podman, but from what I could find, it relies on an unmaintained image to setup fuse in k8s. 2025-07-14 13:33:53 sounds reasonably interesting, i could take a look one of these days. i've dabbled in rootless podman containers on kube before... 
but as i recall it's a bit hairy :) 2025-07-14 13:35:47 regardless, the findings might be useful at $WORK too so yeah why not 2025-07-14 13:35:56 https://gitlab.alpinelinux.org/alpine/infra/docker/exec/-/blob/master/docker-image/entrypoint?ref_type=heads 2025-07-14 13:36:30 This is what we use currently, relies on a docker runner with the dockerd socket mounted 2025-07-14 13:37:21 gotcha 2025-07-14 13:45:59 ikke: what about buildkit? 2025-07-14 13:46:49 docker buildx create --name=kube --driver=kubernetes --driver-opt=namespace=buildkit,rootless=true 2025-07-14 13:47:33 then docker build or docker buildx build should work without issue 2025-07-14 13:48:04 (well, maybe after docker buildx use kube :>) 2025-07-14 13:48:40 https://docs.docker.com/build/builders/drivers/kubernetes/ 2025-07-14 13:54:32 ikke: random finding that says to use vfs storagedriver instead of fuse with rootless buildah https://docs.gitlab.com/ci/docker/buildah_rootless_tutorial/ -- possibly related to what you were describing 2025-07-14 18:22:48 this is a good candidate for riscv64 CI: https://liliputing.com/pine64-alpha-one-is-now-available-fanless-risc-v-pc-with-a-20-tops-npu/ 2025-07-14 18:42:01 there were so many good candidates and hopes, like the pioneer, but all were disappointing 2025-07-14 18:42:06 but hopefully this one i guess 2025-07-14 18:43:13 i have a p550 under my desk. it is fairly performant per cpu core. the pioneer has many cores but are slow per core. 2025-07-14 18:43:36 but I have the machine under my desk powered off because it is noisy 2025-07-14 18:44:47 i see 2025-07-14 18:45:08 i just hope that at one point i dont have to wait 10x for the riscv64 to finish in comparison to all the other one 2025-07-14 18:46:44 we have at least one slow riscv64 CI builder. also on my desk. 2025-07-14 18:47:05 banana bpi f3 or what it is called 2025-07-14 19:23:42 Yeah, mine as well 2025-07-15 10:10:08 ikke: do you remember where we get the timestamp from for ca-certificates? 
https://gitlab.alpinelinux.org/alpine/ca-certificates/-/commit/35fc750804a78979c392a151aee12ce69e618713 2025-07-15 10:11:45 probably the commit date https://hg-edge.mozilla.org/mozilla-central/file/tip/security/nss/lib/ckfw/builtins/certdata.txt 2025-07-15 10:22:42 ncopa: It must be something like that 2025-07-15 19:36:30 how slow is it? id just like to get an idea. how many would you need of the fastest one you've tested? 10? 100? more? 2025-07-15 20:32:20 nu_: The problem is that individual builds are taking a very long time. We sometimes have a queue for CI, but that's survivable. But constantly having to wait for rv64 to finish is annoying. Also the builders don't scale with more HW atm 2025-07-16 07:03:35 router decided to have hardware bugs:/ will be replaced today 2025-07-16 07:21:16 Ack, appreciate the update 2025-07-16 13:57:45 i think build-edge-s390x is stuck 2025-07-16 13:57:58 its for hours at a package that shouldve been done in 2 minutes 2025-07-16 14:02:37 ok 2025-07-16 14:02:45 looks like gitlab is 500 now 2025-07-16 14:03:05 https://gitlab.alpinelinux.org/alpine/abuild/-/merge_requests/397 2025-07-16 14:03:44 anything on https://gitlab.alpinelinux.org/alpine/abuild gives 500 apparently 2025-07-16 14:03:53 yeah some pages load some give "An error occurred while loading merge requests" and some 500 2025-07-16 14:04:23 now it works again :) 2025-07-16 14:04:42 oh not anymore 2025-07-16 14:04:45 Seems to be flaky 2025-07-16 14:04:50 guess thats a sign i should take a break 2025-07-16 14:08:28 Load is not high 2025-07-16 15:51:38 I think there was some query that took a long time 2025-07-16 15:52:29 All postgres connection slots were consumed 2025-07-16 16:03:00 I've increased the max number of connections 2025-07-16 16:09:49 if postgres itself is fine and not overloaded, maybe the client problem could be mitigated by https://docs.gitlab.com/omnibus/settings/database/#setting-client-statement_timeout 2025-07-16 16:10:57 (or equivalent, i have no idea 
if that documentation is relevant... seems kind of old) 2025-07-16 16:13:04 It was a setting (max_connections) I forgot to replicate when upgrading to a new major postgres version 2025-07-16 16:13:17 I now set it via CLI args in the docker compose config 2025-07-16 16:13:34 gitlab starting from version 16 has 2 separate databases 2025-07-16 16:13:41 so the amount of connections increased 2025-07-16 16:14:02 https://docs.gitlab.com/omnibus/settings/database/#configuring-multiple-database-connections 2025-07-16 16:15:18 lotheac: but thanks for the suggestion 2025-07-16 16:15:24 max_connections on the pg configuration, you mean? 2025-07-16 16:15:37 what was that set to before you changed it? 2025-07-16 16:15:47 np, just trying to understand (: 2025-07-16 16:15:51 100 -> 200 2025-07-16 16:16:41 so it's kinda reasonable to assume it's opening a database connection per request, then, or otherwise it wouldn't have that many? 2025-07-16 16:17:31 which means i could still DoS it if i knew what request to make that doesn't finish quickly? 
;) 2025-07-16 16:18:26 I think it's per puma instance 2025-07-16 16:18:37 But there are also other background jobs 2025-07-16 16:25:13 The log messages were quite periodic, many times on the hour 2025-07-16 16:25:29 So i suspect some background job pushing it over the limit 2025-07-16 16:27:58 right -- but i'm wondering if it makes sense for _any_ query that gitlab makes to pg to be taking longer than say, 60s 2025-07-16 16:28:20 the way i'm reading that documentation is, the default is "no limit" 2025-07-16 16:29:43 it would be dumb for it to be spending more than 100 connection slots waiting for things that take minutes 2025-07-16 16:30:11 (because then that would be a problem with the database taking minutes, imo) 2025-07-16 16:36:38 I have not experienced any request that would wait indefinitely 2025-07-16 16:39:02 I believe gitlab sets a statement timeout 2025-07-16 16:39:12 There is none set in postgresql, but I do see timeout in the logs 2025-07-16 16:39:29 ERROR: canceling statement due to statement timeout 2025-07-16 16:39:39 if so, what caused the 500 to normal clients? 2025-07-16 16:40:01 Something consuming all the connections 2025-07-16 16:41:01 which should only happen if there were enough slow db queries happening from the gitlab point of view at the same time, right? 2025-07-16 16:41:17 I think it's possible for something to keep a connection open, right? 2025-07-16 16:41:29 yeah 2025-07-16 16:41:29 ie, not just one statement per connection 2025-07-16 16:42:30 but what would that be? and why would that connection be shared with the thing serving the normal http requests? 
2025-07-16 16:43:00 i mean, i assume the thing that has the connection limit holds a pool of db connections or something 2025-07-16 16:43:12 It's postgresql itself that has the limit 2025-07-16 16:43:26 So everything connecting to the DB shares the same pool 2025-07-16 16:44:03 there is a proxy you can put in front of it to avoid that issue 2025-07-16 16:44:05 right, that comes back to everything that connects to it consuming a slot, so if something is misbehaving in gitlab, boom 2025-07-16 16:45:04 i'm trying to say, maybe the connection option in gitlab where gitlab itself, in each connection, tells pg to timeout (maybe not statement_timeout, but connection_timeout) -- could mitigate 2025-07-16 16:45:48 actually scratch that point about connection_timeout, it's not an option it tells to pg, it's just a timeout it uses for the client when it's trying to reach the db 2025-07-16 16:46:15 right 2025-07-16 16:46:23 statement_timeout is a thing in postgres https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-STATEMENT-TIMEOUT which is why i think it might help, assuming that the rails setting does that on the connection 2025-07-16 16:46:41 but it already does that, given the above error message 2025-07-16 16:47:26 hm. maybe https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-IDLE-SESSION-TIMEOUT then? 
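[editor's note] The settings being discussed (max_connections, statement_timeout, and the linked idle_session_timeout) are plain postgresql.conf GUCs; a sketch with illustrative values, plus a diagnostic query to see what is actually holding the connection slots:

```sql
-- postgresql.conf fragment (values illustrative, not the server's real config):
--   max_connections      = 200       -- was 100 before the bump mentioned above
--   statement_timeout    = '60s'     -- cancel any statement running longer
--   idle_session_timeout = '10min'   -- PostgreSQL 14+: drop sessions idle this long

-- What is occupying the slots right now?
SELECT state, count(*) AS slots
FROM pg_stat_activity
GROUP BY state
ORDER BY slots DESC;
```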
2025-07-16 16:47:46 the assumption being -- maybe gitlab leaves sessions open doing nothing except consuming sockets 2025-07-16 16:49:31 There are plenty of idle connections, but I'm not sure it's necessary to kill them 2025-07-16 16:50:01 only if they perform a denial of service on new ones :) 2025-07-16 16:50:28 In the past, it was sufficient to raise the max connections from 100 to 200 2025-07-16 16:50:50 alright, i'll defer to that 2025-07-16 16:51:05 i don't have all the information anyway, just speculating 2025-07-16 16:51:19 lotheac: I mean, if you kill the idle connection, but it then needs to re-establish the connection again anyway to perform a query, you do not gain a lot 2025-07-16 16:51:24 if it works, it works, that's good :D 2025-07-16 16:51:34 true, true 2025-07-16 16:52:26 From what I gather, puma keeps some connections open to serve its clients, and then there may be other processes that also open a connection to perform work 2025-07-16 16:53:12 usually to find out these kinds of "wtf is that software doing" things it's necessary to perform some more experiments to verify what that software does in certain situations 2025-07-16 16:53:38 it's always fun, but not in production 2025-07-16 16:53:42 but gitlab is battle-tested enough that these things should be taken care of 2025-07-16 16:53:53 you'd think so :) 2025-07-16 16:54:04 My experience confirms that 2025-07-16 16:55:29 The only issue i've noticed is with some run-away git processes that consumed a lot of resources 2025-07-16 16:55:37 "battle-tested software; this should be taken care of" is probably my personal top 1 assumption that i've made and had proven wrong by experience :P 2025-07-16 16:55:51 circumstances always differ 2025-07-16 16:56:42 sure, hence needing to raise the max_connections 2025-07-16 16:56:47 it's always just this one weird bug in this one weird env, "nobody uses it that way" 2025-07-16 16:57:14 yeah, i don't argue with raising max_connections, i just want to understand why 
it was necessary 2025-07-16 16:57:32 "In GitLab 16.0, GitLab defaults to using two database connections that point to the same PostgreSQL database." 2025-07-16 16:57:33 but maybe not today :) 2025-07-16 16:57:42 it's getting late 2025-07-16 16:57:55 The default was sufficient, but since 16.0 we need to double it 2025-07-16 16:58:28 sure, i agree with that reasoning 2025-07-16 16:58:44 (but i still want to know what it's doing, eventually) 2025-07-16 16:58:49 And of all the load issues we've had, postgres was never involved 2025-07-16 16:59:11 I do see ~50-60 connections (most of them idle) 2025-07-16 16:59:28 yup, makes sense 2025-07-16 16:59:38 sorry for the diversion :D 2025-07-17 10:40:42 gitlab has problems 2025-07-17 10:54:56 Checking, im away now, so see what i can do 2025-07-17 11:09:11 CI is broken still 2025-07-17 11:09:14 Pulling docker image registry.alpinelinux.org/alpine/infra/docker/alpine-gitlab-ci:latest ... 2025-07-17 11:09:14 WARNING: Failed to pull image with policy "always": Error response from daemon: unknown: 404 page not found (manager.go:250:1s) 2025-07-17 11:09:14 12 2025-07-17 12:45:06 Should be fixed now 2025-07-17 12:46:01 i guess gitlab is down 2025-07-17 13:07:41 down again 2025-07-17 13:08:04 HTTP 502: Waiting for GitLab to boot 2025-07-17 13:14:32 well that'd explain why i'm having issues pulling down alpine-mksite 2025-07-17 14:21:43 it looks like gitlab server is overloaded 2025-07-17 14:23:07 load average: 152.55, 133.47, 131.05 2025-07-17 14:23:29 we need 150x cpu power to keep up with the current work load 2025-07-17 14:30:58 whats up with zabbix agent? 2025-07-17 14:33:33 ncopa: are you using gdu? 2025-07-17 14:33:57 no 2025-07-17 14:34:01 dont know what it is 2025-07-17 14:34:11 maybe ikke ? 
2025-07-17 14:34:29 looks like some dir analyzer 2025-07-17 14:35:24 Yes, to track directory sizes 2025-07-17 14:35:50 now traefik is the main process, i guess its some kind of ddos 2025-07-17 14:36:03 Yes, exactly what happened before 2025-07-17 14:36:13 go-away is not helping in this case? 2025-07-17 14:39:01 maybe they found a way to bypass go-away. it was just a question of time 2025-07-17 14:41:07 gdu is executed by zabbix 2025-07-17 14:41:35 I think zabbix agent scripts are not able to complete in time 2025-07-17 14:42:16 2025/07/17 14:41:58.442632 plugin 'Cpu': time spent in collector task 1.005811 s exceeds collecting interval 1 s 2025-07-17 14:42:16 2025/07/17 14:41:58.564337 plugin 'Proc': time spent in collector task 32.461093 s exceeds collecting interval 1 s 2025-07-17 14:42:55 yeah i saw zabbix had high cpu 2025-07-17 14:43:10 then after gdu popped up 2025-07-17 14:50:59 so, the site is down now anyway right? maybe we can stop the traefik container? 2025-07-17 14:51:07 then it should all calm down, right? 2025-07-17 14:51:15 nothing can connect 2025-07-17 14:55:11 it could be as simple as it is running out of memory 2025-07-17 15:01:14 yes, I think we are running out of memory 2025-07-17 15:01:49 I enabled 4G swap in zram 2025-07-17 15:02:14 and immediately all 4G was used 2025-07-17 15:04:51 im gonna stop the traefik container 2025-07-17 15:04:56 Is there no other way for anyone to push anything when gitlab is down? 2025-07-17 15:05:08 not really 2025-07-17 15:05:20 and the builders cannot pull either 2025-07-17 15:06:06 I already stopped everything before 2025-07-17 15:06:34 It's a temporary measure 2025-07-17 15:06:58 Oh, i thought the builders could pull from git.a.o instead 2025-07-17 15:07:24 but git.a.o would not be able to pull from gitlab... 
2025-07-17 15:07:49 Ok 2025-07-17 15:09:13 im stopping traffic only now 2025-07-17 15:09:55 Maybe i should stop (auto-)refreshing gitlab 2025-07-17 15:10:03 ACTION closes the tabs for gitlab.a.o 2025-07-17 15:11:57 i have stopped traefik so it should not accept new connections 2025-07-17 15:12:20 there are still lots of processes running 2025-07-17 15:12:25 and cpu load is still high 2025-07-17 15:16:19 so, if I have stopped traefik, it can no longer accept new connections, right? 2025-07-17 15:16:46 and the current connections/jobs, will eventually finish whatever they are doing, right? 2025-07-17 15:16:54 so load should go down? 2025-07-17 15:17:17 why doesn't that happen? 2025-07-17 15:20:30 Needs more swap space? 2025-07-17 15:20:58 I added 4G swap and it ate it within a second 2025-07-17 15:21:07 zram swap 2025-07-17 15:21:59 what I find a bit strange is that I stopped traefik, so it no longer accepts any new connections. but load does not go down 2025-07-17 15:24:36 I saw a message about gitlab being upgraded to version 18 a few days ago, could it be some issue with this new version? 2025-07-17 15:24:56 i doubt 2025-07-17 15:25:11 maybe just reboot the machine? 2025-07-17 15:25:26 well, could be that gitaly or something uses more memory and it adds up 2025-07-17 15:26:00 number of gitaly processes is slowly going down 2025-07-17 15:26:09 82 initially and now it is 80 2025-07-17 15:26:20 maybe I should stop the gitaly container 2025-07-17 15:28:04 ok, now it is coming back slowly 2025-07-17 15:28:52 I think the problem is that it uses too much memory 2025-07-17 15:29:08 adding disk based swap will probably make things worse 2025-07-17 15:30:03 ikke: do you know if there is some way to limit number of gitaly instances? 
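[editor's note] gitaly does support limiting concurrent expensive RPCs per repository in its config.toml; a sketch with illustrative values (the `[[concurrency]]` schema is from gitaly's documentation, the RPC names and limits here are examples, not the server's actual settings):

```toml
# gitaly config.toml fragment: cap concurrent fetch-serving RPCs per repo
[[concurrency]]
rpc = "/gitaly.SmartHTTPService/PostUploadPack"   # git fetch/clone over HTTP
max_per_repo = 20

[[concurrency]]
rpc = "/gitaly.SSHService/SSHUploadPack"          # git fetch/clone over SSH
max_per_repo = 20
```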
2025-07-17 15:31:10 maybe I should use the opportunity and upgrade kernel and reboot 2025-07-17 15:32:27 i restarted traefik 2025-07-17 15:35:15 its back again for now 2025-07-17 15:41:29 i guess the crawlers are using the repo to fetch information, and when gitlab cant keep up lots of gitaly processes are just waiting to get served. 2025-07-17 15:46:04 how can I get stats like requests per second from traefik? 2025-07-17 15:46:38 I suspect that the server is underspec'ed 2025-07-17 15:47:25 gitlab docs say 1000+ users should have 16G ram 2025-07-17 15:47:33 we have 13k users 2025-07-17 15:48:25 19k issues 2025-07-17 15:48:41 91k merge requests 2025-07-17 15:48:52 317k pipelines 2025-07-17 15:57:21 rate-limiting was removed in gitlab 18. maybe we need to tweak the concurrency settings https://docs.gitlab.com/administration/gitaly/monitoring/#monitor-gitaly-rate-limiting-removed 2025-07-17 16:00:28 problem is coming back 2025-07-17 16:33:31 I'm only available later, sorry 2025-07-18 08:00:36 ikke: should we move gitlab to the new box in germany? 2025-07-18 08:00:53 or is that difficult with backups and such 2025-07-18 09:09:14 Pulling docker image registry.alpinelinux.org/alpine/infra/docker/gitlab-runner-helper:latest ... 2025-07-18 09:09:15 ERROR: Job failed: failed to pull image "registry.alpinelinux.org/alpine/infra/docker/gitlab-runner-helper:latest" with specified policies [always]: Error response from daemon: unknown: 404 page not found (manager.go:254:0s) 2025-07-18 09:09:15 WARNING: Failed to pull image with policy "always": Error response from daemon: unknown: 404 page not found (manager.go:254:0s) 2025-07-18 09:09:20 something is broken 2025-07-18 09:55:53 Need to start the registry container again 2025-07-18 10:43:50 I can't access it right now, but `docker compose up -d registry` in /srv/compose/gitlab 2025-07-18 10:44:43 done 2025-07-18 10:44:45 thanks! 
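[editor's note] Since the registry container had to be brought up by hand here, a restart policy in the compose file would let docker bring it back on its own after a reboot or daemon restart; a sketch of the fragment (service name from the `docker compose up -d registry` command above, the image line is illustrative):

```yaml
# /srv/compose/gitlab/docker-compose.yml (fragment)
services:
  registry:
    image: registry:2          # illustrative; whatever image is actually in use
    restart: unless-stopped    # restart on boot / daemon restart, but respect manual stops
```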
2025-07-18 18:15:59 clandmeter: ncopa I'm not sure it's a pure lack of resources (other than perhaps disk throughput) 2025-07-18 18:16:06 Moving gitlab has downsides as well 2025-07-18 18:17:59 My suspicion is that once some kind of io threshold is reached, the load will not go down again 2025-07-18 18:18:38 Once I killed all the /tmp/gitaly-* processes, the load reduces again and everything seems to be fine 2025-07-18 18:19:31 I don't see an increased amount of requests 2025-07-18 20:48:22 ncopa> rate-limiting was removed in gitlab 18. maybe we need to tweak the concurrency settings https://docs.gitlab.com/administration/gitaly/monitoring/#monitor-gitaly-rate-limiting-removed 2025-07-18 20:49:14 i dont know if you looked at the concurrency settings in gitlab 18. I suspect it would help if we reduced number of concurrent gitaly processes 2025-07-18 20:57:39 Note that we haven't seen any rate limits before as far as I know 2025-07-18 20:57:51 Set* 2025-07-18 21:17:55 nu_: the arm builder is still unreachable 2025-07-18 22:05:01 https://matrix.org/blog/2025/07/security-predisclosure/ 2025-07-18 22:05:49 do you want me to do anything about alpine matrix rooms? 2025-07-18 22:52:10 though I wonder should it be done somehow by oftc ircbot 2025-07-19 14:41:39 also I lost op on all channels 2025-07-20 18:20:44 ikke, ncopa, (etc?) when you have the time and things are stable again? https://gitlab.alpinelinux.org/alpine/infra/alpine-mksite/-/merge_requests/106 2025-07-21 07:12:15 is there some news about the ARM builders? ^^ 2025-07-21 10:27:23 achill: nope :( 2025-07-21 12:32:45 i wasnt aware that they are down 2025-07-21 12:41:20 09:06 < algitbot> che-bld-1.alpinelinux.org: [solved] | Host ICMP unreachable | http://dup.pw/a/17321/9550986 2025-07-21 12:41:23 09:06 < algitbot> che-bld-1.alpinelinux.org: [solved] | Host ICMP unreachable | http://dup.pw/a/17321/9550986 2025-07-21 12:41:41 doesn't this mean that it was solved at the time? 
2025-07-21 12:42:01 or are there any more alerts/indication i could take a look at? 2025-07-21 12:44:58 There is one more message later 2025-07-21 12:45:01 che-bld-1.alpinelinux.org: [problem] | Host ICMP unreachable | http://dup.pw/a/17321/9551439 2025-07-21 12:52:24 i try so hard to not have it down even for half a day, and now it was 5 days because of not knowing about it:( 2025-07-21 12:52:38 should be back btw 2025-07-21 12:54:34 Thanks, sorry. What would be a better way to keep you informed about it? I did ping you here on irc this weekend 2025-07-21 12:57:03 not your fault, for some reason irc is not even highlighting the mentions here, and i really should setup some monitoring for the builder availability 2025-07-21 12:59:34 i think some basic check like the last upload on wss://build.alpinelinux.org/ws/ would already be great 2025-07-21 13:24:56 nu_: I can setup our monitoring you directly if you wish 2025-07-21 14:16:04 could you rephrase? i dont get it 2025-07-21 14:17:47 to notify you directly* 2025-07-21 14:35:56 via email? that would be nice 2025-07-21 14:37:41 Sure, yes 2025-07-22 05:40:38 I've increased the amount of pushes before gitaly is trying to optimize repositories. I see there are a lot of pack-object processes running when this happens 2025-07-22 06:02:41 arm builder got a new (used) ups :p 2025-07-22 06:16:27 nu_: nice, thanks 2025-07-22 12:03:49 can someone merge https://gitlab.alpinelinux.org/alpine/infra/gitlab-tf/-/merge_requests/44 2025-07-22 12:03:55 (and deploy) 2025-07-22 12:56:23 achill: done 2025-07-22 13:24:13 thanks 2025-07-22 19:07:23 My current theory after searching around a bit regarding the gitlab issues is that it may have to do with the pack file caching we enabled a long while ago 2025-07-23 04:29:45 could someone maybe check on the edge armhf and armv7 builders whether the build is still progressing? they may be stuck? 
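[editor's note] The "last upload on wss://build.alpinelinux.org/ws/" check suggested above boils down to comparing a last-seen timestamp per builder against a staleness threshold. A minimal Python sketch of that logic (the endpoint, its message format, and the 6-hour threshold are assumptions; only the comparison below is concrete):

```python
import time
from typing import Optional

# Illustrative threshold: treat a builder as stuck if nothing has been
# seen from it for this many seconds.
STALE_AFTER = 6 * 3600

def is_stale(last_seen: float, now: Optional[float] = None,
             threshold: float = STALE_AFTER) -> bool:
    """True if the last event from a builder is older than the threshold."""
    if now is None:
        now = time.time()
    return (now - last_seen) > threshold

# A real monitor would subscribe to wss://build.alpinelinux.org/ws/, record
# the timestamp of the last message seen per builder, and alert (e.g. by
# email, as agreed above) once is_stale() turns true for one of them.
```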
2025-07-23 08:26:36 https://gitlab.alpinelinux.org/alpine/aports/-/jobs/1945863 2025-07-23 08:26:42 welp why is it erroring again 2025-07-23 08:26:46 ERROR: Job failed: failed to pull image "registry.alpinelinux.org/alpine/infra/docker/gitlab-runner-helper:latest" with specified policies [always]: Error response from daemon: Head "https://registry.alpinelinux.org/v2/alpine/infra/docker/gitlab-runner-helper/manifests/latest" ;: received unexpected HTTP status: 500 Internal Server Error 2025-07-23 08:26:46 (manager.go:254:0s) 2025-07-23 08:27:25 mio: looks like they are in progress (now, at least) 2025-07-23 08:36:24 achill: seems like an intermittent issue 2025-07-23 08:36:32 or depending on the runner 2025-07-23 08:45:53 i see 2025-07-23 13:31:58 thanks for checking on the arm* builders :) 2025-07-25 10:56:27 happy sysadmin day! 2025-07-25 14:24:30 usrhere: thank you 2025-07-25 14:25:26 :) 2025-07-25 14:36:48 usrhere: it has been a while :) 2025-07-25 14:48:20 ikke: yeah, sorry 2025-07-25 14:49:32 life got busy for awhile but I should be able to help with something now 2025-07-25 14:50:16 ping me if you have some entry level tasks 2025-07-25 14:51:02 or even better if you have time for shadow sessions 2025-07-25 15:08:08 happy SAAD day 2025-07-25 15:09:01 although i guess that's a bit redundant. system admin appreciation day day 2025-07-27 09:38:30 the alpine docker registry seems to be having issues, causing the runners to fail 2025-07-27 10:31:09 Kladky: should be fixed now. I keep forgetting to add a restart policy 2025-07-27 10:31:27 (not that it crashed, but it prevents docker from automatically starting it again on boot / service restart) 2025-07-27 10:37:04 Also reminds me I should setup monitoring for registry.a.o 2025-07-27 21:58:20 Is git push with ipv6 broken for others? 2025-07-27 22:35:58 maybe ikke didn't fix ipv6 for all services? 
2025-07-28 10:24:50 I'm not sure why it's not working, it's the same mechanism
2025-07-28 10:33:34 Ok, for some reason docker did not allow port 22 via ssh
2025-07-28 10:34:01 I have rules to allow it, but docker inserts its rules before
2025-07-28 19:00:41 Looks like the s390x builder hangs on the busybox post-upgrade script
2025-07-28 19:00:48 For Gitlab CI
2025-07-28 19:00:59 https://gitlab.alpinelinux.org/Matthias/aports/-/jobs/1951309#L39
2025-07-28 19:16:55 Kladky: Might be related to the musl -r15 bug I ran into with the builders as well
2025-07-28 20:06:22 basically means we have to pin it at -r14
2025-07-28 21:59:24 Can we add a condition in main/musl to remove the added file on s390x (and check if it helps)?
2025-07-29 14:01:51 i have reverted the musl patch for now
2025-07-29 14:02:16 might need to upgrade the CI runner or something
2025-07-29 14:41:57 The issue happened after musl was upgraded to -r15 in the CI job, so as long as no faulty version is available anymore, it should be fine
2025-07-29 15:22:25 So upgrading to -r16 does not seem to fix it
2025-07-29 15:22:46 ( 1/32) Upgrading musl (1.2.5-r12 -> 1.2.5-r16)
2025-07-29 15:50:39 yeah
2025-07-29 15:50:44 im looking into it
2025-07-29 15:51:25 Alternative would be me pinning musl to <=-r12
2025-07-29 15:52:46 but it would still be broken on basically every s390x machine
2025-07-29 15:52:56 yup
2025-07-29 15:53:01 It needs to be fixed
2025-07-29 15:53:07 but at least to get CI going again
2025-07-29 15:53:27 yeah i guess
2025-07-29 16:10:06 Maybe a gcc/binutils change was the actual cause?
2025-07-29 16:11:03 achill already theorized binutils, but apparently that was upgraded after the musl change
2025-07-29 16:11:42 Based on the commit dates it must be the GCC 15 upgrade
2025-07-29 16:17:06 Is there any way to get the /lib/ld-musl-s390x.so.1 from -r14 as reference?
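[editor's note] The rule-ordering problem described above (Docker inserting its own iptables rules ahead of manually added ones) has a documented hook: the DOCKER-USER chain, which Docker evaluates before its own rules for traffic forwarded to containers. A command fragment under the assumption that the GitLab ssh service runs in a container and that `eth0` is the external interface; not run here:

```shell
# Rules in DOCKER-USER run before Docker's own FORWARD rules,
# so host-managed accepts for container-bound traffic belong here.
iptables  -I DOCKER-USER -i eth0 -p tcp --dport 22 -j ACCEPT
# Docker only manages ip6tables when its IPv6 support is enabled;
# in that case the same chain exists for v6.
ip6tables -I DOCKER-USER -i eth0 -p tcp --dport 22 -j ACCEPT
```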
2025-07-29 16:17:41 sertonix[m]: in #-devel I posted a link to the -r14 packages
2025-07-29 16:17:51 https://dev.alpinelinux.org/~kevin/musl-r14-s390x/
2025-07-29 16:23:03 hmm ill try rebuilding musl with gcc14 from 3.22
2025-07-29 16:29:34 yeah rebuilding musl with gcc14 works
2025-07-29 16:29:49 thats not going to be fun
2025-07-29 16:30:13 F.U.N.
2025-07-29 17:32:15 It somehow ends up at ldso/dynlink.c:2078 which is a for(;;); loop
2025-07-29 17:50:53 good to know, yet still i have no clue where to look next
2025-07-29 17:50:57 achill: Could you try this patch? www.openwall.com/lists/musl/2024/10/10/6
2025-07-29 17:51:12 sure
2025-07-29 17:54:51 ooo i guess that actually fixed it
2025-07-29 18:00:28 !87954
2025-07-29 18:04:18 thanks sertonix[m]
2025-07-29 18:48:29 Is there maybe some manual action needed before the s390x CI starts working again?
2025-07-29 18:50:13 no
2025-07-29 18:50:24 other than killing the jobs that are still deadlocked
2025-07-29 18:50:39 i killed one, not sure if there are other jobs
2025-07-29 18:51:01 but the queue is already down from about ~28 to 25
2025-07-29 19:15:08 12 left
2025-07-31 01:33:52 it would be nice if we could run an alpine-managed matrix bridge instead of the current setup, that way we could effectively manage the spam problem
2025-07-31 04:26:11 ikke: been a few weeks since you asked (sorry) but i can build rootlessly in kube with buildah; it just needs to use the vfs storage driver http://lotheac.fi/s/buildah.yaml
2025-07-31 04:27:11 this is essentially the same thing that https://docs.gitlab.com/ci/docker/buildah_rootless_tutorial/ notes
2025-07-31 04:28:32 Are there any downsides to the vfs driver? I understand it does not support overlays, but that shouldn't matter a lot when building images
2025-07-31 04:29:00 i don't think there are any relevant ones.
maybe performance
2025-07-31 04:29:43 years ago at $COMPANY with docker we also had to use vfs (or aufs) instead of overlayfs because overlayfs would not work on zfs, and it worked fine
2025-07-31 04:30:42 docker's upstream docs say "The vfs storage driver is intended for testing purposes, and for situations where no copy-on-write filesystem can be used. Performance of this storage driver is poor, and is not generally recommended for production use." but eh. https://docs.docker.com/engine/storage/drivers/select-storage-driver/
2025-07-31 04:31:35 i suppose the other option would be to provide /dev/fuse to the container to allow it to use fuse-overlayfs
2025-07-31 04:31:49 but least-privilege wise i think vfs just makes more sense
2025-07-31 04:32:00 yeah, agreed
2025-07-31 05:37:38 Ariadne: nu_ was working on setting up something for us
2025-07-31 05:37:48 or maybe not a bridge
2025-07-31 05:38:22 is there any way we can accelerate this process? moving to our own matrix rooms and bridge will enable us to deal with the spam more effectively
2025-07-31 05:38:48 OFTC is only keeping it around for us and f-droid
2025-07-31 05:39:02 so if we have our own bridge... well you see where i am going :)
2025-07-31 05:39:18 What would be involved in setting up a bridge?
2025-07-31 05:39:33 i guess we need our own homeserver and bridge
2025-07-31 05:39:44 since pmOS already has made the switch, they can probably tell us what we need?
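[editor's note] The rootless buildah setup discussed above boils down to forcing the vfs storage driver, since overlayfs needs privileges (or /dev/fuse for fuse-overlayfs) that an unprivileged pod lacks. A sketch of what the CI job would run, not taken from the linked buildah.yaml; the image name is a placeholder:

```shell
# vfs needs no copy-on-write filesystem and no extra privileges;
# it is slower, which is the main trade-off noted in the chat.
export STORAGE_DRIVER=vfs   # honored by buildah via containers/storage

# Build from the Dockerfile in the current directory and push.
buildah bud -t registry.example.org/myimage:latest .
buildah push registry.example.org/myimage:latest
```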
2025-07-31 05:44:24 you most likely want heisenbridge
2025-07-31 05:44:26 https://github.com/hifi/heisenbridge
2025-07-31 05:45:29 or appservice-irc if you plan on puppeting
2025-07-31 14:08:43 Ariadne: it was also offered that pmos could just host the bridge
2025-07-31 14:08:49 in fact, im using it right now
2025-07-31 14:09:58 but we never reallllly heard back so we dropped the ball, but afaik the infra team would still be interested
2025-07-31 14:35:33 achill: it's mostly because it's unfamiliar, so we do not really know what we want
2025-07-31 14:36:15 we could just kill the matrix bridging for good /somewhat-serious
2025-07-31 14:36:48 And achieve what? Some users unable to communicate with others?
2025-07-31 14:37:10 what prevents them from opening webirc?
2025-07-31 14:37:15 or another irc client?
2025-07-31 14:38:04 who is unable to communicate with who
2025-07-31 14:38:39 Users preferring matrix with users preferring irc
2025-07-31 14:38:44 the canonical form of communication is gitlab.a.o anyway so I don't understand that argument
2025-07-31 14:40:01 i think if the effort of upkeeping bridging for a total of 5 matrix users is too high, the bridge should be gone
2025-07-31 14:40:09 I mean, you are here now on irc, communicating, not on gitlab
2025-07-31 14:40:24 right but that's because I'm on: gitlab, irc and matrix
2025-07-31 14:40:33 there are alpine devs who are just on gitlab
2025-07-31 14:40:54 thats a whole topic of its own
2025-07-31 14:42:00 ikke: i see, well fwiw our infra team is happy to assist
2025-07-31 14:42:01 I don't mean to start a thread about that specific topic, but just stating that the baseline is gitlab.a.o and anything more is additional stuff
2025-07-31 14:42:34 irc channels have been established a long time ago and almost everyone is using irc
2025-07-31 14:43:05 A lot of collaboration happens via chat, not via gitlab
2025-07-31 14:43:11 matrix is currently the platform that spams the most and is the most unstable one
2025-07-31 14:44:23 ikke: yes, i mentioned irc, irc was, is and will be here because almost everyone is using irc
2025-07-31 14:45:26 what I'm trying to point out is that you're getting yourself into more maintenance effort of additional services for approximately 5 people that actually use matrix
2025-07-31 14:45:32 A significant number of users are on irc via the matrix bridge (significantly more than "5" users)
2025-07-31 14:46:16 panekj: That's why we're open to outsourcing it
2025-07-31 14:46:59 imo, "outsourcing" is still part of "getting yourself into more maintenance effort"
2025-07-31 14:47:23 you don't involve yourself in running the bridge/homeserver but still suffer from matrix issues
2025-07-31 14:47:47 fwiw my views on matrix are known, but if we had a well-managed (emphasis on that) bridge, that would be better than the status quo, at least for the purposes of alpine
2025-07-31 14:48:02 as long as someone is willing to deal with running it
2025-07-31 14:48:38 but for the purpose of entertaining the idea of not killing the bridging, you can try to have it run by pmos
2025-07-31 14:48:51 and see how it behaves
2025-07-31 14:50:48 I don't think we really have a choice without alienating quite some users.
2025-07-31 14:51:20 That you are willing to ignore that does not mean we are
2025-07-31 14:51:20 you don't but you should not think that it is your problem
2025-07-31 14:52:43 I've tried really hard to keep a bridge alive for another project by running moderation tools and then replacing the bridge with our own, but in the end it's too much effort and people will just shit on the project without knowing the context
2025-07-31 14:58:25 panekj: have you run mjolnir or are you talking about other tools
2025-07-31 14:58:50 (sorry if my knowledge is not equal to my disdain for the subject matter)
2025-07-31 14:59:02 i have run mjolnir (and pantalaimon for mjolnir)
2025-07-31 14:59:17 sorry, I should have said "moderation tools"
2025-07-31 14:59:35 it's hardly moderation except that it bans in all rooms
2025-07-31 15:00:00 kline equivalent. which is the minimum you need
2025-07-31 15:00:45 it's too bad the ban syncing is just sitting waiting to be merged.
2025-07-31 15:03:16 the minimum i need is not having matrix anywhere
2025-07-31 15:03:32 or at least not m.org
2025-07-31 15:03:54 it increasingly sounds like we're not living in a matrix-optional world anymore.
2025-07-31 15:04:09 panekj: i was gonna say "based" but then didn't
2025-07-31 15:04:24 hopefully the pay requirement for more features will kill m.org for most
2025-07-31 15:05:02 lotheac: i hate matrix as a proto and org and software and... I think i just hate matrix at this point
2025-07-31 15:05:09 all around
2025-07-31 15:05:26 i understand
2025-07-31 15:05:30 it caused me severe mental pain even by just using it
2025-07-31 15:06:08 but if you go with your own homeserver and disable federation completely, you cut off about half of the bad stuff
2025-07-31 15:06:42 and it's mostly just bad homeserver software and bad client software (at least IMO)
2025-07-31 15:06:43 in the days of old, irc operators would not give any shits about the channels. you lost your op?
no response
2025-07-31 15:07:10 matrix has no such division of duties
2025-07-31 15:07:35 afaik
2025-07-31 15:08:38 yeah, the federation part then was… not open
2025-07-31 15:08:52 i think it needs to be
2025-07-31 15:10:58 if alpine had a non-federated homeserver, would the existing matrix users use it?
2025-07-31 15:11:28 asking because legit i don't know what the experience is like.
2025-07-31 15:11:31 since most people are on matrix.org i doubt it
2025-07-31 15:13:32 on irc, it's normal to have multiple networks/etc, but i don't know if matrix clients work that way at all.
2025-07-31 15:13:49 you can use multiple networks on matrix clients
2025-07-31 15:14:00 if it's a pain for them to switch to alpine's server, i can see why people wouldn't use it.
2025-07-31 15:14:00 or rather
2025-07-31 15:14:33 you generally can use only a single user account that is tied to a specific homeserver (usually matrix.org) that can join rooms on matrix.org or any other homeserver that is federated with matrix.org
2025-07-31 15:15:06 but tbh I'm unsure what problem alpine is trying to fix
2025-07-31 15:15:12 yeah, so it would be a pain.
2025-07-31 15:15:21 if people from matrix are so precious then changing homeserver/bridge will not fix that
2025-07-31 15:15:28 because the spam is still coming from matrix.org?
2025-07-31 15:15:40 and people are still using matrix.org
2025-07-31 15:16:44 btw. it would be nice if a.o had some kind of SSO for wiki/gitlab/matrix
2025-07-31 15:17:18 + grafana and everything else
2025-07-31 15:18:30 best case, matrix.org somehow gets acquired by private equity.
2025-07-31 15:18:47 which i've joked it seems like they're trying to do
2025-07-31 15:18:54 half-joke.
2025-07-31 15:19:41 isn't it basically owned by element
2025-07-31 15:19:54 wherever m.org fails, element has to pick up
2025-07-31 15:20:32 no idea.
my joke works in my brain because if that happened, the reaction in the foss community would necessarily be severe
2025-07-31 15:20:55 or, one would hope.
2025-07-31 15:23:24 (the punchline is, technical reasons aren't enough)
2025-07-31 15:30:26 btw, could some information about alpine-zh (Chinese) be added to a.o/community or wiki.a.o?
2025-07-31 17:08:01 i just prefer whatever solution gets rid of the spam vector :)
2025-07-31 19:08:25 i mean, we can bring mjolnir into the current channels/rooms, it will help with banning accounts in all rooms with just a single ban
2025-07-31 19:08:42 i dont see how a custom homeserver will help if matrix.org is allowed
2025-07-31 19:09:25 although if you remove alpine matrix rooms everywhere from matrix.org then the spammers might not discover the rooms
2025-07-31 19:39:11 that was my thinking
2025-07-31 19:39:20 if we can get them out of the directory
2025-07-31 19:39:23 we should be spammed less
2025-07-31 19:39:39 I'm not 100% confident it will work
2025-07-31 19:40:02 it's not a 100% solution, no