2025-08-01 08:02:59 ikke: do you run something similar to unattended-upgrades on all the servers? and if not - how do you ensure that all the servers have latest security updates installed? 2025-08-01 20:07:02 usrhere: Not really. Most software runs in containers, so unettended-upgrades does little for that. I'm moving things more towards kubernetes, where it's easier to automatically redeploy things. I already run renovate to keep images up-to-date as much as possible 2025-08-01 20:09:49 thanks, makes sense 2025-08-03 00:40:21 Ariadne: ikke: another thing we can do to minimise abuse from matrix side is to require matrix rooms to be accessible only via "matrix space" 2025-08-03 00:40:41 generally most if not all those spammers don't know how to use spaces 2025-08-03 16:21:36 panekj: you can't do that with the portal channels 2025-08-03 16:21:50 I think that requires at least Admin for that? But only the bridgebot has Admin 2025-08-03 16:23:12 btw pmOS has a bot that does cross-room banning 2025-08-06 14:31:02 f_: thank you for destroying my dreams 2025-08-06 14:50:35 pj: you're welcome 2025-08-06 14:50:46 hit me up when you have other dreams you want me to destroy 2025-08-06 14:51:21 btw I'm missing context 2025-08-06 14:59:34 might have been a dream about appservice-irc 2025-08-06 14:59:39 which is weird. 2025-08-06 15:20:05 fwiw, i can't recall why +M and/or +R isn't set, but i think we could be spared a lot of nonsense by using them. then exempt (+e) the webirc hostmask 2025-08-06 15:21:30 i dunno if +e would be needed for the matrix bridge. probably 2025-08-06 15:21:53 because then they can't talk? 2025-08-06 15:23:24 matrix users through the bridge are actual irc users 2025-08-06 15:25:27 i understand; i was writing extemporaneously. 2025-08-06 15:26:40 We have only set those when there were larger spam waves 2025-08-06 15:34:27 is there any argument against them? (if webirc and matrix are exempt) 2025-08-06 15:46:05 well matrix spam still goes through so it's useless 2025-08-06 15:46:36 unless you mean for the skraito stuff? 2025-08-06 15:49:27 not useless for irc spam. channels that have +M/+R do get noticeably less. (it won't stop determined a-holes though) 2025-08-06 15:49:46 invoked: how much irc spam did go through lately? 2025-08-06 15:49:53 Apart from the skraito stuff 2025-08-06 15:50:13 i don't have those statistics 2025-08-06 15:50:42 I'm not looking for statistics, just "not many" or "a little bit" or "a huge lot" 2025-08-06 15:50:43 but #tor for instance sets +R and it makes a difference. the only spam i've seen there in a while has come from matrix. 2025-08-06 15:50:56 +R also is extremely annoying 2025-08-06 15:51:09 so is spam 2025-08-06 15:51:09 OFTC doesn't have sasl, and their nickserv AJOIN thing is broken 2025-08-06 15:51:15 it's more annoying than spam 2025-08-06 15:51:26 because it means I don't autojoin the channel 2025-08-06 15:51:27 i'm more partial to +M but not my house. 2025-08-06 15:51:40 and then it means I miss out on dev chat 2025-08-06 15:51:55 until I notice, "huh why is #alpine-devel not there?" 2025-08-06 15:52:12 I'd be for +M if needed 2025-08-06 15:52:33 but the skraito spam really isn't anything that can't be solved with some spam filtrs 2025-08-06 15:52:36 filters 2025-08-06 15:53:11 alongside +b *skraito*!*@* 2025-08-06 15:53:15 i don't think there will ever be a silver bullet. 
but the tradeoffs on some things are worth it, imo 2025-08-06 15:53:33 Whatever you do, please don't +R 2025-08-06 15:53:42 not up to me, just teeing the discussion 2025-08-06 15:54:02 I cannot autojoin +R channels because no SASL (and no, certfp does not fix that) 2025-08-06 15:54:24 and I'm sure I'm not alone 2025-08-06 15:54:34 +R or +M or whatever else means anyone on matrix can't talk so I don't understand the conversation 2025-08-06 15:54:35 works on tor. :-/ 2025-08-06 15:55:16 invoked: #tor you mean? I fail to autojoin that channel 90% of the time and have to remember to rejoin when I wake up and my bouncer reconnects at night 2025-08-06 15:55:27 if I ever notice my bouncer reconnecting, even 2025-08-06 15:55:50 i mean, i never have a problem autojoining over a tor connection. but then i'm not in hundreds of channels 2025-08-06 15:56:02 tor is a special case 2025-08-06 15:56:03 tor or not is irrelevant in this case 2025-08-06 15:56:11 I think? 2025-08-06 15:57:39 pj: well, ones registered via nickserv can 2025-08-06 15:57:49 but autologin is completely broken on the bridge 2025-08-06 15:57:57 yes, but who does that on matrix 2025-08-06 15:58:05 (except for me, achill and probably you) 2025-08-06 15:58:19 I don't use matrix as an IRC bouncer, so no I don't do that 2025-08-06 15:58:27 and you are not authenticated at the moment 2025-08-06 15:58:48 +e bridge hostmask should work, or am i high 2025-08-06 15:58:50 because bridge suck 2025-08-06 15:58:56 invoked: not for +M 2025-08-06 15:59:13 hm 2025-08-06 15:59:18 guess i'm high 2025-08-06 15:59:22 for +R ye 2025-08-06 15:59:26 +s 2025-08-06 15:59:35 but you should use +I instead of +e 2025-08-06 15:59:46 that's for +R as i understood 2025-08-06 15:59:52 i haven't looked at the docs 2025-08-06 15:59:56 [pj] user has identified and verified with services 2025-08-06 16:00:02 are you high, f_? 2025-08-06 16:00:12 panekj: probably 2025-08-06 16:00:39 but f_|mo is not authenticated (: 2025-08-06 16:00:47 neither is f_|pmOS 2025-08-06 16:00:51 or f_[m] 2025-08-06 16:01:06 well, yes, which is why I said "probably" 2025-08-06 16:01:22 and you said I'm not authed 2025-08-06 16:01:34 you weren't last time I checked ;) 2025-08-06 16:02:33 but yeah I don't use matrix for IRC (I use irssi for that), so I don't need to be authed to nickserv on these 2025-08-06 16:03:14 i thought you were using senpai, or did i confuse you with someone else 2025-08-06 16:03:20 I used senpai 2025-08-06 16:03:46 but then I went to try irssi again 2025-08-06 16:03:59 and then went slightly too far in customising? 2025-08-06 16:04:12 (so it doesn't even look like stock irssi anymore) 2025-08-06 16:04:16 as jess said, irssi is an old man 2025-08-06 16:04:22 :p 2025-08-06 16:05:45 come to weechat 2025-08-06 16:05:51 no 2025-08-06 16:05:52 we have 2025-08-06 16:05:53 it's slow 2025-08-06 16:05:54 uhh 2025-08-06 16:05:56 terrible scripts 2025-08-06 16:06:00 and weechat-matrix-rs 2025-08-06 16:06:09 weechat is slow 2025-08-06 16:06:23 less slow than element but still slow 2025-08-06 16:06:30 you are slow 2025-08-06 16:06:32 it works fine 2025-08-06 16:06:43 It was very slow when I had it join 500 channels 2025-08-06 16:07:06 and irssi is fine anyway 2025-08-06 16:07:12 it only locks up for 2 minutes to load whole Matrix HQ room :> 2025-08-06 16:07:22 I have, uhhhhhh, 10 perl scripts loaded 2025-08-06 16:07:47 i do think weechat is slow with a large # of channels. i remember hearing that before, not just from f_ 2025-08-06 16:07:49 which I left btw. 
but thanks to matrix:TM: I've been graciously forced to experience the feeling of being part of that room 2025-08-06 16:08:26 I could use irssi-matrix 2025-08-06 16:08:29 but nah 2025-08-06 16:09:15 probably all offtopic here though 2025-08-06 16:09:44 as if this channel has any activity 2025-08-06 16:10:04 o/ 2025-08-06 16:10:06 well it's the space for algitbot to practise free speech 2025-08-06 16:10:17 we should not disturb that space :p 2025-08-06 16:10:18 algitbot: I SAID HELLO 2025-08-06 16:10:30 algitbot: you ok? 2025-08-06 16:10:39 #o 2025-08-06 16:10:42 \o 2025-08-06 16:10:45 \o/ 2025-08-06 16:10:50 smh 2025-08-06 16:11:21 though is algitbot connected via tls 2025-08-06 16:45:16 I don't think so 2025-08-06 16:45:58 It does not have tls support 2025-08-06 16:46:53 drats 2025-08-06 16:46:58 :> 2025-08-06 16:47:55 faq docker 2025-08-06 16:48:19 mhm 2025-08-06 16:53:15 ikke: uh, that's bad 2025-08-06 16:53:29 doesn't work with stunnel or similar? 2025-08-06 16:57:01 it would be good to import sircbot to gitlab or replace it with something else :> 2025-08-06 17:00:10 nah sircbot is nice 2025-08-06 17:00:17 with stunnel it makes for a nice companion 2025-08-06 17:01:18 other than that I have to restart it from time to time because it no longer responds to certain things 2025-08-06 17:04:49 strange never got that 2025-08-06 19:06:58 lotheac: I was trying the gitlab-runner deployment again, but before I did that, I updated the cluster. Now it seems (not sure if related to the upgrade or something else), the network is broken. Several services complain that 10.96.0.1:443 is not reachable, but not sure why yet 2025-08-07 01:03:50 ikke: that’s generally the api server’s internal address. problem with CNI? 2025-08-07 03:15:18 ikke: please let me know what you did to update so that i can try to repro it 2025-08-07 05:48:49 lotheac: to update it, I changed the k0s version to: "version: 1.33.3+k0s.0" 2025-08-07 05:50:07 Note that I have another cluster upgraded to the same version without issues 2025-08-07 05:50:29 After upgrading, I could not get any logs for any pods, nor was the metric server returning any results 2025-08-07 05:51:16 After searching, I saw that that could be related to the api externalAddress, which I noticed was not set. So I set it, applied again 2025-08-07 05:52:23 did that help? 2025-08-07 05:53:01 The first problems, but then some workloads had the issue with contacting the api server internally 2025-08-07 05:54:49 I stopped the cluster, because 2 out of 3 controllers started using a lot of CPU (and spamming the logs a lot) due to those connection issues 2025-08-07 05:55:46 i would investigate the CNI containers/logs to see if there is any hints there. the apiserver being unavailable at the internal address is just the symptom of the internal network being broken 2025-08-07 05:56:20 i gotta do some paid work right now, so i'll get back to you later :) 2025-08-07 05:57:15 same for me 2025-08-07 11:42:44 lotheac: On one of the nodes: [ERROR][24126] cni-plugin/plugin.go 593: Final result of CNI DEL was an error. 
error=error getting ClusterInformation: Get "https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.96.0.1:443: connect: connection refused 2025-08-07 11:44:40 The calico-kube-controller is in a CrashLoopBackoff 2025-08-07 12:08:07 it seems like something of a chicken and egg that the CNI containers on any one node would crash if they cannot reach the apiserver through an address that a working CNI itself is supposed to provide :P 2025-08-07 12:08:19 so maybe that is not the root problem 2025-08-07 12:08:57 s/provide/make routable/ 2025-08-07 12:09:21 Yeah, I'm trying to figure out how the 10.96.0.1 address is supposed to be routed 2025-08-07 12:09:29 ECONNREFUSED sounds interesting though, so there _is_ something at that address 2025-08-07 12:09:43 because it responds with RST 2025-08-07 12:09:46 ahuh 2025-08-07 12:09:49 so it is routable, but... 2025-08-07 12:11:08 i don't recall how exactly k0s provides the apiserver, but i would assume that all the control plane nodes have _some_ process on them meant to listen for apiserver requests 2025-08-07 12:11:44 it could be iptables/nft too, of course 2025-08-07 12:12:39 tcp 0 0 :::6443 :::* LISTEN 11298/kube-apiserve 2025-08-07 12:13:02 But I'm not sure how 10.96.0.1 is routed 2025-08-07 12:13:21 the CNI is supposed to handle that part 2025-08-07 12:14:02 depending on what CNI, it usually does it with iptables or nft or ebpf and some coordination between nodes 2025-08-07 12:14:18 or userland proxies 2025-08-07 12:14:32 But what is listening on port 443? 2025-08-07 12:17:34 it's probably supposed to be the apiserver; cni's handle port redirection stuff too 2025-08-07 12:21:30 the fact that it listens to 6443 on the host addresses most likely matters very little 2025-08-07 12:22:28 unless of course connectivity to those host addresses is being blocked by other firewall rules :) 2025-08-07 12:22:53 There are no other firewall rules, everything is managed by k0s / kubernetes 2025-08-07 12:26:29 in that case, sounds like the CNI is not doing something it should be doing 2025-08-07 12:27:36 you're able to access the k8s api though? 
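As an aside to the exchange above, a rough way to check how the 10.96.0.1 ClusterIP is wired up on a node; this is only a sketch, assuming the usual kube-proxy/Calico iptables NAT chains (chain names vary per setup, and Calico's eBPF mode bypasses them entirely):

    # confirm the ClusterIP and target port of the default kubernetes service
    kubectl describe -n default svc/kubernetes
    # list the NAT rules that should rewrite 10.96.0.1:443 to a node's :6443
    iptables -t nat -S | grep 10.96.0.1
    # or, if the rules are programmed via nftables
    nft list ruleset | grep -B2 -A2 10.96.0.1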
2025-08-07 12:27:39 yes 2025-08-07 12:27:48 Externally everything seems to be working 2025-08-07 12:27:59 It seems internal routing is broken somehow 2025-08-07 12:28:31 you could kubectl describe -n default svc/kubernetes and it will probably tell you that that svc is 10.96.0.1:443 just to verify 2025-08-07 12:29:02 but yeah it sounds like the pods cannot reach the internal network in that case 2025-08-07 12:29:14 which could mean that it is a problem with the container runtime 2025-08-07 12:30:28 try something like kubectl run --rm -ti --image=curlimages/curl:latest foo -- -v https://google.com/ 2025-08-07 12:30:34 I completely stopped the control plane 2025-08-07 12:30:59 that would tell you if the container network can reach external networks 2025-08-07 12:32:59 That command seems to hang 2025-08-07 12:33:11 like it does not even create the container 2025-08-07 12:33:26 can you kubectl describe pod/foo in another terminal to see what's going on 2025-08-07 12:34:01 This is svc/kubernetes: https://tpaste.us/8Bg8 2025-08-07 12:34:41 yeah, that's exactly what i expected: the service is being provided on 10.96.0.1:443, and the targerport (what the backend is actually listening on) is 6443 2025-08-07 12:34:47 targetport* 2025-08-07 12:35:11 lotheac: describing on the pod results in similar errors not being able to reach 10.96.0.1:443 2025-08-07 12:35:13 and the CNI is supposed to handle that redirection/forwarding 2025-08-07 12:35:23 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "248d06b3a4e8dbca7625cec572c03e6e9560e501e0eb5b3550fe46344625d725": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.96.0.1:443: connect: connection refused 2025-08-07 12:36:21 I did verify the calico network itself is working (I can ping each node from each other node via the vxlan.calico interface) 2025-08-07 12:36:37 ikke: ok, so maybe that means that the pod was scheduled to run on a different node than the apiserver... which of course it was, and then the apiserver you are contacting can't itself talk to the node it was scheduled on 2025-08-07 12:36:53 are there only controller+worker nodes in this cluster? or workers as well? 2025-08-07 12:37:10 workers as well, I'll taint them 2025-08-07 12:37:53 hm. or maybe pod creation would in any case contact the apiserver, not sure 2025-08-07 12:38:01 even less sure if it should happen via the internal addr like that 2025-08-07 12:38:34 because... the calico pods are supposed to be what provides that routing/connectivity 2025-08-07 12:38:38 (i think) 2025-08-07 12:39:55 how are you talking to the apiserver again? was it external lb? 2025-08-07 12:40:22 Yes 2025-08-07 12:40:24 does that lb show all its endpoints as healthy? 2025-08-07 12:40:32 it does 2025-08-07 12:41:30 The controller+worker nodes are of course tainted themselves as well, so the test pod would never be scheduled there 2025-08-07 12:41:47 right 2025-08-07 12:41:52 it would need a toleration 2025-08-07 12:42:46 i was probably chasing the wrong thread there anyway, i think the error message is saying that the container runtime itself wants to contact the apiserver at 10.96.0.1:443 2025-08-07 12:43:40 on the controller nodes' host network, are you able to establish a tcp connection to 10.96.0.1:443? 
2025-08-07 12:44:14 no 2025-08-07 12:44:20 "Could not connect to server" 2025-08-07 12:44:22 that _may_ be related 2025-08-07 12:44:27 but i'm not sure it is 2025-08-07 12:44:47 The host network has no route to 10.96.0.0/12 2025-08-07 12:45:40 ok, i think i need to set up my test k0s cluster again to see if i can get this issue to happen 2025-08-07 12:45:57 we're not getting anywhere like this :) 2025-08-07 12:47:34 I can try to switch it back to kube-router to see if everything on the control nodes is working 2025-08-07 12:48:37 wasn't there some blocker why we switched to calico to start with 2025-08-07 12:48:56 It would not work over dmpvn 2025-08-07 12:49:01 right 2025-08-07 12:53:38 It's a bit confusing as you say, calico itself is unable to start because it cannot reach that address 2025-08-07 12:53:47 cni-installer/install.go 478: Unable to create token for CNI kubeconfig error=Post "https://10.96.0.1:443/api/v1/namespaces/kube-system/serviceaccounts/calico-cni-p 2025-08-07 12:53:49 lugin/token": dial tcp 10.96.0.1:443: connect: connection refused 2025-08-07 12:55:14 probably calico has multiple components and the thing that is relying on the apiserver being available is not the thing that is broken 2025-08-07 12:58:16 But I would expect the CNI installer not to depend on the CNI already running 2025-08-07 12:58:35 correct 2025-08-07 12:59:10 i installed a new cluster with 1.33.3+k0s.0 onto fresh linode nodes and it works fine... sigh 2025-08-07 12:59:21 ci-cplane-1-c0:~# nc -v 10.96.0.1 443 2025-08-07 12:59:21 10.96.0.1 (10.96.0.1:443) open 2025-08-07 12:59:32 hmm, interesting 2025-08-07 12:59:41 Let me reboot the nodes 2025-08-07 12:59:55 I did start and stop the cluster a couple of times 2025-08-07 13:01:37 i'm gonna try creating it at 1.33.1 and seeing if the upgrade breaks it 2025-08-07 13:03:18 What does `ip route get 10.96.0.1` return for you? 
2025-08-07 13:03:49 i already destroyed that cluster, let me get back to you in a sec :D 2025-08-07 13:10:16 So I removed the api externalAddress, and calico-node is running again :/ 2025-08-07 13:11:06 But, I cannot get any logs of pods, nor is the metric service working 2025-08-07 13:11:10 I'm confused 2025-08-07 13:11:25 So all workloads are happy 2025-08-07 13:13:13 ok, so pod->internal net routing works, and so does lb->apiserver (otherwise you would not be able to see the pod statuses either), but maybe the apiserver pod cannot reach the kubelet api (running on each node) to get you the logs 2025-08-07 13:13:41 s/apiserver pod/apiserver/, might not be a pod in itself on k0s 2025-08-07 13:14:32 isn't, in fact 2025-08-07 13:14:53 ci-cplane-1-c2:~# ip route get 10.96.0.1 2025-08-07 13:14:53 10.96.0.1 via 172.233.50.1 dev eth0 src 172.233.50.200 uid 0 2025-08-07 13:14:53 cache 2025-08-07 13:15:09 this is a 1.33.1+k0s.0 upgraded to 1.33.3+k0s.0 cluster 2025-08-07 13:15:36 panic: unable to load configmap based request-header-client-ca-file: Get "https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp 10.96.0.1:443: 2025-08-07 13:15:40 172.233.50.1 is the default gateway, so i think the routing table tells us exactly nothing 2025-08-07 13:15:46 yeah 2025-08-07 13:15:56 as in the CNI probably intercepts this before it actually leaves the machine 2025-08-07 13:16:29 nc at least says the ports are open now 2025-08-07 13:16:41 on the host 2025-08-07 13:17:33 seems like calico is (at least in this configuration) using iptables for its stuff, based on iptables -L or nft list ruleset 2025-08-07 13:21:23 fwiw... kubectl logs can fetch any pod's logs for me in this configuration without issue 2025-08-07 13:23:30 If you do nsenter -t -n 2025-08-07 13:23:34 can you ping 10.96.0.1? 2025-08-07 13:23:41 I used the metrics-server pod 2025-08-07 13:24:11 I cannot even ping the default gw 2025-08-07 13:24:25 but I can curl google.com 2025-08-07 13:24:48 ❯ kubectl run --overrides '{"spec":{"tolerations":[{"operator":"Exists"}]}}' --rm -ti --image=curlimages/curl:latest foo -- -v https://google.com/ 2025-08-07 13:24:50 worked fine 2025-08-07 13:24:57 let me see about metrics-server 2025-08-07 13:25:36 ok, this works: nc -v 10.96.0.1 443 2025-08-07 13:26:07 yeah, ping doesn't work for me either 2025-08-07 13:26:12 https://termbin.com/g98l 2025-08-07 13:28:19 Error from server: Get "https://172.16.250.10:10250/containerLogs/kube-system/metrics-server-5f45c7b665-qvg9k/metrics-server": No agent available 2025-08-07 13:28:52 So probably an issue with konnectivity-agent 2025-08-07 13:29:43 FYI, that's the dmvpn gre address of the node 2025-08-07 13:30:43 right, 10250 is the kubelet api port 2025-08-07 13:31:46 i have no idea what "No agent available" means though 2025-08-07 13:34:02 based on some searching i am assuming that's an error message actually returned from the http api (as opposed to some errno that happened while trying to establish a connection) and it indeed seems related to konnectivity 2025-08-07 13:34:17 so are the konnectivity-agent pods on each node ok? 
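A quick way to answer that question; a minimal sketch, assuming the agents carry the upstream k8s-app=konnectivity-agent label (the label name is an assumption and may differ on k0s):

    # are the konnectivity agents running on every node, and what do they log?
    kubectl -n kube-system get pods -o wide -l k8s-app=konnectivity-agent
    kubectl -n kube-system logs -l k8s-app=konnectivity-agent --tail=20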
2025-08-07 13:34:19 And now suddenly it's working 2025-08-07 13:35:00 yes, they are all running, and just a single line in the logs 2025-08-07 13:35:00 ha :D sounds like kube networking alright 2025-08-07 13:35:10 1 clientset.go:285] change detected in proxy server count (was: 0, now: 3, source: "KNP server response headers") 2025-08-07 13:36:23 i searched for that string and found only https://irclogs.alpinelinux.org/%23alpine-infra-2025-06.log 2025-08-07 13:37:20 typical 2025-08-07 13:37:44 fun :p 2025-08-07 13:37:45 restarting k9s (which I use to inspect the state) and no longer working 2025-08-07 13:37:57 So I assume it has to do with which node I end up with through the LB 2025-08-07 13:39:12 https://docs.k0sproject.io/head/high-availability/ 2025-08-07 13:39:29 It does mention the api externalAddress 2025-08-07 13:39:42 you could verify that hypothesis by port-forwarding and trying to talk to each node individually... but by "no longer working" what do you mean as the failure mode 2025-08-07 13:40:06 metrics not available, not able to obtain logs 2025-08-07 13:40:41 ah as in the apiserver itself responds and is healthy, but is not able to reach the kubelet on the node that metrics-server is running on 2025-08-07 13:41:09 yeah you'd have about a one in three chance of hitting the same node that it's on, and if it doesn't work otherwise... 2025-08-07 13:42:12 Error from server (BadRequest): previous terminated container "metrics-server" in pod "metrics-server-5f45c7b665-qvg9k" not found 2025-08-07 13:42:18 Error from server: Get "https://172.16.250.10:10250/containerLogs/kube-system/metrics-server-5f45c7b665-qvg9k/metrics-server?previous=true": No agent available 2025-08-07 13:42:21 'kubectl get node -o wide' should tell you what the node internal addresses are considered to be -- and then you could try to see if you can reach them from another node using those addresses on :10250 2025-08-07 13:44:24 The dmvpn gre addresses, I can curl to each node (get a 404 back) 2025-08-07 13:44:24 maybe the in-container network cannot route to the gre internal addresses (or return traffic of that is not routed properly) 2025-08-07 13:48:37 my metrics-server is at pod ip 10.244.63.129 on the c0 node... and routable from c0 itself and c2 host netns https://termbin.com/k9cm 2025-08-07 13:50:14 trying to find the logs for konnectivity-server 2025-08-07 13:50:17 and conversely the pod network can reach either node https://termbin.com/7az4f 2025-08-07 13:51:52 https://tpaste.us/r5mL 2025-08-07 13:52:29 likely a symptom rather than a cause 2025-08-07 13:52:31 yes 2025-08-07 13:57:58 checking all the config 2025-08-07 13:58:43 This seems to be correct: https://tpaste.us/PQPx 2025-08-07 14:00:55 dial tcp 172.235.190.129:6443: connect: connection refused 2025-08-07 14:01:05 That seems to connect to the external ip 2025-08-07 14:01:21 my current thinking is: we are observing a failure with something from container netns trying to talk to a host dmvpn address (that or the return traffic of such) 2025-08-07 14:01:55 _different_ host, that is 2025-08-07 14:03:17 so i don't really know how dmvpn does its magic, but since calico is doing its own stuff with iptables... maybe there's some kind of conflict or assumption broken there 2025-08-07 14:03:39 what i'm confused by is, how did this start happening with the upgrade? 
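For the other end of that tunnel, a rough sketch; it assumes k0s runs konnectivity-server as a plain host process on each controller (the process name and the idea of counting its established agent connections are assumptions, not taken from the log):

    # on a controller host: is the proxy server up, and does it hold agent connections?
    ps ax | grep '[k]onnectivity'
    ss -tnp | grep konnectivity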
2025-08-07 14:07:06 https://termbin.com/1ust that's the entirety of the host nftables in a cluster _without_ dmvpn, maybe it's different with it - don't know 2025-08-07 14:07:13 i gotta head to sleep in a bit 2025-08-07 14:08:05 dmvpn does not set any firewall rules itself 2025-08-07 14:08:10 It's pure routing 2025-08-07 14:08:25 okay 2025-08-07 14:13:45 so if a container netns wants to talk to a dmvpn address of a different node... the host on which that container is running is expected to forward that traffic based on its routing tables 2025-08-07 14:14:29 yes 2025-08-07 14:14:37 and then the target node is supposed to be able to route the return traffic symmetrically 2025-08-07 14:15:37 based on what we've seen the node->pod ip traffic is relying on iptables instead of routing tables 2025-08-07 14:16:28 can you curl a pod ip address from a host different than one it is running on 2025-08-07 14:17:00 it'll be ECONNREFUSED anyway but 2025-08-07 14:17:10 if it times out we know something is wrong 2025-08-07 14:18:20 I can ping it 2025-08-07 14:18:58 curl: (7) Failed to connect to 10.244.157.149 port 80 after 1 ms: Could not connect to server 2025-08-07 14:19:00 curl 2025-08-07 14:19:09 No timeout 2025-08-07 14:19:16 so host->pod on different node routes fine 2025-08-07 14:19:34 but the other direction seems not to? 2025-08-07 14:20:24 from the netns of that pod, try tcp 10250 of the node that you just performed the experiment on 2025-08-07 14:21:20 i'm expecting this to fail, since that's supposed to be the problem we saw, but just sanity checking 2025-08-07 14:23:39 curl https://172.16.250.10:10250 -k -> 404 2025-08-07 14:23:47 from c2 -> c0 2025-08-07 14:23:54 namespace of the konnectivity-agent pod 2025-08-07 14:24:26 well now i'm confused, that means everything works in both directions 2025-08-07 14:24:32 curl https://72.235.190.129:6443 2025-08-07 14:24:34 this times out 2025-08-07 14:24:39 wait 2025-08-07 14:24:41 wrong ip 2025-08-07 14:24:55 ok, so that works as well 2025-08-07 14:25:01 curl https://172.235.190.129:6443 -k 2025-08-07 14:26:00 hmm... ah but i might have been barking up the wrong tree the whole time here. the apiserver process is running in the host ns 2025-08-07 14:26:24 so... what doesn't seem to work is the apiserver->kubelet (different node) traffic, right? 2025-08-07 14:26:39 as in, can't get logs of pod on different node because :10250 not reached 2025-08-07 14:27:14 I think 10250 can be reached, but it responds that it cannot access the agent? 2025-08-07 14:27:26 right, i guess i got confused there as well 2025-08-07 14:27:30 https://docs.k0sproject.io/v1.21.0+k0s.0/networking/ 2025-08-07 14:27:45 i don't really know what the konnectivity agent is supposed to be doing :) 2025-08-07 14:28:07 It facilitates controller <-> worker communication 2025-08-07 14:28:40 "The Konnectivity service provides a TCP level proxy for the control plane to cluster communication." 2025-08-07 14:29:00 I think the agent is supposed to connect back to the control plane 2025-08-07 14:29:13 and then that connection is used to proxy requests 2025-08-07 14:30:00 that sounds wholly unnecessary in this scenario of two nodes who have routable ip addresses between them just trying to connect to a http api on 10250 on each other 2025-08-07 14:30:36 https://docs.k0sproject.io/stable/networking/#controller-worker-communication 2025-08-07 14:31:34 i see 2025-08-07 14:40:53 yeah, i have no idea what's going on. 
gotta sleep on it and try to figure out konnectivity later 2025-08-07 14:41:14 Trying to see if I can get help from the k0s project 2025-08-07 14:42:16 based on the upstream k8s docs https://kubernetes.io/docs/tasks/extend-kubernetes/setup-konnectivity/ the agents are supposed to connect to a proxy server running on one of the cplane nodes, but i couldn't see any "server" pod for konnectivity in the k0s setup, just agents 2025-08-07 14:42:52 I think the server runs directly on the host 2025-08-07 14:42:57 (that might just be because the server part is provided by each control plane node and not actually shown as a pod, since that's how k0s generally operates) 2025-08-07 14:42:59 yes 2025-08-07 14:43:56 and there's the error string, but i didn't dig into potential causes https://github.com/search?q=org%3Ak0sproject+%22no+agent+available%22&type=code 2025-08-07 14:44:21 It seems to run on all control-plane nodes btw 2025-08-07 14:44:30 anyway... good luck, and later :) 2025-08-07 14:44:42 Thanks for your help 2025-08-07 15:23:58 lotheac: ok, really strange. I added externalAddress now again, and now everything seems to be working 🤨 2025-08-07 15:24:14 (except for konnecitivity-agent on one of the worker nodes 2025-08-07 15:24:16 ) 2025-08-07 18:20:14 Interesting DNS was not working properly. Apparently one of the dns pods was located on one of the workers 2025-08-07 18:20:48 After adding a nodeAffinity for the control plane nodes, dns is working a lot better 2025-08-07 19:21:29 The question then remains why requests to the worker node were not working 2025-08-08 00:00:35 hmm :| 2025-08-08 00:00:57 that sounds weird, the dns pods should work just as well no matter where they are scheduled 2025-08-08 00:01:13 which makes me think there is something still wrong with the network 2025-08-08 13:57:22 just noticed that on the beginning of a x86 ci log:  on shared-runner x86.ci.alpinelinuix.org (x86) -GMdM8s8, system ID: r_kvex70My1Hwh 2025-08-08 13:57:28 the hostname is probably a typo 2025-08-08 17:14:25 fyi: I broke my bananapi bpi-f3 with kernel update, so now one riscv64 runner is down 2025-08-08 17:36:40 and its back 2025-08-08 17:56:29 ok, im testing boot the linux-spacemit kernel from sdcard. I now get this error when booting: 2025-08-08 17:56:46 Retrieving file: /boot/initramfs-spacemit 2025-08-08 17:56:46 Retrieving file: /boot/vmlinuz-spacemit 2025-08-08 17:56:46 append: earlycon=sbi rw root=UUID=16878291-386c-4fd9-a6b0-2d54f341864b rootfstype=ext4 rootwait console=ttyS0,115200 clk_ignore_unused swiotlb=65536 2025-08-08 17:56:46 kernel_comp_addr_r or kernel_comp_size is not provided! 2025-08-08 17:56:46 Retrieving file: /boot/dtbs-spacemit/spacemit/k1-bananapi-f3.dtb 2025-08-11 07:34:49 Gitlab has been upgraded to 18.1 2025-08-11 07:51:35 ERROR: Job failed: failed to pull image "registry.alpinelinux.org/alpine/infra/docker/alpine-gitlab-ci:latest" with specified policies [always]: Error response from daemon: unknown: 404 page not found (manager.go:250:0s) 2025-08-11 09:07:14 can someone please ping me when ^ is fixed, so I can re-run CI for !88539 and !88540 2025-08-11 09:08:07 Or just someone else with permission can re-run them, I don't mind 2025-08-11 09:24:47 wait a while. 
GitLab was just upgraded to 18.1 and seems unstable—the manual trigger is still on 17.2.1 2025-08-11 09:26:03 Ok, need to run migrations for the registry 2025-08-11 09:28:38 Kladky: should work again 2025-08-11 09:29:29 Thanks 2025-08-12 16:50:21 durrendal: I've used your draft for deploying alpine-mirror-sync and created a playbook with several roles (and some tweaks): https://gitlab.alpinelinux.org/alpine/infra/ansible-playbooks/-/merge_requests/1 2025-08-12 17:44:25 ikke: these look like fantastic adjustments! I'm glad it worked and you were able to so easily extend it. 2025-08-12 17:45:14 I've used this to deploy 2 new servers :) 2025-08-12 17:45:22 https://ltu-t1-1.alpinelinux.org/ 2025-08-12 17:45:27 They're syncing now 2025-08-12 18:03:46 :D love it! It feels really good to have contributed to that! 2025-08-12 18:05:00 Super curious how long it generally takes to fully sync, if you know roughly. I only did dry runs when testing, didn't have the disk space to sync everything 2025-08-12 18:05:21 Really depends on the bandwidth, but roughly one day 2025-08-12 18:12:45 Yeah it's a super fuzzy question, but a day is better than I thought. 2025-08-12 18:13:11 Are the two new mirrors part of the backbone that feeds into the CDN? 2025-08-12 18:13:33 Yeah, they will be 2025-08-12 21:23:41 durrendal: i think last time when i synced it took much longer than a day 2025-08-12 21:23:49 also depending from where you sync ofc 2025-08-12 21:23:58 our master is not that speedy 2025-08-13 09:47:22 I wonder if we want another slightly better CI runner for riscv64? 2025-08-13 09:47:32 I'm thinking of https://pine64.com/product/alpha-one-7b-llm-agentic-generative-ai-agent-eu-version/ 2025-08-13 09:48:42 similar hardware as hifive premiere p550. The CPU is one of the faster ones on the market currently 2025-08-13 09:48:49 and it has 32G ram 2025-08-13 09:50:19 It has passive cooling so I could have it under my desk. 2025-08-13 09:52:33 ncopa: I think having faster HW for rv64 would be hugely beneficial 2025-08-13 09:52:44 You could expense it on cc 2025-08-13 09:52:50 oc* 2025-08-13 09:58:35 Alright, lets do it 2025-08-13 09:59:41 Do we know if we can already boot alpine on it? 2025-08-13 10:03:01 no, but I'm fairly confident we can make it work 2025-08-13 10:03:16 I have a hifive premiere which boots alpine 2025-08-13 10:03:58 the reason I dont want use the hifive premiere as CI is that it has fans, so its a bit noisy 2025-08-13 10:04:07 so I have it powered of most of the time 2025-08-13 10:04:31 and the hifive permiere has only 16G ram 2025-08-13 10:04:48 it was not difficult to get alpine working on it 2025-08-13 10:07:17 I have ordered it 2025-08-13 10:07:43 Nice 2025-08-13 10:21:05 I have spent a few days on a kernel that works both on orangepi rv2 and bananapi bpi-f3 2025-08-13 10:21:08 still not working 2025-08-13 10:21:54 but I found another linux distro which actually works, which use exactly the same kernel sources, so I know it is possible make a kernel config that works (for orangepi rv2 at least) 2025-08-13 10:22:57 I'm setting up the new t1 infra 2025-08-13 10:23:10 Part of it is also updating the fastly config 2025-08-13 10:23:15 I can also check the change you've made 2025-08-13 10:24:38 the change should be ready to go 2025-08-13 10:25:31 i think it can be applied as is. 
it only needs to be done at a time where we have time to monitor that nothing breaks and have time to revert if needed 2025-08-13 10:25:53 I have time now :) 2025-08-13 10:26:04 then i think you can just push it 2025-08-13 10:27:34 What does the change entail? Is it the redirect from dl-cdn.*/alkpine to cdn/ ? 2025-08-13 10:28:22 IIRC the only thing it does is make cdn.a.o/ and dl-cdn.a.o/alpine reuse same cache objects 2025-08-13 10:28:30 ah 2025-08-13 10:30:56 Ok, so it sets the cache hash for cdn to match the urls for dl-cdn, right? 2025-08-13 10:31:03 yeah 2025-08-13 10:32:21 looking at the diff now 2025-08-13 10:32:31 i dont know why i didnt apply it myself earlier 2025-08-13 10:32:35 its a low risk change 2025-08-13 10:33:01 and only affect if (req.http.host == "cdn.alpinelinux.org") 2025-08-13 10:33:06 yup 2025-08-13 10:33:46 Ok, keeping an eye on the hit ratio 2025-08-13 10:33:50 So we could push it 2025-08-13 10:35:01 Want me to push it? 2025-08-13 10:38:52 done 2025-08-13 10:52:33 thanks! 2025-08-13 11:06:02 ncopa: Not so sure if it works 2025-08-13 11:07:28 I constantly get x-cache: MISS, MISS back for cdn, while getting HIT, HIT back from dl-cdn 2025-08-13 11:57:12 hum 2025-08-14 09:22:37 I have switched rsync.a.o to point to rsync.geo.a.o 2025-08-14 09:23:05 checking how it goes 2025-08-14 09:23:22 geodns is not perfect, so it might cause bandwidth issues, and we still have nothing in asia 2025-08-14 09:36:17 https://imgur.com/a/JH1bUDm :-) 2025-08-14 10:41:45 MIrrors seem to be still updating 2025-08-14 13:00:15 why nothing in asia? lack of organizations willing to sponsor infra there? 2025-08-14 13:00:48 lotheac: we did have some offers, but nothing delivered yet 2025-08-14 13:00:56 ok, cool 2025-08-14 13:03:02 i have some (limited) contact with some technical people in some japanese companies... i __might__ be able to ask some key persons nicely at some future meetups 2025-08-14 13:03:24 but nothing definitive to be sure 2025-08-14 13:04:24 and i imagine not necessary if you already have offers :) 2025-08-14 13:05:01 I think something in Japan would be nice of possible 2025-08-14 13:05:53 alright, i'll keep an eye out :) 2025-08-14 13:09:35 i could host from my apartment, but no public ip's so that's a no-go :p 2025-08-14 13:10:17 and i guess the isp would not appreciate it :) 2025-08-14 13:10:42 hehe, no, I would suppose not :p 2025-08-14 13:16:17 I do see we have at least one offer for JPN, might see if that's still available 2025-08-14 13:19:01 Storage is too low 2025-08-15 10:43:23 ncopa: would have time to verify https://manage.fastly.com/configure/services/4ZVgv5JlGHsuH7lfiWfzy/diff/93,94? 2025-08-15 10:43:27 Adding new backends 2025-08-15 14:37:32 I have activated it, and made another change to loadbalance accross all origins, but it appears it still sends all traffic to the first origin :( 2025-08-15 14:38:38 Ok, only now it says all origins are activated 2025-08-15 14:39:21 But still no requests.. 2025-08-15 14:40:49 Maybe I have to be more patient 2025-08-15 14:58:29 Ok, found it. There was some custom config that set the default backend, which overrides the loadbalancing 2025-08-15 21:33:10 ikke: all working ok? 
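In case it helps with the hit-ratio checks above, a small sketch for comparing the two hostnames; the package path is only an example, and Fastly returns extra digest headers when the Fastly-Debug request header is set (the debug header mentioned further down):

    # fetch the same object via both hostnames and compare X-Cache and the digest
    curl -sI -H 'Fastly-Debug: 1' https://dl-cdn.alpinelinux.org/alpine/edge/main/x86_64/APKINDEX.tar.gz | grep -iE 'x-cache|x-served-by|digest'
    curl -sI -H 'Fastly-Debug: 1' https://cdn.alpinelinux.org/alpine/edge/main/x86_64/APKINDEX.tar.gz | grep -iE 'x-cache|x-served-by|digest'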
2025-08-15 21:33:24 clandmeter: yeah, looks alright 2025-08-15 21:51:44 The old t1 servers now serve less than 0.5Gbps 2025-08-18 06:21:18 ncopa: I found out that cdn.a.o/alpine/* and dl-cdn.a.o/alpine/* do share the same cache 2025-08-18 06:21:30 it's only when we leave out /alpine that it's no longer cached 2025-08-18 06:21:42 So something with the logic to detect that /alpine is present is not working 2025-08-18 07:20:43 aha 2025-08-18 07:23:02 You can enable a debug header, and then you can at least see the hash digest 2025-08-18 07:23:07 sadly not the contents 2025-08-18 07:23:17 But that let me confirm this 2025-08-20 09:36:13 oof, gitlab is getting hammered 2025-08-20 09:36:33 urgh 2025-08-20 18:32:31 ,, 2025-08-21 07:47:33 whats up with the ppc64le CI runner? 2025-08-21 07:48:08 some very long running jobs 2025-08-21 07:48:14 It's catching up now 2025-08-21 07:48:21 ah alright nice 2025-08-21 10:26:45 I've removed t1.alpinelinux.org as origin from (dl-)cdn.alpinelinux.org 2025-08-21 12:28:31 ikke: i figured the other day that i need to set up a similar dmvpn test env as you have on your end, since i was unable to hit the same issues otherwise... but i'm puzzled about dmvpn despite trying multiple times to understand the alpine wiki page about it. i imagine it's because i lack any context about the cisco thing that i'm having a hard time grasping it. could i bother you to explain how the alpine dmvpn works? 2025-08-21 12:29:23 lotheac: To be honest, I don't know exactly all the details, just the high-level idea 2025-08-21 12:29:41 the high level details would help a lot :) 2025-08-21 12:31:33 So initially it looks like a hub-and-spoke architecture. You have one or more hubs, where each spoke connects to 2025-08-21 12:31:58 But once a spoke wants to connect to another spoke, it can dynamically setup a direct tunnel between the 2 2025-08-21 12:32:09 So it's a mesh network 2025-08-21 12:33:18 You have a single GRE interface, but it uses multipoint GRE to connect to multiple endpoints via single interface 2025-08-21 12:33:23 so a hub in this nomenclature is the server running opennhrp? 2025-08-21 12:33:41 or do spokes run that too? 2025-08-21 12:33:54 spokes run that as well 2025-08-21 12:34:17 and i need to set up the spoke<->hub gre tunnels myself? 2025-08-21 12:35:10 You have the dmvpn-ca tool that maintains a database of all the sites, vpncs, subnets, certificates and other things 2025-08-21 12:35:26 Once fully configured, you generate a certificate for a hub or spoke that contains all the information 2025-08-21 12:35:43 then you provide that certificate to setup-dmvpn, which makes sure everything is configured, including the gre interface 2025-08-21 12:35:54 i think i might be reading the wrong documentation :) no mention of dmvpn-ca on https://wiki.alpinelinux.org/wiki/Dynamic_Multipoint_VPN_(DMVPN) 2025-08-21 12:36:08 or setup-dmvpn either... 2025-08-21 12:36:52 https://gitlab.alpinelinux.org/alpine/dmvpn-tools i guess i should be reading this instead 2025-08-21 12:36:54 https://gitlab.alpinelinux.org/alpine/dmvpn-tools 2025-08-21 12:36:56 yeah 2025-08-21 12:37:35 i humbly suggest making the aforementioned wiki page nothing but a link to the repo :p 2025-08-21 12:38:58 (if only because it shows up in web search results) 2025-08-21 12:39:31 anyway, thanks! 
this helps a bunch 2025-08-21 12:42:40 I've added an "obsolete" marker 2025-08-21 12:42:58 thanks 2025-08-21 12:46:10 Once setup, you can use `vtysh` to get a router-like interface and see the various protocols 2025-08-21 14:03:17 got my alpha-one (riscv64) machine. it boos a debian based RockOS(?) something. Need to find out where they got the kernel sources from 2025-08-21 14:12:32 looks like this should work: https://github.com/jmontleon/linux-rockos 2025-08-21 14:12:38 its only 6.6 kernel though 2025-08-21 17:11:42 Oops, I've upgraded che-bld-1 to alpine 3.22, but it fails to boot the kernel 2025-08-21 17:11:52 "error: invalid magic number" 2025-08-21 17:12:52 https://gitlab.alpinelinux.org/alpine/aports/-/issues/15263 2025-08-21 18:18:57 :( 2025-08-21 19:30:19 ugh 2025-08-21 19:30:42 symbol grub_is_shim_lock_enabled not found :| 2025-08-21 19:46:05 ok, fixed 2025-08-22 19:09:51 ikke: do you know who I would ping about enabling AWS AMI generation for AWS' ca-west-1 region? 2025-08-22 19:21:21 tomalok in #alpine-cloud, but note that they're personally paying the bills for those images 2025-08-22 19:22:24 good background information, probably belongs in the repo somewhere so people know that AMIs might not be available intentionally 2025-08-23 05:01:47 it's mcrute who's been footing the AWS bill for image storage. i'm just the one that's trying to make sure we don't break his bank... ;) i think in the ideal situation there'd be some way individuals could sponsor having images in certain regions (and/or in other clouds' regions) but there's a bit of work to make a system like that and ensure 2025-08-23 05:01:47 trust that the images are truly official images. the other ideal would be to convince the cloud providers to make Alpine images available in the same way that they do for Debian, etc... 2025-08-23 08:10:34 tomalok: I definitely appreciate all the optimization you've done, the costs are pretty sustainable at the moment and I'm definitely fine with full region coverage 2025-08-23 08:12:05 I was debating building an account importer so that people can host their own based on your golden images. I've just mostly been really busy with other stuff and haven't had the time lately. That being said, it's bugged me for a while that we can't host all images indefinitely due to the cost and I want to make a way that people can easily have their own stability guarantee. 2025-08-23 16:02:43 looks like #235 (ngPTQbXZ) x86-64.ci.alpinelinux.org may be running low on disk space? 2025-08-23 16:02:59 https://gitlab.alpinelinux.org/alpine/aports/-/jobs/1983747#L1700 2025-08-23 16:10:20 It still had 20-50G, that was not enough? 2025-08-23 16:10:56 (found that crond was not running on that server, so the daily cleanup did not happen) 2025-08-23 16:13:04 okay, thanks :) 2025-08-23 16:14:48 to the question, maybe not, it's fine on other arches 2025-08-24 10:55:07 An ansible-playbook feature for the grabs: https://gitlab.alpinelinux.org/alpine/infra/ansible-playbooks/-/issues/1 2025-08-24 11:03:46 And another one: https://gitlab.alpinelinux.org/alpine/infra/ansible-playbooks/-/issues/2 2025-08-24 11:04:19 usrhere: raspbeguy: durrendal: ^ 2025-08-24 11:07:44 I'm not familiar with awall. However the second one should be easy I guess. 2025-08-24 11:09:58 The awall one is mostly symlinking policies 2025-08-24 11:10:07 I already saw there's an ansible module for it 2025-08-24 11:19:01 I'll have a look on it. 
For the time being I'm busy with my newborn daughter so not very active on computer things 2025-08-24 11:22:17 congrats, and no worries 2025-08-24 11:30:10 Thx 2025-08-24 11:35:39 Is it me, or is 2a13:9a40::/32 no longer routable? 2025-08-24 11:41:51 2 total prefixes (0 IPv6, 2 IPv4) 2025-08-24 14:01:22 hi there, whenever I fetch an APKINDEX from cache-ams2100141-AMS, cache-fra-etou8220160-FRA I just never get a reply and the connection is stuck, any other CDN node and it works fine 2025-08-24 14:03:20 adrian: Since when does this happen, do you know? 2025-08-24 14:03:24 ipv4 or ipv6? 2025-08-24 14:05:01 ipv4 and it first happened 5 days ago, though that was when i set up this host so idk if it already occured earlier 2025-08-24 14:06:23 From what region are you connecting? 2025-08-24 14:07:43 I'm connecting from Aachen, Germany via the network of the RWTH Aachen / DFN 2025-08-24 14:08:22 Is it hanging on TCP or HTTP(S)? 2025-08-24 14:10:01 HTTPS, it connects and sends off the entire request and then just doesn't get anything back 2025-08-24 14:10:23 https://mystb.in/21d37967addf27bfdc 2025-08-24 14:16:48 It's strange it only affects a single POP 2025-08-24 14:17:50 We use shielding, which means the request first goes to a specific POP before it reaches our serves. So in this case fra <-> ams <-> origin 2025-08-24 14:19:08 when i use http, not https, it goes through the same ams server but a different one in fra and works ok 2025-08-24 14:19:32 Interestingly enough, for me it almost always goes via London 2025-08-24 14:39:20 adrian: do you get routed often through fra, or occasionally? 2025-08-24 14:40:07 Hmm, just hit x-served-by: cache-ams21048-AMS, cache-fra-etou8220129-FRA and it wen through 2025-08-24 14:40:19 (a different pop in fra, but still) 2025-08-24 14:40:51 i always get routed through fra, i literally cannot make a single request via https to the repo :/ 2025-08-24 14:41:41 adrian: is it any POP in fra, or just that specific one? 2025-08-24 14:42:12 i always get that specific POP, when i use plain HTTP i get a different one and that one works 2025-08-24 14:42:42 hmm actually no 2025-08-24 14:43:04 ikke: I can probably take both of those for you, especially since they're related. 2025-08-24 14:43:07 i get different ones over HTTPS as well, but none of them seem tow ork 2025-08-24 14:44:00 adrian: I don't expect it, but does it make a difference if you disable http/2.0? 2025-08-24 14:44:40 adding --http1.1 to the curl invocation does not change anything, i still only see response headers and no body 2025-08-24 14:45:30 Does it work if you for example use nld-t1-1.alpinelinux.org instead of dl-cdn? 2025-08-24 14:47:04 yes, that host works 2025-08-24 14:47:13 and nld-t1-2.a.o? 2025-08-24 14:47:29 that one also works 2025-08-24 14:53:16 Just curious, if you do not get a response back, how do you get the X-Served-By headers? 2025-08-24 14:53:51 (The paste you provided does not include them) 2025-08-24 14:55:41 oh weird, i usually get the response headers 2025-08-24 14:56:08 you're right, sometimes I get them, sometimes I don't 2025-08-24 14:56:11 ok 2025-08-24 14:56:42 adrian: Can you do the request with `X-Fastly-Debug: 1`? 2025-08-24 14:56:49 See what information that returns? 2025-08-24 14:57:41 https://mystb.in/581b9806d2fd3b89fb 2025-08-24 14:58:18 Oh, sorry, without the X- prefix 2025-08-24 14:59:09 https://mystb.in/7d16e22e1ea3612d1b 2025-08-24 15:01:09 Can you just to be sure also verify with usa-t1-1, usa-t1-2, and ltu-t1-1 as hosts? 
(Just need to know if they work or not) 2025-08-24 15:01:47 those 3 all also work 2025-08-24 15:02:10 ok, good 2025-08-24 15:02:45 Oh, and nld-t1-2, last one 2025-08-24 15:03:02 Oh, you already tried that 2025-08-24 15:24:54 adrian: I've opened a case with fastly 2025-08-24 15:25:32 thanks, let me know if you need more info - i'll be away for a few hours now 2025-08-25 10:28:18 adrian: Are you able to get a packet dump for the request? 2025-08-25 15:33:16 load average: 100.29, 99.76, 85.59 2025-08-25 15:33:44 Oof. LLM scrapers? 2025-08-25 15:43:08 f_: Not sure, possibly triggered by some scraping 2025-08-26 01:31:40 ikke: here you go, https://mystb.in/faac33cb7556158c84 or is curl --trace not enough? 2025-08-26 01:42:17 https://files.postmarketos.cloud/curl.pcap here's a full packet dump in case that helps 2025-08-26 08:46:55 Hello everyone ! o/ 2025-08-26 08:46:57 New contributor to Alpine here. I'm struggling with the registration in gitlab.alpinelinux.org 2025-08-26 08:46:59 I never receive the confirmation email when creating an account. 2025-08-26 08:47:01 My checklist so far: 2025-08-26 08:47:03 1. the mailbox works perfectly fine for other emails 2025-08-26 08:47:05 2. I whitelisted gitlab@gitlab.alpinelinux.org and gitlab@alpinelinux.org to circumvent spam filtering 2025-08-26 08:47:07 3. tried re-sending confirmation emails 24h later 2025-08-26 08:47:09 4. tried registering with an email from a different provider 2025-08-26 08:47:11 None of that worked. Did I miss something, or am I right to suspect there's an issue on the gitlab's side ? 2025-08-26 09:01:37 samaingw: I can check it later, but can you (privately if you want) share the username and email address? 2025-08-26 09:03:41 username: samaingw 2025-08-26 09:03:43 email: samain.gwen@laposte.net 2025-08-26 09:04:19 Thank you for your help ;) 2025-08-26 10:33:56 samaingw: It seems laposte.net is refusing the email. Searching around for LPN007_510, it sounds like they could block emails just because they contain URLs 2025-08-26 10:49:50 ikke: thank you again for your help ! I'll get in touch with the provider to sort things out 2025-08-26 14:16:20 how do i get access to dev.a.o? i'll likely have to upgrade mplayer and the snapshots are uploaded at dev.a.o 2025-08-26 15:42:15 achill: thats a manual process. ping me tomorrow if you dont get help today/tonight 2025-08-26 15:42:40 thanks 2025-08-26 19:22:30 I've created a user for achill 2025-08-27 05:42:05 adrian: Can it be there is something in your network causing the issues? The techs from fastly are unable to reproduce the issue. The pcap seems to be incomplete. I see a FIN,ACK packet, but no FIN packet for example. 2025-08-27 08:35:41 In the past I have seen that fastly drops PMTU packets, so if the MTU is lower somewhere in the path, it may drop some packets 2025-08-27 16:24:46 oh lol just noticed the "latest development" commit list on a.o hasnt been updated since 2025-07-26 2025-08-27 16:32:42 Lazy devs :p 2025-08-27 16:42:15 fixed 2025-08-27 16:42:29 merci 2025-08-27 17:14:33 adrian: ncopa It does not appear to be an MTU issue. It seems to be able to transfer larger amounts of data from both sides 2025-08-27 17:17:19 ncopa: I've moved your x86_64 container to deu-t1-1.a.o. It's reachable via ncopa-edge-x86_64.deu-dev-1.alpin.pw 2025-08-27 17:39:22 cely: your container is moved as well, available via celeste-edge-x86_64.deu-dev-1.alpin.pw 2025-08-27 18:47:15 nmeum: Your x86* containers have been moved as well 2025-08-27 20:33:17 thanks! 2025-08-28 06:31:01 good morning! 
seems like gitlab is strugling 2025-08-28 06:31:13 yeah poor gitlab.. 2025-08-28 06:32:49 Restarting it now. Need to look into what's causing this 2025-08-28 10:31:47 deu2-dev1 is very unresponsive now 2025-08-28 10:33:14 Checking 2025-08-28 10:38:36 i see that it is running out of memory 2025-08-28 10:38:41 mem usage is at top 2025-08-28 10:38:44 and cpu too 2025-08-28 10:39:21 cpu usage may be related memory usage. if running out of mem go apps will run slow 2025-08-28 10:39:53 there were lots of git processes 2025-08-28 10:40:47 yes 2025-08-28 10:41:01 seems like you stopped the gitaly container 2025-08-28 10:41:08 correct 2025-08-28 10:41:51 My suspicion is storage is not holding up, causing these processes to take a long time, and new processes keep being spawned, causing more load 2025-08-28 10:42:01 nod 2025-08-28 10:42:14 where can i find the gitaly config? 2025-08-28 10:42:25 i think there is a knob for concurrent git processes 2025-08-28 10:42:41 /srv/compose/gitlab/storage/config/gitaly 2025-08-28 10:42:48 Yes, there is 2025-08-28 10:43:01 But, the documentation comes with a warning that it should be used with caution 2025-08-28 10:43:37 Maybe it's warrented in our case 2025-08-28 10:43:42 "Enabling limits on your environment should be done with caution and only in select circumstances, such as to protect against unexpected traffic. When reached, limits do result in disconnects that negatively impact users. For consistent and stable performance, you should first explore other options such as adjusting node specifications, and reviewing large repositories or 2025-08-28 10:43:44 workloads." 2025-08-28 10:44:52 i have some memory that I have seen a knob in the gitlab admin gui for concurrency 2025-08-28 10:44:58 Yes 2025-08-28 10:45:06 Or more about timeouts 2025-08-28 10:45:18 Starting the containers again 2025-08-28 10:50:45 maybe "Raw blob request rate limit per minute" 2025-08-28 10:51:15 https://gitlab.alpinelinux.org/admin/application_settings/network#js-git-lfs-limits-settings 2025-08-28 10:51:27 Authenticated Git LFS request rate limit 2025-08-28 10:51:27 Enable authenticated Git LFS request rate limit 2025-08-28 10:51:27 Helps reduce request volume (for example, from crawlers or abusive bots) 2025-08-28 10:51:35 We do not use LFS 2025-08-28 10:52:15 ok 2025-08-28 10:52:29 and git http rate limits? 2025-08-28 10:54:21 Trying first to find out what is causing these processes 2025-08-28 10:56:16 https://runbooks.gitlab.com/gitaly/git-high-cpu-and-memory-usage/ 2025-08-28 10:56:59 At the moment quite some upload-pack processes 2025-08-28 11:03:37 we have prometheus monitoring set up. do we have graphs somewhere from the collected data? 2025-08-28 11:05:57 maybe I should ask for advice in #gitlab 2025-08-28 11:06:03 on libera 2025-08-28 11:07:38 The channel is quite quiet 2025-08-28 11:07:44 I'm lurking there 2025-08-28 11:08:00 i noticed. you are the one responding to peoples questions :) 2025-08-28 11:08:44 :-) 2025-08-28 11:10:25 ncopa: the metrics endpoint is enabled, but when I check the endpoint, it says it's disabled and to enable it where it's already enabled 2025-08-28 11:11:03 Apparently I need to set prometheus_multiproc_dir 2025-08-28 11:15:34 you should mention that it's mainly git processes that seem to be causing the load 2025-08-28 11:28:46 asking chatgpt, which is more reponsive than #gitlab 2025-08-28 11:28:51 Traefik + go-away show massive network, so you’re getting hammered and Git (pack generation) is doing most of the work. 
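On the question of what is spawning those git processes, a rough sketch for inspecting them on the host; it assumes a procps-style ps (busybox ps has fewer options) and the grep pattern is only approximate:

    # list the busiest git child processes and their parents
    ps -eo pid,ppid,etime,pcpu,args | grep '[g]it ' | sort -k4 -rn | head -20
    # inspect the environment of one such process (replace <pid>) to see what request spawned it
    tr '\0' '\n' < /proc/<pid>/environ | grep -i correlation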
2025-08-28 11:34:10 we dont use fastly or cloudflare in front of our gitlab instance, right? 2025-08-28 11:52:14 no 2025-08-28 11:52:22 not sure that would help a lot 2025-08-28 12:01:18 we are getting hammered and I think fastly has DDoS protections 2025-08-28 12:01:33 and I believe they have protections for AI scrapers as well 2025-08-28 12:01:40 at least cloudflare has 2025-08-28 12:03:14 We should first identify the source, for example correlate specific requests to these git processes 2025-08-28 12:04:42 There's a lot of pack-object processes, I don't think that's due to normal web traffic 2025-08-28 12:18:03 I see lots of git_upload_pack requests for random IPs coming from a hetzner IP 2025-08-28 12:23:07 ncopa: The long running processes seem to be indeed triggered by clone / fetch requests over https 2025-08-28 12:23:12 so not normal web requests 2025-08-28 12:39:21 ncopa: I'm keeping an eye on tail -f -n100 log/gitlab/production_json.log | jq 'select(.action == "git_upload_pack") | {path: .path, remote_ip: .remote_ip, time: .time, correlation_id: .correlation_id}' 2025-08-28 12:39:37 if you look at the environment of the git process, you'll see a CORRELATION_ID that matches 2025-08-28 13:51:59 ncopa: git 2.49.0 apparently has some faster packing generation, but we are already running that version 2025-08-28 13:52:15 although, it may be opt-in 2025-08-28 14:32:29 I've blocked anonymous cloning from Hetzner IPs, that seems to have reduced the amount of clone operations 2025-08-28 14:32:43 now it's mostly cat-file processes, but the load is managable 2025-08-28 16:12:51 thanks for taking care of it 2025-08-28 16:13:14 as I understand, the pack-object processes came from many git clones? 2025-08-28 16:14:27 either git clone or git fetch 2025-08-28 16:17:04 ftr, it's these processes that are causing load. pack-object can be the result of other operations as well 2025-08-28 16:23:49 https://gitlab.alpinelinux.org/jarlungoodoo73 I wonder why this user created non-alpine repos 2025-08-28 16:24:19 Users do that some times 2025-08-28 16:25:43 https://gitlab.alpinelinux.org/Samg217 and this 2025-08-28 16:25:59 i wonder if they are trying to abuse the CI or something 2025-08-28 16:27:01 this one is almost 7GB https://gitlab.alpinelinux.org/admin/projects/jarlungoodoo73/docs-content 2025-08-28 16:30:00 sometimes they do 2025-08-28 16:30:03 abuse CI 2025-08-28 16:30:07 but haven't seen it recently 2025-08-28 16:30:58 ncopa: oof, that's bad 2025-08-28 16:31:03 crazy question, but does Alpine really need to run their own gitlab instance? I'm sure moving to github/gitlab.com has been discussed before, but maybe things have changed since the last discussion? 2025-08-28 16:31:21 iggy: pmos moved away from gitlab.com even 2025-08-28 16:31:36 We did use github before but chose to move away from that as well 2025-08-28 16:31:56 what was the reason that pmos moved away from gitlab.com? 2025-08-28 16:32:15 well, AI bots aren't going away in our lifetime, so just figured I'd throw that idea out there 2025-08-28 16:32:21 https://postmarketos.org/blog/2024/10/14/gitlab-migration/ 2025-08-28 16:32:43 originally why we chose to run our own instance was that we wanted all our infra run on alpine. eg dog fooding 2025-08-28 16:33:23 And run on open source as much as possible 2025-08-28 16:33:55 "Nowadays, users of gitlab.com are required to provide a valid phone number and credit card information when setting up an account." 
2025-08-28 16:33:57 some of the reasons pmos mention for moving away are exactly options you're mentioning to fight the bots (i.e. CDN bot protection) 2025-08-28 16:35:11 anywho, didn't mean to distract too much, it just seems like you guys are spending a lot of time on the boring parts of infrastructure these days 2025-08-28 16:35:23 i mentioned our problem to my coworkers (k0s ppl) and the first thing they asked: do you use cloudflare in front to protect against DDoS? 2025-08-28 16:35:39 iggy: I think those are valid questions 2025-08-28 16:35:54 ncopa: I don't think CloudFlare would consider this a DDoS 2025-08-28 16:36:10 maybe not 2025-08-28 16:36:26 but they have protections against AI bots 2025-08-28 16:36:33 what we do with go-away 2025-08-28 16:36:36 does CF ddos protection from bot protection? I thought they just had "protection" 2025-08-28 16:36:37 This seems to be just git clients 2025-08-28 16:36:45 aha 2025-08-28 16:37:24 *does CF distinguish ddos protection... 2025-08-28 16:37:31 CF has some AI scraper protection 2025-08-28 16:37:41 and Fastly probably has too nowadays 2025-08-28 16:37:52 at least they write blog articles about it 2025-08-28 16:37:57 The way I implemented go-away is that normal users should notice very little about it 2025-08-28 16:38:10 CF means that every user has to solve a 'captcha' 2025-08-28 16:38:11 i think it works fairly well 2025-08-28 16:38:22 captcha is annoying 2025-08-28 16:39:19 i would be more than ok to outsource the gitlab instance 2025-08-28 16:39:59 but it does not look like we can 2025-08-28 16:40:15 other question is if we can do something with our setup 2025-08-28 16:40:52 maybe we can move gitaly to a dedicated server? 2025-08-28 16:40:59 Perhaps 2025-08-28 16:41:23 or move postgres to dedicated server (if that even makes sense) 2025-08-28 16:41:58 or move it to a bigger server 2025-08-28 16:42:23 or tweak settings (I'm trying to find where gitaly has the pack cache?) 2025-08-28 16:42:28 It does make sense to split things up, but it adds complexity 2025-08-28 16:42:51 moving to bigger server would be simpler probably 2025-08-28 16:45:12 Another option (but would probably require a lot of coordination) os to truncate aports 2025-08-28 16:45:28 I believe it's mainly aports that's causing so much load 2025-08-28 16:45:46 gitlab-gitlab-1 container is what uses most memory 2025-08-28 16:46:10 yeah 2025-08-28 16:46:24 the gitlab container hosts puma 2025-08-28 16:46:41 i kinda anticipated slowdown of aports tree over time 2025-08-28 16:46:56 which is why i tried to avoid many smaller files in aports repo 2025-08-28 16:47:11 for example the checksums are embedded in APKBUILD 2025-08-28 16:47:16 But the depth of history itself is also an issue 2025-08-28 16:47:21 otherwise e'd double the number of files in the tree 2025-08-28 16:47:34 269089 commits 2025-08-28 16:47:50 how does that compare to linux kernel tree? 2025-08-28 16:48:18 1.3M commits 2025-08-28 16:48:55 1322572 2025-08-28 16:49:08 so git should be able to handle this in theory 2025-08-28 16:49:34 yes, and git is handling it, but there are certain operations that are expensive and don't scale well 2025-08-28 16:49:46 like git clone 2025-08-28 16:50:30 Over time git is adding more feature to scale better 2025-08-28 16:58:41 btw ncopa is there any reason for checksums for local files like patches? 
i find them very useless and just take my time running checksum over and over :p 2025-08-28 16:59:49 i dunno if they are useless 2025-08-28 17:00:06 maybe they are 2025-08-28 17:00:36 on the other hand, it helps to detect if git add was forgotted in a commit 2025-08-28 17:01:07 so you dont unexpectedly push things without updates you thought you included 2025-08-28 17:04:20 I think ollieparanoid once tried to make that not be a requirement for local files anymore 2025-08-28 17:05:01 yeah true thats a arguments 2025-08-28 17:05:06 *argument 2025-08-29 00:33:43 ncopa: $source is enough to catch missing git add. I think the main blocker at the moment is that the CI only checks for changes of APKBUILD files and not all files in the directory.
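For illustration, one way a CI job could pick up every changed file per aport instead of only APKBUILDs; a rough sketch, assuming a GitLab merge-request pipeline (CI_MERGE_REQUEST_DIFF_BASE_SHA is GitLab's predefined variable, the rest of the wiring is hypothetical):

    # list every changed path in the MR, then reduce to the affected aport directories
    git diff --name-only "${CI_MERGE_REQUEST_DIFF_BASE_SHA:-origin/master}"...HEAD \
        | awk -F/ 'NF>=2 {print $1"/"$2}' | sort -u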