2025-08-01 08:02:59 ikke: do you run something similar to unattended-upgrades on all the servers? and if not - how do you ensure that all the servers have latest security updates installed? 2025-08-01 20:07:02 usrhere: Not really. Most software runs in containers, so unettended-upgrades does little for that. I'm moving things more towards kubernetes, where it's easier to automatically redeploy things. I already run renovate to keep images up-to-date as much as possible 2025-08-01 20:09:49 thanks, makes sense 2025-08-03 00:40:21 Ariadne: ikke: another thing we can do to minimise abuse from matrix side is to require matrix rooms to be accessible only via "matrix space" 2025-08-03 00:40:41 generally most if not all those spammers don't know how to use spaces 2025-08-03 16:21:36 panekj: you can't do that with the portal channels 2025-08-03 16:21:50 I think that requires at least Admin for that? But only the bridgebot has Admin 2025-08-03 16:23:12 btw pmOS has a bot that does cross-room banning 2025-08-06 14:31:02 f_: thank you for destroying my dreams 2025-08-06 14:50:35 pj: you're welcome 2025-08-06 14:50:46 hit me up when you have other dreams you want me to destroy 2025-08-06 14:51:21 btw I'm missing context 2025-08-06 14:59:34 might have been a dream about appservice-irc 2025-08-06 14:59:39 which is weird. 2025-08-06 15:20:05 fwiw, i can't recall why +M and/or +R isn't set, but i think we could be spared a lot of nonsense by using them. then exempt (+e) the webirc hostmask 2025-08-06 15:21:30 i dunno if +e would be needed for the matrix bridge. probably 2025-08-06 15:21:53 because then they can't talk? 2025-08-06 15:23:24 matrix users through the bridge are actual irc users 2025-08-06 15:25:27 i understand; i was writing extemporaneously. 2025-08-06 15:26:40 We have only set those when there were larger spam waves 2025-08-06 15:34:27 is there any argument against them? (if webirc and matrix are exempt) 2025-08-06 15:46:05 well matrix spam still goes through so it's useless 2025-08-06 15:46:36 unless you mean for the skraito stuff? 2025-08-06 15:49:27 not useless for irc spam. channels that have +M/+R do get noticeably less. (it won't stop determined a-holes though) 2025-08-06 15:49:46 invoked: how much irc spam did go through lately? 2025-08-06 15:49:53 Apart from the skraito stuff 2025-08-06 15:50:13 i don't have those statistics 2025-08-06 15:50:42 I'm not looking for statistics, just "not many" or "a little bit" or "a huge lot" 2025-08-06 15:50:43 but #tor for instance sets +R and it makes a difference. the only spam i've seen there in a while has come from matrix. 2025-08-06 15:50:56 +R also is extremely annoying 2025-08-06 15:51:09 so is spam 2025-08-06 15:51:09 OFTC doesn't have sasl, and their nickserv AJOIN thing is broken 2025-08-06 15:51:15 it's more annoying than spam 2025-08-06 15:51:26 because it means I don't autojoin the channel 2025-08-06 15:51:27 i'm more partial to +M but not my house. 2025-08-06 15:51:40 and then it means I miss out on dev chat 2025-08-06 15:51:55 until I notice, "huh why is #alpine-devel not there?" 2025-08-06 15:52:12 I'd be for +M if needed 2025-08-06 15:52:33 but the skraito spam really isn't anything that can't be solved with some spam filtrs 2025-08-06 15:52:36 filters 2025-08-06 15:53:11 alongside +b *skraito*!*@* 2025-08-06 15:53:15 i don't think there will ever be a silver bullet. 
but the tradeoffs on some things are worth it, imo 2025-08-06 15:53:33 Whatever you do, please don't +R 2025-08-06 15:53:42 not up to me, just teeing the discussion 2025-08-06 15:54:02 I cannot autojoin +R channels because no SASL (and no, certfp does not fix that) 2025-08-06 15:54:24 and I'm sure I'm not alone 2025-08-06 15:54:34 +R or +M or whatever else means anyone on matrix can't talk so I don't understand the conversation 2025-08-06 15:54:35 works on tor. :-/ 2025-08-06 15:55:16 invoked: #tor you mean? I fail to autojoin that channel 90% of the time and have to remember to rejoin when I wake up and my bouncer reconnects at night 2025-08-06 15:55:27 if I ever notice my bouncer reconnecting, even 2025-08-06 15:55:50 i mean, i never have a problem autojoining over a tor connection. but then i'm not in hundreds of channels 2025-08-06 15:56:02 tor is a special case 2025-08-06 15:56:03 tor or not is irrelevant in this case 2025-08-06 15:56:11 I think? 2025-08-06 15:57:39 pj: well, ones registered via nickserv can 2025-08-06 15:57:49 but autologin is completely broken on the bridge 2025-08-06 15:57:57 yes, but who does that on matrix 2025-08-06 15:58:05 (except for me, achill and probably you) 2025-08-06 15:58:19 I don't use matrix as an IRC bouncer, so no I don't do that 2025-08-06 15:58:27 and you are not authenticated at the moment 2025-08-06 15:58:48 +e bridge hostmask should work, or am i high 2025-08-06 15:58:50 because bridge suck 2025-08-06 15:58:56 invoked: not for +M 2025-08-06 15:59:13 hm 2025-08-06 15:59:18 guess i'm high 2025-08-06 15:59:22 for +R ye 2025-08-06 15:59:26 +s 2025-08-06 15:59:35 but you should use +I instead of +e 2025-08-06 15:59:46 that's for +R as i understood 2025-08-06 15:59:52 i haven't looked at the docs 2025-08-06 15:59:56 [pj] user has identified and verified with services 2025-08-06 16:00:02 are you high, f_? 2025-08-06 16:00:12 panekj: probably 2025-08-06 16:00:39 but f_|mo is not authenticated (: 2025-08-06 16:00:47 neither is f_|pmOS 2025-08-06 16:00:51 or f_[m] 2025-08-06 16:01:06 well, yes, which is why I said "probably" 2025-08-06 16:01:22 and you said I'm not authed 2025-08-06 16:01:34 you weren't last time I checked ;) 2025-08-06 16:02:33 but yeah I don't use matrix for IRC (I use irssi for that), so I don't need to be authed to nickserv on these 2025-08-06 16:03:14 i thought you were using senpai, or did i confuse you with someone else 2025-08-06 16:03:20 I used senpai 2025-08-06 16:03:46 but then I went to try irssi again 2025-08-06 16:03:59 and then went slightly too far in customising? 2025-08-06 16:04:12 (so it doesn't even look like stock irssi anymore) 2025-08-06 16:04:16 as jess said, irssi is an old man 2025-08-06 16:04:22 :p 2025-08-06 16:05:45 come to weechat 2025-08-06 16:05:51 no 2025-08-06 16:05:52 we have 2025-08-06 16:05:53 it's slow 2025-08-06 16:05:54 uhh 2025-08-06 16:05:56 terrible scripts 2025-08-06 16:06:00 and weechat-matrix-rs 2025-08-06 16:06:09 weechat is slow 2025-08-06 16:06:23 less slow than element but still slow 2025-08-06 16:06:30 you are slow 2025-08-06 16:06:32 it works fine 2025-08-06 16:06:43 It was very slow when I had it join 500 channels 2025-08-06 16:07:06 and irssi is fine anyway 2025-08-06 16:07:12 it only locks up for 2 minutes to load whole Matrix HQ room :> 2025-08-06 16:07:22 I have, uhhhhhh, 10 perl scripts loaded 2025-08-06 16:07:47 i do think weechat is slow with a large # of channels. i remember hearing that before, not just from f_ 2025-08-06 16:07:49 which I left btw. 
but thanks to matrix:TM: I've been graciously forced to experience the feeling of being part of that room 2025-08-06 16:08:26 I could use irssi-matrix 2025-08-06 16:08:29 but nah 2025-08-06 16:09:15 probably all offtopic here though 2025-08-06 16:09:44 as if this channel has any activity 2025-08-06 16:10:04 o/ 2025-08-06 16:10:06 well it's the space for algitbot to practise free speech 2025-08-06 16:10:17 we should not disturb that space :p 2025-08-06 16:10:18 algitbot: I SAID HELLO 2025-08-06 16:10:30 algitbot: you ok? 2025-08-06 16:10:39 #o 2025-08-06 16:10:42 \o 2025-08-06 16:10:45 \o/ 2025-08-06 16:10:50 smh 2025-08-06 16:11:21 though is algitbot connected via tls 2025-08-06 16:45:16 I don't think so 2025-08-06 16:45:58 It does not have tls support 2025-08-06 16:46:53 drats 2025-08-06 16:46:58 :> 2025-08-06 16:47:55 faq docker 2025-08-06 16:48:19 mhm 2025-08-06 16:53:15 ikke: uh, that's bad 2025-08-06 16:53:29 doesn't work with stunnel or similar? 2025-08-06 16:57:01 it would be good to import sircbot to gitlab or replace it with something else :> 2025-08-06 17:00:10 nah sircbot is nice 2025-08-06 17:00:17 with stunnel it makes for a nice companion 2025-08-06 17:01:18 other than that I have to restart it from time to time because it no longer responds to certain things 2025-08-06 17:04:49 strange never got that 2025-08-06 19:06:58 lotheac: I was trying the gitlab-runner deployment again, but before I did that, I updated the cluster. Now it seems (not sure if related to the upgrade or something else), the network is broken. Several services complain that 10.96.0.1:443 is not reachable, but not sure why yet 2025-08-07 01:03:50 ikke: that’s generally the api server’s internal address. problem with CNI? 2025-08-07 03:15:18 ikke: please let me know what you did to update so that i can try to repro it 2025-08-07 05:48:49 lotheac: to update it, I changed the k0s version to: "version: 1.33.3+k0s.0" 2025-08-07 05:50:07 Note that I have another cluster upgraded to the same version without issues 2025-08-07 05:50:29 After upgrading, I could not get any logs for any pods, nor was the metric server returning any results 2025-08-07 05:51:16 After searching, I saw that that could be related to the api externalAddress, which I noticed was not set. So I set it, applied again 2025-08-07 05:52:23 did that help? 2025-08-07 05:53:01 The first problems, but then some workloads had the issue with contacting the api server internally 2025-08-07 05:54:49 I stopped the cluster, because 2 out of 3 controllers started using a lot of CPU (and spamming the logs a lot) due to those connection issues 2025-08-07 05:55:46 i would investigate the CNI containers/logs to see if there is any hints there. the apiserver being unavailable at the internal address is just the symptom of the internal network being broken 2025-08-07 05:56:20 i gotta do some paid work right now, so i'll get back to you later :) 2025-08-07 05:57:15 same for me 2025-08-07 11:42:44 lotheac: On one of the nodes: [ERROR][24126] cni-plugin/plugin.go 593: Final result of CNI DEL was an error. 
error=error getting ClusterInformation: Get "https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.96.0.1:443: connect: connection refused 2025-08-07 11:44:40 The calico-kube-controller is in a CrashLoopBackoff 2025-08-07 12:08:07 it seems like something of a chicken and egg that the CNI containers on any one node would crash if they cannot reach the apiserver through an address that a working CNI itself is supposed to provide :P 2025-08-07 12:08:19 so maybe that is not the root problem 2025-08-07 12:08:57 s/provide/make routable/ 2025-08-07 12:09:21 Yeah, I'm trying to figure out how the 10.96.0.1 address is supposed to be routed 2025-08-07 12:09:29 ECONNREFUSED sounds interesting though, so there _is_ something at that address 2025-08-07 12:09:43 because it responds with RST 2025-08-07 12:09:46 ahuh 2025-08-07 12:09:49 so it is routable, but... 2025-08-07 12:11:08 i don't recall how exactly k0s provides the apiserver, but i would assume that all the control plane nodes have _some_ process on them meant to listen for apiserver requests 2025-08-07 12:11:44 it could be iptables/nft too, of course 2025-08-07 12:12:39 tcp 0 0 :::6443 :::* LISTEN 11298/kube-apiserve 2025-08-07 12:13:02 But I'm not sure how 10.96.0.1 is routed 2025-08-07 12:13:21 the CNI is supposed to handle that part 2025-08-07 12:14:02 depending on what CNI, it usually does it with iptables or nft or ebpf and some coordination between nodes 2025-08-07 12:14:18 or userland proxies 2025-08-07 12:14:32 But what is listening on port 443? 2025-08-07 12:17:34 it's probably supposed to be the apiserver; cni's handle port redirection stuff too 2025-08-07 12:21:30 the fact that it listens to 6443 on the host addresses most likely matters very little 2025-08-07 12:22:28 unless of course connectivity to those host addresses is being blocked by other firewall rules :) 2025-08-07 12:22:53 There are no other firewall rules, everything is managed by k0s / kubernetes 2025-08-07 12:26:29 in that case, sounds like the CNI is not doing something it should be doing 2025-08-07 12:27:36 you're able to access the k8s api though? 
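As an aside to the exchange above, a rough way to check how the 10.96.0.1 ClusterIP is wired up on a node; this is only a sketch, assuming the usual kube-proxy/Calico iptables NAT chains (chain names vary per setup, and Calico's eBPF mode bypasses them entirely):

    # confirm the ClusterIP and target port of the default kubernetes service
    kubectl describe -n default svc/kubernetes
    # list the NAT rules that should rewrite 10.96.0.1:443 to a node's :6443
    iptables -t nat -S | grep 10.96.0.1
    # or, if the rules are programmed via nftables
    nft list ruleset | grep -B2 -A2 10.96.0.1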
2025-08-07 12:27:39 yes 2025-08-07 12:27:48 Externally everything seems to be working 2025-08-07 12:27:59 It seems internal routing is broken somehow 2025-08-07 12:28:31 you could kubectl describe -n default svc/kubernetes and it will probably tell you that that svc is 10.96.0.1:443 just to verify 2025-08-07 12:29:02 but yeah it sounds like the pods cannot reach the internal network in that case 2025-08-07 12:29:14 which could mean that it is a problem with the container runtime 2025-08-07 12:30:28 try something like kubectl run --rm -ti --image=curlimages/curl:latest foo -- -v https://google.com/ 2025-08-07 12:30:34 I completely stopped the control plane 2025-08-07 12:30:59 that would tell you if the container network can reach external networks 2025-08-07 12:32:59 That command seems to hang 2025-08-07 12:33:11 like it does not even create the container 2025-08-07 12:33:26 can you kubectl describe pod/foo in another terminal to see what's going on 2025-08-07 12:34:01 This is svc/kubernetes: https://tpaste.us/8Bg8 2025-08-07 12:34:41 yeah, that's exactly what i expected: the service is being provided on 10.96.0.1:443, and the targerport (what the backend is actually listening on) is 6443 2025-08-07 12:34:47 targetport* 2025-08-07 12:35:11 lotheac: describing on the pod results in similar errors not being able to reach 10.96.0.1:443 2025-08-07 12:35:13 and the CNI is supposed to handle that redirection/forwarding 2025-08-07 12:35:23 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "248d06b3a4e8dbca7625cec572c03e6e9560e501e0eb5b3550fe46344625d725": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.96.0.1:443: connect: connection refused 2025-08-07 12:36:21 I did verify the calico network itself is working (I can ping each node from each other node via the vxlan.calico interface) 2025-08-07 12:36:37 ikke: ok, so maybe that means that the pod was scheduled to run on a different node than the apiserver... which of course it was, and then the apiserver you are contacting can't itself talk to the node it was scheduled on 2025-08-07 12:36:53 are there only controller+worker nodes in this cluster? or workers as well? 2025-08-07 12:37:10 workers as well, I'll taint them 2025-08-07 12:37:53 hm. or maybe pod creation would in any case contact the apiserver, not sure 2025-08-07 12:38:01 even less sure if it should happen via the internal addr like that 2025-08-07 12:38:34 because... the calico pods are supposed to be what provides that routing/connectivity 2025-08-07 12:38:38 (i think) 2025-08-07 12:39:55 how are you talking to the apiserver again? was it external lb? 2025-08-07 12:40:22 Yes 2025-08-07 12:40:24 does that lb show all its endpoints as healthy? 2025-08-07 12:40:32 it does 2025-08-07 12:41:30 The controller+worker nodes are of course tainted themselves as well, so the test pod would never be scheduled there 2025-08-07 12:41:47 right 2025-08-07 12:41:52 it would need a toleration 2025-08-07 12:42:46 i was probably chasing the wrong thread there anyway, i think the error message is saying that the container runtime itself wants to contact the apiserver at 10.96.0.1:443 2025-08-07 12:43:40 on the controller nodes' host network, are you able to establish a tcp connection to 10.96.0.1:443? 
2025-08-07 12:44:14 no 2025-08-07 12:44:20 "Could not connect to server" 2025-08-07 12:44:22 that _may_ be related 2025-08-07 12:44:27 but i'm not sure it is 2025-08-07 12:44:47 The host network has no route to 10.96.0.0/12 2025-08-07 12:45:40 ok, i think i need to set up my test k0s cluster again to see if i can get this issue to happen 2025-08-07 12:45:57 we're not getting anywhere like this :) 2025-08-07 12:47:34 I can try to switch it back to kube-router to see if everything on the control nodes is working 2025-08-07 12:48:37 wasn't there some blocker why we switched to calico to start with 2025-08-07 12:48:56 It would not work over dmpvn 2025-08-07 12:49:01 right 2025-08-07 12:53:38 It's a bit confusing as you say, calico itself is unable to start because it cannot reach that address 2025-08-07 12:53:47 cni-installer/install.go 478: Unable to create token for CNI kubeconfig error=Post "https://10.96.0.1:443/api/v1/namespaces/kube-system/serviceaccounts/calico-cni-p 2025-08-07 12:53:49 lugin/token": dial tcp 10.96.0.1:443: connect: connection refused 2025-08-07 12:55:14 probably calico has multiple components and the thing that is relying on the apiserver being available is not the thing that is broken 2025-08-07 12:58:16 But I would expect the CNI installer not to depend on the CNI already running 2025-08-07 12:58:35 correct 2025-08-07 12:59:10 i installed a new cluster with 1.33.3+k0s.0 onto fresh linode nodes and it works fine... sigh 2025-08-07 12:59:21 ci-cplane-1-c0:~# nc -v 10.96.0.1 443 2025-08-07 12:59:21 10.96.0.1 (10.96.0.1:443) open 2025-08-07 12:59:32 hmm, interesting 2025-08-07 12:59:41 Let me reboot the nodes 2025-08-07 12:59:55 I did start and stop the cluster a couple of times 2025-08-07 13:01:37 i'm gonna try creating it at 1.33.1 and seeing if the upgrade breaks it 2025-08-07 13:03:18 What does `ip route get 10.96.0.1` return for you? 
2025-08-07 13:03:49 i already destroyed that cluster, let me get back to you in a sec :D 2025-08-07 13:10:16 So I removed the api externalAddress, and calico-node is running again :/ 2025-08-07 13:11:06 But, I cannot get any logs of pods, nor is the metric service working 2025-08-07 13:11:10 I'm confused 2025-08-07 13:11:25 So all workloads are happy 2025-08-07 13:13:13 ok, so pod->internal net routing works, and so does lb->apiserver (otherwise you would not be able to see the pod statuses either), but maybe the apiserver pod cannot reach the kubelet api (running on each node) to get you the logs 2025-08-07 13:13:41 s/apiserver pod/apiserver/, might not be a pod in itself on k0s 2025-08-07 13:14:32 isn't, in fact 2025-08-07 13:14:53 ci-cplane-1-c2:~# ip route get 10.96.0.1 2025-08-07 13:14:53 10.96.0.1 via 172.233.50.1 dev eth0 src 172.233.50.200 uid 0 2025-08-07 13:14:53 cache 2025-08-07 13:15:09 this is a 1.33.1+k0s.0 upgraded to 1.33.3+k0s.0 cluster 2025-08-07 13:15:36 panic: unable to load configmap based request-header-client-ca-file: Get "https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp 10.96.0.1:443: 2025-08-07 13:15:40 172.233.50.1 is the default gateway, so i think the routing table tells us exactly nothing 2025-08-07 13:15:46 yeah 2025-08-07 13:15:56 as in the CNI probably intercepts this before it actually leaves the machine 2025-08-07 13:16:29 nc at least says the ports are open now 2025-08-07 13:16:41 on the host 2025-08-07 13:17:33 seems like calico is (at least in this configuration) using iptables for its stuff, based on iptables -L or nft list ruleset 2025-08-07 13:21:23 fwiw... kubectl logs can fetch any pod's logs for me in this configuration without issue 2025-08-07 13:23:30 If you do nsenter -t -n 2025-08-07 13:23:34 can you ping 10.96.0.1? 2025-08-07 13:23:41 I used the metrics-server pod 2025-08-07 13:24:11 I cannot even ping the default gw 2025-08-07 13:24:25 but I can curl google.com 2025-08-07 13:24:48 ❯ kubectl run --overrides '{"spec":{"tolerations":[{"operator":"Exists"}]}}' --rm -ti --image=curlimages/curl:latest foo -- -v https://google.com/ 2025-08-07 13:24:50 worked fine 2025-08-07 13:24:57 let me see about metrics-server 2025-08-07 13:25:36 ok, this works: nc -v 10.96.0.1 443 2025-08-07 13:26:07 yeah, ping doesn't work for me either 2025-08-07 13:26:12 https://termbin.com/g98l 2025-08-07 13:28:19 Error from server: Get "https://172.16.250.10:10250/containerLogs/kube-system/metrics-server-5f45c7b665-qvg9k/metrics-server": No agent available 2025-08-07 13:28:52 So probably an issue with konnectivity-agent 2025-08-07 13:29:43 FYI, that's the dmvpn gre address of the node 2025-08-07 13:30:43 right, 10250 is the kubelet api port 2025-08-07 13:31:46 i have no idea what "No agent available" means though 2025-08-07 13:34:02 based on some searching i am assuming that's an error message actually returned from the http api (as opposed to some errno that happened while trying to establish a connection) and it indeed seems related to konnectivity 2025-08-07 13:34:17 so are the konnectivity-agent pods on each node ok? 
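A quick way to answer that question; a minimal sketch, assuming the agents carry the upstream k8s-app=konnectivity-agent label (the label name is an assumption and may differ on k0s):

    # are the konnectivity agents running on every node, and what do they log?
    kubectl -n kube-system get pods -o wide -l k8s-app=konnectivity-agent
    kubectl -n kube-system logs -l k8s-app=konnectivity-agent --tail=20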
2025-08-07 13:34:19 And now suddenly it's working 2025-08-07 13:35:00 yes, they are all running, and just a single line in the logs 2025-08-07 13:35:00 ha :D sounds like kube networking alright 2025-08-07 13:35:10 1 clientset.go:285] change detected in proxy server count (was: 0, now: 3, source: "KNP server response headers") 2025-08-07 13:36:23 i searched for that string and found only https://irclogs.alpinelinux.org/%23alpine-infra-2025-06.log 2025-08-07 13:37:20 typical 2025-08-07 13:37:44 fun :p 2025-08-07 13:37:45 restarting k9s (which I use to inspect the state) and no longer working 2025-08-07 13:37:57 So I assume it has to do with which node I end up with through the LB 2025-08-07 13:39:12 https://docs.k0sproject.io/head/high-availability/ 2025-08-07 13:39:29 It does mention the api externalAddress 2025-08-07 13:39:42 you could verify that hypothesis by port-forwarding and trying to talk to each node individually... but by "no longer working" what do you mean as the failure mode 2025-08-07 13:40:06 metrics not available, not able to obtain logs 2025-08-07 13:40:41 ah as in the apiserver itself responds and is healthy, but is not able to reach the kubelet on the node that metrics-server is running on 2025-08-07 13:41:09 yeah you'd have about a one in three chance of hitting the same node that it's on, and if it doesn't work otherwise... 2025-08-07 13:42:12 Error from server (BadRequest): previous terminated container "metrics-server" in pod "metrics-server-5f45c7b665-qvg9k" not found 2025-08-07 13:42:18 Error from server: Get "https://172.16.250.10:10250/containerLogs/kube-system/metrics-server-5f45c7b665-qvg9k/metrics-server?previous=true": No agent available 2025-08-07 13:42:21 'kubectl get node -o wide' should tell you what the node internal addresses are considered to be -- and then you could try to see if you can reach them from another node using those addresses on :10250 2025-08-07 13:44:24 The dmvpn gre addresses, I can curl to each node (get a 404 back) 2025-08-07 13:44:24 maybe the in-container network cannot route to the gre internal addresses (or return traffic of that is not routed properly) 2025-08-07 13:48:37 my metrics-server is at pod ip 10.244.63.129 on the c0 node... and routable from c0 itself and c2 host netns https://termbin.com/k9cm 2025-08-07 13:50:14 trying to find the logs for konnectivity-server 2025-08-07 13:50:17 and conversely the pod network can reach either node https://termbin.com/7az4f 2025-08-07 13:51:52 https://tpaste.us/r5mL 2025-08-07 13:52:29 likely a symptom rather than a cause 2025-08-07 13:52:31 yes 2025-08-07 13:57:58 checking all the config 2025-08-07 13:58:43 This seems to be correct: https://tpaste.us/PQPx 2025-08-07 14:00:55 dial tcp 172.235.190.129:6443: connect: connection refused 2025-08-07 14:01:05 That seems to connect to the external ip 2025-08-07 14:01:21 my current thinking is: we are observing a failure with something from container netns trying to talk to a host dmvpn address (that or the return traffic of such) 2025-08-07 14:01:55 _different_ host, that is 2025-08-07 14:03:17 so i don't really know how dmvpn does its magic, but since calico is doing its own stuff with iptables... maybe there's some kind of conflict or assumption broken there 2025-08-07 14:03:39 what i'm confused by is, how did this start happening with the upgrade? 
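For the other end of that tunnel, a rough sketch; it assumes k0s runs konnectivity-server as a plain host process on each controller (the process name and the idea of counting its established agent connections are assumptions, not taken from the log):

    # on a controller host: is the proxy server up, and does it hold agent connections?
    ps ax | grep '[k]onnectivity'
    ss -tnp | grep konnectivity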
2025-08-07 14:07:06 https://termbin.com/1ust that's the entirety of the host nftables in a cluster _without_ dmvpn, maybe it's different with it - don't know 2025-08-07 14:07:13 i gotta head to sleep in a bit 2025-08-07 14:08:05 dmvpn does not set any firewall rules itself 2025-08-07 14:08:10 It's pure routing 2025-08-07 14:08:25 okay 2025-08-07 14:13:45 so if a container netns wants to talk to a dmvpn address of a different node... the host on which that container is running is expected to forward that traffic based on its routing tables 2025-08-07 14:14:29 yes 2025-08-07 14:14:37 and then the target node is supposed to be able to route the return traffic symmetrically 2025-08-07 14:15:37 based on what we've seen the node->pod ip traffic is relying on iptables instead of routing tables 2025-08-07 14:16:28 can you curl a pod ip address from a host different than one it is running on 2025-08-07 14:17:00 it'll be ECONNREFUSED anyway but 2025-08-07 14:17:10 if it times out we know something is wrong 2025-08-07 14:18:20 I can ping it 2025-08-07 14:18:58 curl: (7) Failed to connect to 10.244.157.149 port 80 after 1 ms: Could not connect to server 2025-08-07 14:19:00 curl 2025-08-07 14:19:09 No timeout 2025-08-07 14:19:16 so host->pod on different node routes fine 2025-08-07 14:19:34 but the other direction seems not to? 2025-08-07 14:20:24 from the netns of that pod, try tcp 10250 of the node that you just performed the experiment on 2025-08-07 14:21:20 i'm expecting this to fail, since that's supposed to be the problem we saw, but just sanity checking 2025-08-07 14:23:39 curl https://172.16.250.10:10250 -k -> 404 2025-08-07 14:23:47 from c2 -> c0 2025-08-07 14:23:54 namespace of the konnectivity-agent pod 2025-08-07 14:24:26 well now i'm confused, that means everything works in both directions 2025-08-07 14:24:32 curl https://72.235.190.129:6443 2025-08-07 14:24:34 this times out 2025-08-07 14:24:39 wait 2025-08-07 14:24:41 wrong ip 2025-08-07 14:24:55 ok, so that works as well 2025-08-07 14:25:01 curl https://172.235.190.129:6443 -k 2025-08-07 14:26:00 hmm... ah but i might have been barking up the wrong tree the whole time here. the apiserver process is running in the host ns 2025-08-07 14:26:24 so... what doesn't seem to work is the apiserver->kubelet (different node) traffic, right? 2025-08-07 14:26:39 as in, can't get logs of pod on different node because :10250 not reached 2025-08-07 14:27:14 I think 10250 can be reached, but it responds that it cannot access the agent? 2025-08-07 14:27:26 right, i guess i got confused there as well 2025-08-07 14:27:30 https://docs.k0sproject.io/v1.21.0+k0s.0/networking/ 2025-08-07 14:27:45 i don't really know what the konnectivity agent is supposed to be doing :) 2025-08-07 14:28:07 It facilitates controller <-> worker communication 2025-08-07 14:28:40 "The Konnectivity service provides a TCP level proxy for the control plane to cluster communication." 2025-08-07 14:29:00 I think the agent is supposed to connect back to the control plane 2025-08-07 14:29:13 and then that connection is used to proxy requests 2025-08-07 14:30:00 that sounds wholly unnecessary in this scenario of two nodes who have routable ip addresses between them just trying to connect to a http api on 10250 on each other 2025-08-07 14:30:36 https://docs.k0sproject.io/stable/networking/#controller-worker-communication 2025-08-07 14:31:34 i see 2025-08-07 14:40:53 yeah, i have no idea what's going on. 
gotta sleep on it and try to figure out konnectivity later 2025-08-07 14:41:14 Trying to see if I can get help from the k0s project 2025-08-07 14:42:16 based on the upstream k8s docs https://kubernetes.io/docs/tasks/extend-kubernetes/setup-konnectivity/ the agents are supposed to connect to a proxy server running on one of the cplane nodes, but i couldn't see any "server" pod for konnectivity in the k0s setup, just agents 2025-08-07 14:42:52 I think the server runs directly on the host 2025-08-07 14:42:57 (that might just be because the server part is provided by each control plane node and not actually shown as a pod, since that's how k0s generally operates) 2025-08-07 14:42:59 yes 2025-08-07 14:43:56 and there's the error string, but i didn't dig into potential causes https://github.com/search?q=org%3Ak0sproject+%22no+agent+available%22&type=code 2025-08-07 14:44:21 It seems to run on all control-plane nodes btw 2025-08-07 14:44:30 anyway... good luck, and later :) 2025-08-07 14:44:42 Thanks for your help 2025-08-07 15:23:58 lotheac: ok, really strange. I added externalAddress now again, and now everything seems to be working 🤨 2025-08-07 15:24:14 (except for konnecitivity-agent on one of the worker nodes 2025-08-07 15:24:16 ) 2025-08-07 18:20:14 Interesting DNS was not working properly. Apparently one of the dns pods was located on one of the workers 2025-08-07 18:20:48 After adding a nodeAffinity for the control plane nodes, dns is working a lot better 2025-08-07 19:21:29 The question then remains why requests to the worker node were not working 2025-08-08 00:00:35 hmm :| 2025-08-08 00:00:57 that sounds weird, the dns pods should work just as well no matter where they are scheduled 2025-08-08 00:01:13 which makes me think there is something still wrong with the network 2025-08-08 13:57:22 just noticed that on the beginning of a x86 ci log:  on shared-runner x86.ci.alpinelinuix.org (x86) -GMdM8s8, system ID: r_kvex70My1Hwh 2025-08-08 13:57:28 the hostname is probably a typo 2025-08-08 17:14:25 fyi: I broke my bananapi bpi-f3 with kernel update, so now one riscv64 runner is down 2025-08-08 17:36:40 and its back 2025-08-08 17:56:29 ok, im testing boot the linux-spacemit kernel from sdcard. I now get this error when booting: 2025-08-08 17:56:46 Retrieving file: /boot/initramfs-spacemit 2025-08-08 17:56:46 Retrieving file: /boot/vmlinuz-spacemit 2025-08-08 17:56:46 append: earlycon=sbi rw root=UUID=16878291-386c-4fd9-a6b0-2d54f341864b rootfstype=ext4 rootwait console=ttyS0,115200 clk_ignore_unused swiotlb=65536 2025-08-08 17:56:46 kernel_comp_addr_r or kernel_comp_size is not provided! 2025-08-08 17:56:46 Retrieving file: /boot/dtbs-spacemit/spacemit/k1-bananapi-f3.dtb 2025-08-11 07:34:49 Gitlab has been upgraded to 18.1 2025-08-11 07:51:35 ERROR: Job failed: failed to pull image "registry.alpinelinux.org/alpine/infra/docker/alpine-gitlab-ci:latest" with specified policies [always]: Error response from daemon: unknown: 404 page not found (manager.go:250:0s) 2025-08-11 09:07:14 can someone please ping me when ^ is fixed, so I can re-run CI for !88539 and !88540 2025-08-11 09:08:07 Or just someone else with permission can re-run them, I don't mind 2025-08-11 09:24:47 wait a while. 
GitLab was just upgraded to 18.1 and seems unstable—the manual trigger is still on 17.2.1 2025-08-11 09:26:03 Ok, need to run migrations for the registry 2025-08-11 09:28:38 Kladky: should work again 2025-08-11 09:29:29 Thanks 2025-08-12 16:50:21 durrendal: I've used your draft for deploying alpine-mirror-sync and created a playbook with several roles (and some tweaks): https://gitlab.alpinelinux.org/alpine/infra/ansible-playbooks/-/merge_requests/1 2025-08-12 17:44:25 ikke: these look like fantastic adjustments! I'm glad it worked and you were able to so easily extend it. 2025-08-12 17:45:14 I've used this to deploy 2 new servers :) 2025-08-12 17:45:22 https://ltu-t1-1.alpinelinux.org/ 2025-08-12 17:45:27 They're syncing now 2025-08-12 18:03:46 :D love it! It feels really good to have contributed to that! 2025-08-12 18:05:00 Super curious how long it generally takes to fully sync, if you know roughly. I only did dry runs when testing, didn't have the disk space to sync everything 2025-08-12 18:05:21 Really depends on the bandwidth, but roughly one day 2025-08-12 18:12:45 Yeah it's a super fuzzy question, but a day is better than I thought. 2025-08-12 18:13:11 Are the two new mirrors part of the backbone that feeds into the CDN? 2025-08-12 18:13:33 Yeah, they will be 2025-08-12 21:23:41 durrendal: i think last time when i synced it took much longer than a day 2025-08-12 21:23:49 also depending from where you sync ofc 2025-08-12 21:23:58 our master is not that speedy 2025-08-13 09:47:22 I wonder if we want another slightly better CI runner for riscv64? 2025-08-13 09:47:32 I'm thinking of https://pine64.com/product/alpha-one-7b-llm-agentic-generative-ai-agent-eu-version/ 2025-08-13 09:48:42 similar hardware as hifive premiere p550. The CPU is one of the faster ones on the market currently 2025-08-13 09:48:49 and it has 32G ram 2025-08-13 09:50:19 It has passive cooling so I could have it under my desk. 2025-08-13 09:52:33 ncopa: I think having faster HW for rv64 would be hugely beneficial 2025-08-13 09:52:44 You could expense it on cc 2025-08-13 09:52:50 oc* 2025-08-13 09:58:35 Alright, lets do it 2025-08-13 09:59:41 Do we know if we can already boot alpine on it? 2025-08-13 10:03:01 no, but I'm fairly confident we can make it work 2025-08-13 10:03:16 I have a hifive premiere which boots alpine 2025-08-13 10:03:58 the reason I dont want use the hifive premiere as CI is that it has fans, so its a bit noisy 2025-08-13 10:04:07 so I have it powered of most of the time 2025-08-13 10:04:31 and the hifive permiere has only 16G ram 2025-08-13 10:04:48 it was not difficult to get alpine working on it 2025-08-13 10:07:17 I have ordered it 2025-08-13 10:07:43 Nice 2025-08-13 10:21:05 I have spent a few days on a kernel that works both on orangepi rv2 and bananapi bpi-f3 2025-08-13 10:21:08 still not working 2025-08-13 10:21:54 but I found another linux distro which actually works, which use exactly the same kernel sources, so I know it is possible make a kernel config that works (for orangepi rv2 at least) 2025-08-13 10:22:57 I'm setting up the new t1 infra 2025-08-13 10:23:10 Part of it is also updating the fastly config 2025-08-13 10:23:15 I can also check the change you've made 2025-08-13 10:24:38 the change should be ready to go 2025-08-13 10:25:31 i think it can be applied as is. 
it only needs to be done at a time where we have time to monitor that nothing breaks and have time to revert if needed 2025-08-13 10:25:53 I have time now :) 2025-08-13 10:26:04 then i think you can just push it 2025-08-13 10:27:34 What does the change entail? Is it the redirect from dl-cdn.*/alkpine to cdn/ ? 2025-08-13 10:28:22 IIRC the only thing it does is make cdn.a.o/ and dl-cdn.a.o/alpine reuse same cache objects 2025-08-13 10:28:30 ah 2025-08-13 10:30:56 Ok, so it sets the cache hash for cdn to match the urls for dl-cdn, right? 2025-08-13 10:31:03 yeah 2025-08-13 10:32:21 looking at the diff now 2025-08-13 10:32:31 i dont know why i didnt apply it myself earlier 2025-08-13 10:32:35 its a low risk change 2025-08-13 10:33:01 and only affect if (req.http.host == "cdn.alpinelinux.org") 2025-08-13 10:33:06 yup 2025-08-13 10:33:46 Ok, keeping an eye on the hit ratio 2025-08-13 10:33:50 So we could push it 2025-08-13 10:35:01 Want me to push it? 2025-08-13 10:38:52 done 2025-08-13 10:52:33 thanks! 2025-08-13 11:06:02 ncopa: Not so sure if it works 2025-08-13 11:07:28 I constantly get x-cache: MISS, MISS back for cdn, while getting HIT, HIT back from dl-cdn 2025-08-13 11:57:12 hum 2025-08-14 09:22:37 I have switched rsync.a.o to point to rsync.geo.a.o 2025-08-14 09:23:05 checking how it goes 2025-08-14 09:23:22 geodns is not perfect, so it might cause bandwidth issues, and we still have nothing in asia 2025-08-14 09:36:17 https://imgur.com/a/JH1bUDm :-) 2025-08-14 10:41:45 MIrrors seem to be still updating 2025-08-14 13:00:15 why nothing in asia? lack of organizations willing to sponsor infra there? 2025-08-14 13:00:48 lotheac: we did have some offers, but nothing delivered yet 2025-08-14 13:00:56 ok, cool 2025-08-14 13:03:02 i have some (limited) contact with some technical people in some japanese companies... i __might__ be able to ask some key persons nicely at some future meetups 2025-08-14 13:03:24 but nothing definitive to be sure 2025-08-14 13:04:24 and i imagine not necessary if you already have offers :) 2025-08-14 13:05:01 I think something in Japan would be nice of possible 2025-08-14 13:05:53 alright, i'll keep an eye out :) 2025-08-14 13:09:35 i could host from my apartment, but no public ip's so that's a no-go :p 2025-08-14 13:10:17 and i guess the isp would not appreciate it :) 2025-08-14 13:10:42 hehe, no, I would suppose not :p 2025-08-14 13:16:17 I do see we have at least one offer for JPN, might see if that's still available 2025-08-14 13:19:01 Storage is too low 2025-08-15 10:43:23 ncopa: would have time to verify https://manage.fastly.com/configure/services/4ZVgv5JlGHsuH7lfiWfzy/diff/93,94? 2025-08-15 10:43:27 Adding new backends 2025-08-15 14:37:32 I have activated it, and made another change to loadbalance accross all origins, but it appears it still sends all traffic to the first origin :( 2025-08-15 14:38:38 Ok, only now it says all origins are activated 2025-08-15 14:39:21 But still no requests.. 2025-08-15 14:40:49 Maybe I have to be more patient 2025-08-15 14:58:29 Ok, found it. There was some custom config that set the default backend, which overrides the loadbalancing 2025-08-15 21:33:10 ikke: all working ok? 
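In case it helps with the hit-ratio checks above, a small sketch for comparing the two hostnames; the package path is only an example, and Fastly returns extra digest headers when the Fastly-Debug request header is set (the debug header mentioned further down):

    # fetch the same object via both hostnames and compare X-Cache and the digest
    curl -sI -H 'Fastly-Debug: 1' https://dl-cdn.alpinelinux.org/alpine/edge/main/x86_64/APKINDEX.tar.gz | grep -iE 'x-cache|x-served-by|digest'
    curl -sI -H 'Fastly-Debug: 1' https://cdn.alpinelinux.org/alpine/edge/main/x86_64/APKINDEX.tar.gz | grep -iE 'x-cache|x-served-by|digest'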
2025-08-15 21:33:24 clandmeter: yeah, looks alright 2025-08-15 21:51:44 The old t1 servers now serve less than 0.5Gbps 2025-08-18 06:21:18 ncopa: I found out that cdn.a.o/alpine/* and dl-cdn.a.o/alpine/* do share the same cache 2025-08-18 06:21:30 it's only when we leave out /alpine that it's no longer cached 2025-08-18 06:21:42 So something with the logic to detect that /alpine is present is not working 2025-08-18 07:20:43 aha 2025-08-18 07:23:02 You can enable a debug header, and then you can at least see the hash digest 2025-08-18 07:23:07 sadly not the contents 2025-08-18 07:23:17 But that let me confirm this 2025-08-20 09:36:13 oof, gitlab is getting hammered 2025-08-20 09:36:33 urgh 2025-08-20 18:32:31 ,, 2025-08-21 07:47:33 whats up with the ppc64le CI runner? 2025-08-21 07:48:08 some very long running jobs 2025-08-21 07:48:14 It's catching up now 2025-08-21 07:48:21 ah alright nice 2025-08-21 10:26:45 I've removed t1.alpinelinux.org as origin from (dl-)cdn.alpinelinux.org 2025-08-21 12:28:31 ikke: i figured the other day that i need to set up a similar dmvpn test env as you have on your end, since i was unable to hit the same issues otherwise... but i'm puzzled about dmvpn despite trying multiple times to understand the alpine wiki page about it. i imagine it's because i lack any context about the cisco thing that i'm having a hard time grasping it. could i bother you to explain how the alpine dmvpn works? 2025-08-21 12:29:23 lotheac: To be honest, I don't know exactly all the details, just the high-level idea 2025-08-21 12:29:41 the high level details would help a lot :) 2025-08-21 12:31:33 So initially it looks like a hub-and-spoke architecture. You have one or more hubs, where each spoke connects to 2025-08-21 12:31:58 But once a spoke wants to connect to another spoke, it can dynamically setup a direct tunnel between the 2 2025-08-21 12:32:09 So it's a mesh network 2025-08-21 12:33:18 You have a single GRE interface, but it uses multipoint GRE to connect to multiple endpoints via single interface 2025-08-21 12:33:23 so a hub in this nomenclature is the server running opennhrp? 2025-08-21 12:33:41 or do spokes run that too? 2025-08-21 12:33:54 spokes run that as well 2025-08-21 12:34:17 and i need to set up the spoke<->hub gre tunnels myself? 2025-08-21 12:35:10 You have the dmvpn-ca tool that maintains a database of all the sites, vpncs, subnets, certificates and other things 2025-08-21 12:35:26 Once fully configured, you generate a certificate for a hub or spoke that contains all the information 2025-08-21 12:35:43 then you provide that certificate to setup-dmvpn, which makes sure everything is configured, including the gre interface 2025-08-21 12:35:54 i think i might be reading the wrong documentation :) no mention of dmvpn-ca on https://wiki.alpinelinux.org/wiki/Dynamic_Multipoint_VPN_(DMVPN) 2025-08-21 12:36:08 or setup-dmvpn either... 2025-08-21 12:36:52 https://gitlab.alpinelinux.org/alpine/dmvpn-tools i guess i should be reading this instead 2025-08-21 12:36:54 https://gitlab.alpinelinux.org/alpine/dmvpn-tools 2025-08-21 12:36:56 yeah 2025-08-21 12:37:35 i humbly suggest making the aforementioned wiki page nothing but a link to the repo :p 2025-08-21 12:38:58 (if only because it shows up in web search results) 2025-08-21 12:39:31 anyway, thanks! 
this helps a bunch 2025-08-21 12:42:40 I've added an "obsolete" marker 2025-08-21 12:42:58 thanks 2025-08-21 12:46:10 Once setup, you can use `vtysh` to get a router-like interface and see the various protocols 2025-08-21 14:03:17 got my alpha-one (riscv64) machine. it boos a debian based RockOS(?) something. Need to find out where they got the kernel sources from 2025-08-21 14:12:32 looks like this should work: https://github.com/jmontleon/linux-rockos 2025-08-21 14:12:38 its only 6.6 kernel though 2025-08-21 17:11:42 Oops, I've upgraded che-bld-1 to alpine 3.22, but it fails to boot the kernel 2025-08-21 17:11:52 "error: invalid magic number" 2025-08-21 17:12:52 https://gitlab.alpinelinux.org/alpine/aports/-/issues/15263 2025-08-21 18:18:57 :( 2025-08-21 19:30:19 ugh 2025-08-21 19:30:42 symbol grub_is_shim_lock_enabled not found :| 2025-08-21 19:46:05 ok, fixed 2025-08-22 19:09:51 ikke: do you know who I would ping about enabling AWS AMI generation for AWS' ca-west-1 region? 2025-08-22 19:21:21 tomalok in #alpine-cloud, but note that they're personally paying the bills for those images 2025-08-22 19:22:24 good background information, probably belongs in the repo somewhere so people know that AMIs might not be available intentionally 2025-08-23 05:01:47 it's mcrute who's been footing the AWS bill for image storage. i'm just the one that's trying to make sure we don't break his bank... ;) i think in the ideal situation there'd be some way individuals could sponsor having images in certain regions (and/or in other clouds' regions) but there's a bit of work to make a system like that and ensure 2025-08-23 05:01:47 trust that the images are truly official images. the other ideal would be to convince the cloud providers to make Alpine images available in the same way that they do for Debian, etc... 2025-08-23 08:10:34 tomalok: I definitely appreciate all the optimization you've done, the costs are pretty sustainable at the moment and I'm definitely fine with full region coverage 2025-08-23 08:12:05 I was debating building an account importer so that people can host their own based on your golden images. I've just mostly been really busy with other stuff and haven't had the time lately. That being said, it's bugged me for a while that we can't host all images indefinitely due to the cost and I want to make a way that people can easily have their own stability guarantee. 2025-08-23 16:02:43 looks like #235 (ngPTQbXZ) x86-64.ci.alpinelinux.org may be running low on disk space? 2025-08-23 16:02:59 https://gitlab.alpinelinux.org/alpine/aports/-/jobs/1983747#L1700 2025-08-23 16:10:20 It still had 20-50G, that was not enough? 2025-08-23 16:10:56 (found that crond was not running on that server, so the daily cleanup did not happen) 2025-08-23 16:13:04 okay, thanks :) 2025-08-23 16:14:48 to the question, maybe not, it's fine on other arches 2025-08-24 10:55:07 An ansible-playbook feature for the grabs: https://gitlab.alpinelinux.org/alpine/infra/ansible-playbooks/-/issues/1 2025-08-24 11:03:46 And another one: https://gitlab.alpinelinux.org/alpine/infra/ansible-playbooks/-/issues/2 2025-08-24 11:04:19 usrhere: raspbeguy: durrendal: ^ 2025-08-24 11:07:44 I'm not familiar with awall. However the second one should be easy I guess. 2025-08-24 11:09:58 The awall one is mostly symlinking policies 2025-08-24 11:10:07 I already saw there's an ansible module for it 2025-08-24 11:19:01 I'll have a look on it. 
For the time being I'm busy with my newborn daughter so not very active on computer things 2025-08-24 11:22:17 congrats, and no worries 2025-08-24 11:30:10 Thx 2025-08-24 11:35:39 Is it me, or is 2a13:9a40::/32 no longer routable? 2025-08-24 11:41:51 2 total prefixes (0 IPv6, 2 IPv4) 2025-08-24 14:01:22 hi there, whenever I fetch an APKINDEX from cache-ams2100141-AMS, cache-fra-etou8220160-FRA I just never get a reply and the connection is stuck, any other CDN node and it works fine 2025-08-24 14:03:20 adrian: Since when does this happen, do you know? 2025-08-24 14:03:24 ipv4 or ipv6? 2025-08-24 14:05:01 ipv4 and it first happened 5 days ago, though that was when i set up this host so idk if it already occured earlier 2025-08-24 14:06:23 From what region are you connecting? 2025-08-24 14:07:43 I'm connecting from Aachen, Germany via the network of the RWTH Aachen / DFN 2025-08-24 14:08:22 Is it hanging on TCP or HTTP(S)? 2025-08-24 14:10:01 HTTPS, it connects and sends off the entire request and then just doesn't get anything back 2025-08-24 14:10:23 https://mystb.in/21d37967addf27bfdc 2025-08-24 14:16:48 It's strange it only affects a single POP 2025-08-24 14:17:50 We use shielding, which means the request first goes to a specific POP before it reaches our serves. So in this case fra <-> ams <-> origin 2025-08-24 14:19:08 when i use http, not https, it goes through the same ams server but a different one in fra and works ok 2025-08-24 14:19:32 Interestingly enough, for me it almost always goes via London 2025-08-24 14:39:20 adrian: do you get routed often through fra, or occasionally? 2025-08-24 14:40:07 Hmm, just hit x-served-by: cache-ams21048-AMS, cache-fra-etou8220129-FRA and it wen through 2025-08-24 14:40:19 (a different pop in fra, but still) 2025-08-24 14:40:51 i always get routed through fra, i literally cannot make a single request via https to the repo :/ 2025-08-24 14:41:41 adrian: is it any POP in fra, or just that specific one? 2025-08-24 14:42:12 i always get that specific POP, when i use plain HTTP i get a different one and that one works 2025-08-24 14:42:42 hmm actually no 2025-08-24 14:43:04 ikke: I can probably take both of those for you, especially since they're related. 2025-08-24 14:43:07 i get different ones over HTTPS as well, but none of them seem tow ork 2025-08-24 14:44:00 adrian: I don't expect it, but does it make a difference if you disable http/2.0? 2025-08-24 14:44:40 adding --http1.1 to the curl invocation does not change anything, i still only see response headers and no body 2025-08-24 14:45:30 Does it work if you for example use nld-t1-1.alpinelinux.org instead of dl-cdn? 2025-08-24 14:47:04 yes, that host works 2025-08-24 14:47:13 and nld-t1-2.a.o? 2025-08-24 14:47:29 that one also works 2025-08-24 14:53:16 Just curious, if you do not get a response back, how do you get the X-Served-By headers? 2025-08-24 14:53:51 (The paste you provided does not include them) 2025-08-24 14:55:41 oh weird, i usually get the response headers 2025-08-24 14:56:08 you're right, sometimes I get them, sometimes I don't 2025-08-24 14:56:11 ok 2025-08-24 14:56:42 adrian: Can you do the request with `X-Fastly-Debug: 1`? 2025-08-24 14:56:49 See what information that returns? 2025-08-24 14:57:41 https://mystb.in/581b9806d2fd3b89fb 2025-08-24 14:58:18 Oh, sorry, without the X- prefix 2025-08-24 14:59:09 https://mystb.in/7d16e22e1ea3612d1b 2025-08-24 15:01:09 Can you just to be sure also verify with usa-t1-1, usa-t1-2, and ltu-t1-1 as hosts? 
(Just need to know if they work or not) 2025-08-24 15:01:47 those 3 all also work 2025-08-24 15:02:10 ok, good 2025-08-24 15:02:45 Oh, and nld-t1-2, last one 2025-08-24 15:03:02 Oh, you already tried that 2025-08-24 15:24:54 adrian: I've opened a case with fastly 2025-08-24 15:25:32 thanks, let me know if you need more info - i'll be away for a few hours now 2025-08-25 10:28:18 adrian: Are you able to get a packet dump for the request? 2025-08-25 15:33:16 load average: 100.29, 99.76, 85.59 2025-08-25 15:33:44 Oof. LLM scrapers? 2025-08-25 15:43:08 f_: Not sure, possibly triggered by some scraping 2025-08-26 01:31:40 ikke: here you go, https://mystb.in/faac33cb7556158c84 or is curl --trace not enough? 2025-08-26 01:42:17 https://files.postmarketos.cloud/curl.pcap here's a full packet dump in case that helps 2025-08-26 08:46:55 Hello everyone ! o/ 2025-08-26 08:46:57 New contributor to Alpine here. I'm struggling with the registration in gitlab.alpinelinux.org 2025-08-26 08:46:59 I never receive the confirmation email when creating an account. 2025-08-26 08:47:01 My checklist so far: 2025-08-26 08:47:03 1. the mailbox works perfectly fine for other emails 2025-08-26 08:47:05 2. I whitelisted gitlab@gitlab.alpinelinux.org and gitlab@alpinelinux.org to circumvent spam filtering 2025-08-26 08:47:07 3. tried re-sending confirmation emails 24h later 2025-08-26 08:47:09 4. tried registering with an email from a different provider 2025-08-26 08:47:11 None of that worked. Did I miss something, or am I right to suspect there's an issue on the gitlab's side ? 2025-08-26 09:01:37 samaingw: I can check it later, but can you (privately if you want) share the username and email address? 2025-08-26 09:03:41 username: samaingw 2025-08-26 09:03:43 email: samain.gwen@laposte.net 2025-08-26 09:04:19 Thank you for your help ;) 2025-08-26 10:33:56 samaingw: It seems laposte.net is refusing the email. Searching around for LPN007_510, it sounds like they could block emails just because they contain URLs 2025-08-26 10:49:50 ikke: thank you again for your help ! I'll get in touch with the provider to sort things out 2025-08-26 14:16:20 how do i get access to dev.a.o? i'll likely have to upgrade mplayer and the snapshots are uploaded at dev.a.o 2025-08-26 15:42:15 achill: thats a manual process. ping me tomorrow if you dont get help today/tonight 2025-08-26 15:42:40 thanks 2025-08-26 19:22:30 I've created a user for achill 2025-08-27 05:42:05 adrian: Can it be there is something in your network causing the issues? The techs from fastly are unable to reproduce the issue. The pcap seems to be incomplete. I see a FIN,ACK packet, but no FIN packet for example. 2025-08-27 08:35:41 In the past I have seen that fastly drops PMTU packets, so if the MTU is lower somewhere in the path, it may drop some packets 2025-08-27 16:24:46 oh lol just noticed the "latest development" commit list on a.o hasnt been updated since 2025-07-26 2025-08-27 16:32:42 Lazy devs :p 2025-08-27 16:42:15 fixed 2025-08-27 16:42:29 merci 2025-08-27 17:14:33 adrian: ncopa It does not appear to be an MTU issue. It seems to be able to transfer larger amounts of data from both sides 2025-08-27 17:17:19 ncopa: I've moved your x86_64 container to deu-t1-1.a.o. It's reachable via ncopa-edge-x86_64.deu-dev-1.alpin.pw 2025-08-27 17:39:22 cely: your container is moved as well, available via celeste-edge-x86_64.deu-dev-1.alpin.pw 2025-08-27 18:47:15 nmeum: Your x86* containers have been moved as well 2025-08-27 20:33:17 thanks! 2025-08-28 06:31:01 good morning! 
seems like gitlab is strugling 2025-08-28 06:31:13 yeah poor gitlab.. 2025-08-28 06:32:49 Restarting it now. Need to look into what's causing this 2025-08-28 10:31:47 deu2-dev1 is very unresponsive now 2025-08-28 10:33:14 Checking 2025-08-28 10:38:36 i see that it is running out of memory 2025-08-28 10:38:41 mem usage is at top 2025-08-28 10:38:44 and cpu too 2025-08-28 10:39:21 cpu usage may be related memory usage. if running out of mem go apps will run slow 2025-08-28 10:39:53 there were lots of git processes 2025-08-28 10:40:47 yes 2025-08-28 10:41:01 seems like you stopped the gitaly container 2025-08-28 10:41:08 correct 2025-08-28 10:41:51 My suspicion is storage is not holding up, causing these processes to take a long time, and new processes keep being spawned, causing more load 2025-08-28 10:42:01 nod 2025-08-28 10:42:14 where can i find the gitaly config? 2025-08-28 10:42:25 i think there is a knob for concurrent git processes 2025-08-28 10:42:41 /srv/compose/gitlab/storage/config/gitaly 2025-08-28 10:42:48 Yes, there is 2025-08-28 10:43:01 But, the documentation comes with a warning that it should be used with caution 2025-08-28 10:43:37 Maybe it's warrented in our case 2025-08-28 10:43:42 "Enabling limits on your environment should be done with caution and only in select circumstances, such as to protect against unexpected traffic. When reached, limits do result in disconnects that negatively impact users. For consistent and stable performance, you should first explore other options such as adjusting node specifications, and reviewing large repositories or 2025-08-28 10:43:44 workloads." 2025-08-28 10:44:52 i have some memory that I have seen a knob in the gitlab admin gui for concurrency 2025-08-28 10:44:58 Yes 2025-08-28 10:45:06 Or more about timeouts 2025-08-28 10:45:18 Starting the containers again 2025-08-28 10:50:45 maybe "Raw blob request rate limit per minute" 2025-08-28 10:51:15 https://gitlab.alpinelinux.org/admin/application_settings/network#js-git-lfs-limits-settings 2025-08-28 10:51:27 Authenticated Git LFS request rate limit 2025-08-28 10:51:27 Enable authenticated Git LFS request rate limit 2025-08-28 10:51:27 Helps reduce request volume (for example, from crawlers or abusive bots) 2025-08-28 10:51:35 We do not use LFS 2025-08-28 10:52:15 ok 2025-08-28 10:52:29 and git http rate limits? 2025-08-28 10:54:21 Trying first to find out what is causing these processes 2025-08-28 10:56:16 https://runbooks.gitlab.com/gitaly/git-high-cpu-and-memory-usage/ 2025-08-28 10:56:59 At the moment quite some upload-pack processes 2025-08-28 11:03:37 we have prometheus monitoring set up. do we have graphs somewhere from the collected data? 2025-08-28 11:05:57 maybe I should ask for advice in #gitlab 2025-08-28 11:06:03 on libera 2025-08-28 11:07:38 The channel is quite quiet 2025-08-28 11:07:44 I'm lurking there 2025-08-28 11:08:00 i noticed. you are the one responding to peoples questions :) 2025-08-28 11:08:44 :-) 2025-08-28 11:10:25 ncopa: the metrics endpoint is enabled, but when I check the endpoint, it says it's disabled and to enable it where it's already enabled 2025-08-28 11:11:03 Apparently I need to set prometheus_multiproc_dir 2025-08-28 11:15:34 you should mention that it's mainly git processes that seem to be causing the load 2025-08-28 11:28:46 asking chatgpt, which is more reponsive than #gitlab 2025-08-28 11:28:51 Traefik + go-away show massive network, so you’re getting hammered and Git (pack generation) is doing most of the work. 
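On the question of what is spawning those git processes, a rough sketch for inspecting them on the host; it assumes a procps-style ps (busybox ps has fewer options) and the grep pattern is only approximate:

    # list the busiest git child processes and their parents
    ps -eo pid,ppid,etime,pcpu,args | grep '[g]it ' | sort -k4 -rn | head -20
    # inspect the environment of one such process (replace <pid>) to see what request spawned it
    tr '\0' '\n' < /proc/<pid>/environ | grep -i correlation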
2025-08-28 11:34:10 we dont use fastly or cloudflare in front of our gitlab instance, right? 2025-08-28 11:52:14 no 2025-08-28 11:52:22 not sure that would help a lot 2025-08-28 12:01:18 we are getting hammered and I think fastly has DDoS protections 2025-08-28 12:01:33 and I believe they have protections for AI scrapers as well 2025-08-28 12:01:40 at least cloudflare has 2025-08-28 12:03:14 We should first identify the source, for example correlate specific requests to these git processes 2025-08-28 12:04:42 There's a lot of pack-object processes, I don't think that's due to normal web traffic 2025-08-28 12:18:03 I see lots of git_upload_pack requests for random IPs coming from a hetzner IP 2025-08-28 12:23:07 ncopa: The long running processes seem to be indeed triggered by clone / fetch requests over https 2025-08-28 12:23:12 so not normal web requests 2025-08-28 12:39:21 ncopa: I'm keeping an eye on tail -f -n100 log/gitlab/production_json.log | jq 'select(.action == "git_upload_pack") | {path: .path, remote_ip: .remote_ip, time: .time, correlation_id: .correlation_id}' 2025-08-28 12:39:37 if you look at the environment of the git process, you'll see a CORRELATION_ID that matches 2025-08-28 13:51:59 ncopa: git 2.49.0 apparently has some faster packing generation, but we are already running that version 2025-08-28 13:52:15 although, it may be opt-in 2025-08-28 14:32:29 I've blocked anonymous cloning from Hetzner IPs, that seems to have reduced the amount of clone operations 2025-08-28 14:32:43 now it's mostly cat-file processes, but the load is managable 2025-08-28 16:12:51 thanks for taking care of it 2025-08-28 16:13:14 as I understand, the pack-object processes came from many git clones? 2025-08-28 16:14:27 either git clone or git fetch 2025-08-28 16:17:04 ftr, it's these processes that are causing load. pack-object can be the result of other operations as well 2025-08-28 16:23:49 https://gitlab.alpinelinux.org/jarlungoodoo73 I wonder why this user created non-alpine repos 2025-08-28 16:24:19 Users do that some times 2025-08-28 16:25:43 https://gitlab.alpinelinux.org/Samg217 and this 2025-08-28 16:25:59 i wonder if they are trying to abuse the CI or something 2025-08-28 16:27:01 this one is almost 7GB https://gitlab.alpinelinux.org/admin/projects/jarlungoodoo73/docs-content 2025-08-28 16:30:00 sometimes they do 2025-08-28 16:30:03 abuse CI 2025-08-28 16:30:07 but haven't seen it recently 2025-08-28 16:30:58 ncopa: oof, that's bad 2025-08-28 16:31:03 crazy question, but does Alpine really need to run their own gitlab instance? I'm sure moving to github/gitlab.com has been discussed before, but maybe things have changed since the last discussion? 2025-08-28 16:31:21 iggy: pmos moved away from gitlab.com even 2025-08-28 16:31:36 We did use github before but chose to move away from that as well 2025-08-28 16:31:56 what was the reason that pmos moved away from gitlab.com? 2025-08-28 16:32:15 well, AI bots aren't going away in our lifetime, so just figured I'd throw that idea out there 2025-08-28 16:32:21 https://postmarketos.org/blog/2024/10/14/gitlab-migration/ 2025-08-28 16:32:43 originally why we chose to run our own instance was that we wanted all our infra run on alpine. eg dog fooding 2025-08-28 16:33:23 And run on open source as much as possible 2025-08-28 16:33:55 "Nowadays, users of gitlab.com are required to provide a valid phone number and credit card information when setting up an account." 
2025-08-28 16:33:57 some of the reasons pmos mention for moving away are exactly options you're mentioning to fight the bots (i.e. CDN bot protection) 2025-08-28 16:35:11 anywho, didn't mean to distract too much, it just seems like you guys are spending a lot of time on the boring parts of infrastructure these days 2025-08-28 16:35:23 i mentioned our problem to my coworkers (k0s ppl) and the first thing they asked: do you use cloudflare in front to protect against DDoS? 2025-08-28 16:35:39 iggy: I think those are valid questions 2025-08-28 16:35:54 ncopa: I don't think CloudFlare would consider this a DDoS 2025-08-28 16:36:10 maybe not 2025-08-28 16:36:26 but they have protections against AI bots 2025-08-28 16:36:33 what we do with go-away 2025-08-28 16:36:36 does CF ddos protection from bot protection? I thought they just had "protection" 2025-08-28 16:36:37 This seems to be just git clients 2025-08-28 16:36:45 aha 2025-08-28 16:37:24 *does CF distinguish ddos protection... 2025-08-28 16:37:31 CF has some AI scraper protection 2025-08-28 16:37:41 and Fastly probably has too nowadays 2025-08-28 16:37:52 at least they write blog articles about it 2025-08-28 16:37:57 The way I implemented go-away is that normal users should notice very little about it 2025-08-28 16:38:10 CF means that every user has to solve a 'captcha' 2025-08-28 16:38:11 i think it works fairly well 2025-08-28 16:38:22 captcha is annoying 2025-08-28 16:39:19 i would be more than ok to outsource the gitlab instance 2025-08-28 16:39:59 but it does not look like we can 2025-08-28 16:40:15 other question is if we can do something with our setup 2025-08-28 16:40:52 maybe we can move gitaly to a dedicated server? 2025-08-28 16:40:59 Perhaps 2025-08-28 16:41:23 or move postgres to dedicated server (if that even makes sense) 2025-08-28 16:41:58 or move it to a bigger server 2025-08-28 16:42:23 or tweak settings (I'm trying to find where gitaly has the pack cache?) 2025-08-28 16:42:28 It does make sense to split things up, but it adds complexity 2025-08-28 16:42:51 moving to bigger server would be simpler probably 2025-08-28 16:45:12 Another option (but would probably require a lot of coordination) os to truncate aports 2025-08-28 16:45:28 I believe it's mainly aports that's causing so much load 2025-08-28 16:45:46 gitlab-gitlab-1 container is what uses most memory 2025-08-28 16:46:10 yeah 2025-08-28 16:46:24 the gitlab container hosts puma 2025-08-28 16:46:41 i kinda anticipated slowdown of aports tree over time 2025-08-28 16:46:56 which is why i tried to avoid many smaller files in aports repo 2025-08-28 16:47:11 for example the checksums are embedded in APKBUILD 2025-08-28 16:47:16 But the depth of history itself is also an issue 2025-08-28 16:47:21 otherwise e'd double the number of files in the tree 2025-08-28 16:47:34 269089 commits 2025-08-28 16:47:50 how does that compare to linux kernel tree? 2025-08-28 16:48:18 1.3M commits 2025-08-28 16:48:55 1322572 2025-08-28 16:49:08 so git should be able to handle this in theory 2025-08-28 16:49:34 yes, and git is handling it, but there are certain operations that are expensive and don't scale well 2025-08-28 16:49:46 like git clone 2025-08-28 16:50:30 Over time git is adding more feature to scale better 2025-08-28 16:58:41 btw ncopa is there any reason for checksums for local files like patches? 
i find them very useless and just take my time running checksum over and over :p 2025-08-28 16:59:49 i dunno if they are useless 2025-08-28 17:00:06 maybe they are 2025-08-28 17:00:36 on the other hand, it helps to detect if git add was forgotted in a commit 2025-08-28 17:01:07 so you dont unexpectedly push things without updates you thought you included 2025-08-28 17:04:20 I think ollieparanoid once tried to make that not be a requirement for local files anymore 2025-08-28 17:05:01 yeah true thats a arguments 2025-08-28 17:05:06 *argument 2025-08-29 00:33:43 ncopa: $source is enough to catch missing git add. I think the main blocker at the moment is that the CI only checks for changes of APKBUILD files and not all files in the directory.
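For illustration, one way a CI job could pick up every changed file per aport instead of only APKBUILDs; a rough sketch, assuming a GitLab merge-request pipeline (CI_MERGE_REQUEST_DIFF_BASE_SHA is GitLab's predefined variable, the rest of the wiring is hypothetical):

    # list every changed path in the MR, then reduce to the affected aport directories
    git diff --name-only "${CI_MERGE_REQUEST_DIFF_BASE_SHA:-origin/master}"...HEAD \
        | awk -F/ 'NF>=2 {print $1"/"$2}' | sort -u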