2025-06-01 07:00:46 lotheac: it appears that if you use the internal loadbalancer 2025-06-01 07:01:44 You can still reach the individual endpoints. So we could probably firewall off everything except the api server? 2025-06-01 07:30:52 yes, possible 2025-06-01 07:31:08 even with a public lb 2025-06-01 19:11:23 lotheac: But if we use the public lb, we need to filter by public IP address that's allowed to access it, right? 2025-06-01 19:11:54 If we use internal loadbalancing, and then only expose the api through the loadbalancer, we could just keep the other stuff internal, right? 2025-06-01 22:47:28 ikke: right. that's probably the better option 2025-06-02 06:47:44 the ppc64le CI runner has some weird issues: cd: line 192: can't cd to /builds/alpine/aports: No such file or directory 2025-06-02 06:47:56 https://gitlab.alpinelinux.org/alpine/aports/-/jobs/1879841 2025-06-02 10:12:25 iirc alpinelinux uses zabbix, correct? 2025-06-02 10:12:44 how is it? 2025-06-02 10:16:03 its good 2025-06-02 10:34:05 ncopa: nice, is it lightweight? 2025-06-02 10:35:43 It can run on an rpi 2025-06-02 10:35:53 like an rpi2 2025-06-02 10:36:21 seems like what I want 2025-06-02 10:36:57 I've a server running alpine with a bunch of containers (also running alpine), and it's a pretty tiny server (single core, 2 GB RAM) 2025-06-02 10:37:09 and I've been looking for some monitoring solution 2025-06-02 15:12:45 f_: are you looking for something that just works or something that could be resume fodder? 2025-06-02 15:19:43 iggy: that mostly just works while being very extensible 2025-06-02 15:20:08 I'm currently trying out prometheus but will consider zabbix if it turns out prometheus is not what I want 2025-06-02 15:28:00 f_: rgr, yeah, I was going to say, there's probably more companies actively using prometheus/victoria metrics/loki/mimir/etc, but it can be a bit unwieldy... 
there's also signoz which I've heard is shiny 2025-06-02 17:04:22 iggy: well basically what I'm looking for is something that can do simple metrics and that can throw stuff on an IRC channel 2025-06-02 17:04:38 and that works on alpinelinux (obviously) 2025-06-02 17:05:07 Prometheus + alertmanager + alertmanager-irc-relay seems to be working okay for that 2025-06-02 17:05:39 (I set it up in the end, and also packaged alertmanager-irc-relay this morning) 2025-06-03 05:22:52 sent an email regarding the ppc64le builder, it has storage issues 2025-06-03 08:55:11 is it only fsck? 2025-06-03 08:55:34 I mean filesystem errors due to hard reset, or is it problems with the physical storage 2025-06-03 08:55:57 i think fsck *should* run on reboot, but it might be that it is not enabled 2025-06-03 09:32:34 rebooting gitlab? 2025-06-03 09:32:55 HTTP 502: Waiting for GitLab to boot 2025-06-03 09:34:45 Nope, not me 2025-06-03 14:01:55 ikke: for go-away, are you using a package for it or did you just build it from source and use that 2025-06-03 14:02:11 wondering if it makes sense to package it given I also want to deploy it :p 2025-06-03 14:02:27 (and having it packaged makes it less painful) 2025-06-03 14:02:29 f_: I used the upstream docker image 2025-06-03 14:02:46 ok 2025-06-03 14:03:10 But maybe packaging it would be good 2025-06-03 14:03:15 on it then :) 2025-06-03 14:11:08 really, the advantage of running alpine on the server is that now I get to submit and maintain APKBUILDs for things I need :) 2025-06-03 14:11:28 or more than I would if I were not using it a lot 2025-06-03 14:14:43 That's the best reason to package things 2025-06-03 14:16:29 Absolutely 2025-06-03 14:30:20 especially given I mostly like to deal with system containers rather than app containers 2025-06-03 14:43:23 I now have !85166 if interested 2025-06-03 15:32:17 3.21 and older package repos seem not to have updated today, is that related to the ppc64le borkage? 
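[editor's note] The Prometheus + alertmanager + alertmanager-irc-relay setup mentioned above could be wired together roughly like this on the Alertmanager side. This is a hedged sketch, not the actual deployed config: the listen address/port and the `#alpine-monitoring` channel path are assumptions and must match the relay's own configuration.

```yaml
# /etc/alertmanager/alertmanager.yml (sketch)
# Route every alert to alertmanager-irc-relay via its webhook endpoint.
route:
  receiver: irc

receivers:
  - name: irc
    webhook_configs:
      # assumed relay address and channel path; adjust to the relay's
      # http_host/http_port and configured channels
      - url: 'http://127.0.0.1:8000/#alpine-monitoring'
```

The relay then joins the channel and posts alert notifications as IRC messages.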
2025-06-03 15:33:20 I would not expect that 2025-06-03 15:35:31 The packages seem to have not been built 2025-06-03 15:35:36 I assume you refer to libarchive 3.8.1 2025-06-03 15:35:38 right 2025-06-03 15:36:32 only built for 3.22 and edge it seems 2025-06-03 15:36:51 https://build.alpinelinux.org/ "pulling git" 2025-06-04 05:16:11 qaqland: I ran git commit-graph write --changed-paths on my personal aports fork, and it does appear to have a huge impact on things like git log 2025-06-04 05:16:19 (local repo, not gitlab) 2025-06-04 09:06:18 ikke: yes, but it takes a lot of time to index 2025-06-04 10:36:25 .. 2025-06-04 10:42:26 gitlab.a.o isn't fronted by anubis or go-away? 2025-06-04 10:44:16 omni: not yet 2025-06-04 10:45:26 15k requests where each IP address just makes a single request 2025-06-04 15:45:36 durrendal: fyi, equinix gave a timeline of what they expect 2025-06-04 15:45:55 goal is within 3 months and at most, 6 months 2025-06-04 15:47:35 that should be plenty of time I imagine. I started working on a playbook a couple of weeks ago but got pulled away due to a work project 2025-06-04 15:47:37 https://gitlab.alpinelinux.org/durrendal/deploy-mirror 2025-06-04 15:48:02 fortunately that wrapped up this past Monday, and I just caught up with my packages. 
So I can actually focus on this :) 2025-06-04 15:51:13 Great 2025-06-04 15:54:48 My plan was to get a playbook together with the existing docker-compose deployment, validate it mocks the current functionality, and then see where we can iterate/improve from there 2025-06-04 15:57:52 sounds good 2025-06-04 16:49:08 With regards to https://gitlab.alpinelinux.org/alpine/infra/infra/-/issues/10851, would it be possible to exclude the RSS feeds, just like you did when applying the rate limit (at least that is what i remember from the previous discussion on this channel) 2025-06-04 16:50:51 cely: In fact, I'm already gathering a list of user agents (in a confidential comment) 2025-06-04 16:51:43 Any in particular you care about? 2025-06-04 16:55:02 Yes, glab and newsraft 2025-06-04 16:56:33 The API would be excluded anyway 2025-06-04 16:58:10 Ok, so that leaves newsraft, which uses a "newsraft/(Linux)" User Agent 2025-06-04 16:58:48 Err, that didn't get sent correctly 2025-06-04 16:59:10 "newsraft/[version] (Linux)" 2025-06-04 16:59:52 I'll try to exclude the rss / atom feeds as well, so that shouldn't come up anyway, but I've added it to the list 2025-06-04 17:01:03 Thanks 2025-06-04 18:05:21 lotheac: do you have experience with nftables? 2025-06-04 20:38:55 kunkku: I'm trying to setup dmvpn on a server, but it never seems to get up. The special thing about this server is that it has a private ip address as a subinterface inet 192.168.138.45/17 scope global eth0:1. In /var/log/messages, I see it makes requests from this private IP address: 08[NET] sending packet: from 192.168.138.45[500] to 172.105.69.172[500] (340 bytes). 2025-06-04 20:39:06 03[IKE] giving up after 5 retransmits 2025-06-04 20:40:07 Is that the cause of the issue, and if so, any way around it? 2025-06-05 00:58:18 ikke: yeah I do, what's up 2025-06-05 08:01:44 are the CI builders based on the respective targeted stable branch? 
(and edge for most cases) 2025-06-05 08:03:38 oh, you see that in the build log 2025-06-05 08:18:31 omni: right now, it downgrades to the release it builds for 2025-06-05 09:13:41 ok 2025-06-05 09:13:49 ppc64le package builders are not yet running? 2025-06-05 10:23:10 omni: should be running 2025-06-05 10:45:54 ok, I only see "pulling git" on build.a.o 2025-06-05 11:47:03 🤨 2025-06-05 12:05:04 looks like it's building for edge now 2025-06-05 12:10:26 3.21 as well 2025-06-05 12:41:27 all ppc64le builders (down to 3.19) have been fixed 2025-06-05 18:28:34 lotheac: I just thought about a flaw in my approach (internal LB + exposing API with external LB): worker nodes do not know to connect to the external LB address (since you cannot set externalAddress) when setting nodeLocalLoadbalancing 2025-06-05 18:42:05 I did not get dmvpn to work in this setup yet 2025-06-05 18:42:28 I suspect the private address subinterface is causing issues 2025-06-05 19:54:51 " The konnectivity agents rely on the load balancer to eventually provide connections to all controllers. The LB address is used to brute-force open connections until the agent has the desired number of connections to different controller nodes." 2025-06-05 20:01:03 Ok, now everything seems to work 2025-06-05 20:08:23 https://gitlab.alpinelinux.org/alpine/infra/k8s/ci-cplane-1/-/merge_requests/1 2025-06-06 01:31:10 ah, didn't realize you were intending to use both kinds of lb's :) 2025-06-06 01:45:40 left a few comments on your MR 2025-06-06 12:47:24 ikke: your MR works well, but I wish that when something is missing, the process return value was not zero 2025-06-06 12:49:20 another question is, how should we properly handle it if a missing package is found? 2025-06-06 12:53:22 qaqland: context? 
2025-06-06 12:53:54 sorry, it is https://gitlab.alpinelinux.org/alpine/infra/repo-tools/-/merge_requests/1 2025-06-06 12:54:59 qaqland: ah, yes, I can add the exit code 2025-06-06 18:29:13 Should the ppc64le builder be working normally by now? Since I get some weird LTO failure when I try to compile apk-tools (v3) on ppc64le. 2025-06-06 18:31:10 sertonix[m]: I do expect everything to be back to normal again, but maybe there are some remnants left (I had to recreate the git index of each builder for example) 2025-06-06 19:26:39 sertonix[m]: Where do you get issues, and what issue exactly? 2025-06-06 19:45:17 Here are the CI logs: https://gitlab.alpinelinux.org/sertonix/apk-tools/-/jobs/1886755 2025-06-06 19:45:17 Not sure how to describe that 2025-06-06 19:51:33 Superficially it appears like a build configuration error 2025-06-07 11:51:32 ikke: your options on the new DMVPN setup are: 2025-06-07 11:52:26 (1) make sure the private IP has the secondary flag set (and the public IP does not) 2025-06-07 11:53:11 (2) in /etc/network/interfaces, add "tunnel-local " below the GRE interface 2025-06-07 11:53:42 kunkku: ok, let me test that 2025-06-07 12:03:42 kunkku: tunnel-local works 2025-06-07 12:04:03 (I did not try the other one, since that's technically outside of my control) 2025-06-07 12:04:59 maybe setup-dmvpn could be improved to better handle multiple IPs 2025-06-07 12:05:41 For context, this is on linode, where we enabled a private IP for a loadbalancer to connect to 2025-06-07 12:05:46 linode then sets it up like that 2025-06-07 12:07:05 nhrpd seems to prefer the address with the shortest prefix length 2025-06-07 12:07:42 So a /17 over a /24? 2025-06-07 12:07:47 yes 2025-06-07 12:08:22 okay, that would explain it 2025-06-07 12:19:11 kunkku: thanks btw :) 2025-06-07 12:19:44 np 2025-06-07 17:59:32 raspbeguy: I suppose you are not using gbr-app-4 anymore? 
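[editor's note] kunkku's option (2) would look roughly like this in /etc/network/interfaces. All addresses here are illustrative placeholders, and the rest of the gre1 stanza is whatever setup-dmvpn generated; the fix from the discussion is only the added tunnel-local line, which pins the GRE source to the public address instead of the /17 private subinterface that nhrpd would otherwise prefer.

```
# /etc/network/interfaces (sketch; addresses are illustrative)
iface gre1 inet static
        address 172.16.250.11
        netmask 255.255.255.0
        # force the GRE tunnel source to the public address:
        tunnel-local 203.0.113.10
```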
2025-06-07 19:18:36 Anyone care to test https://gitlab-test.alpinelinux.org/alpine, I've put go-away in front of it 2025-06-07 19:22:53 didn't get a go-away screen, but could log in and load the aports repo 2025-06-07 19:24:33 Yeah, I may have to tweak it further, but I have tried to make it as unobtrusive as possible 2025-06-07 19:24:49 If you are logged in, you should never see anything 2025-06-07 19:26:09 wasn't logged in when initially loading gitlab-test.a.o in a separate browser instance 2025-06-07 19:26:17 but works fine! 2025-06-08 06:15:23 rsync is borked 😧 2025-06-08 11:12:46 restarting gitlab 2025-06-08 11:35:19 ikke: weird pitch, but hear me out - maybe the alerts should be on a different channel than one meant for communication? 2025-06-08 11:37:17 i feel like algitbot is dominating this one 2025-06-08 12:36:01 lotheac: I think having alerts in here is useful, but we could filter out the less important ones. 2025-06-08 12:36:19 arg 2025-06-08 12:39:14 i'm just trying to say that alerts mixed with human communication is unhelpful 2025-06-08 12:39:36 imho it helps to have a different channel for those things 2025-06-08 12:41:16 the battle for removing nonactionable or not-useful alerts is never-ending :) 2025-06-08 12:41:38 it's a fine line, and that's why it's difficult 2025-06-08 12:42:14 and that's why it's important to distinguish it from human communication 2025-06-08 12:42:23 imo 2025-06-08 12:43:14 But it also makes it easier to completely ignore / discard. 
I think there's also value for people here to be aware of issues 2025-06-08 12:45:45 Having the alerts here is for me at least a stimulus to actually do something with them 2025-06-08 12:51:41 for someone who can't do anything about them, any time i see something happening on this channel, my client happily tells me there are new messages, but i never know if it's useless or not 2025-06-08 12:52:10 so if i can't do anything about them, it stands to reason i should ignore algitbot to participate in this channel 2025-06-08 12:52:27 which is probably also not the intention :) 2025-06-08 13:18:45 that all said, i defer to you; i haven't been here very long :) 2025-06-08 13:32:15 Trying to figure out why nginx is not consistently showing the IP from the X-Forwarded-For header. It used to work fine 2025-06-08 13:32:41 I do see some requests showing the original IP 2025-06-08 13:34:33 what's the configuration? you're logging $proxy_add_x_forwarded_for or so? 2025-06-08 13:46:32 set_real_ip_from 172.16.0.0/12; 2025-06-08 13:46:36 real_ip_header X-Forwarded-For; 2025-06-08 13:46:41 real_ip_recursive on; 2025-06-08 13:47:29 i'm unfamiliar with this 2025-06-08 13:47:42 https://nginx.org/en/docs/http/ngx_http_proxy_module.html i usually just use these 2025-06-08 13:48:13 Yes, but in this case, nginx needs to read the header, not set it when forwarding to a proxy 2025-06-08 13:48:41 traefik -> go-away -> nginx 2025-06-08 13:49:56 $proxy_add_x_forwarded_for always reads the client header and appends to it 2025-06-08 13:50:24 i'm confused why the realip module is a thing at all and why it's necessary here 2025-06-08 13:51:53 Because nginx itself also should be able to read the X-Forwarded-For header and set the actual client ip header, not the ip of the server that sent to it 2025-06-08 13:51:54 oh, is it for ip-based access control? 
2025-06-08 13:52:10 no 2025-06-08 13:52:16 Just access logging 2025-06-08 13:52:51 Well, it was before I used go-away also for rate limiting 2025-06-08 13:53:21 i thought it generally (including by default) logs the requesting client as a list of ip's as indicated by x-forwarded-for 2025-06-08 13:54:10 By default, it ignores the X-Forwarded-For header, because it can be easily spoofed 2025-06-08 13:54:33 i might be wrong about the default logging... but the idea is that no server can trust that header field at all anyway 2025-06-08 13:55:03 You can if it's set by a trusted source 2025-06-08 13:55:04 so you log both the address connected to you (and add that to any x-forwarded-for going upstream) as well as whatever they told you 2025-06-08 13:55:20 That's at least what you do with the real_ip module 2025-06-08 13:55:53 the realip module documentation doesn't say anything about what happens if there are multiple addresses in the header that you give to it 2025-06-08 13:56:08 which is more or less normal for x-forwarded-for 2025-06-08 13:56:16 https://serverfault.com/questions/314574/nginx-real-ip-header-and-x-forwarded-for-seems-wrong gives some details 2025-06-08 13:56:44 That's where the recursive directive comes into play 2025-06-08 13:56:54 ah i see 2025-06-08 14:01:15 if it's just for logging, why is the realip module necessary anyway? seems like we're trying to figure out a corner-case behavior in it 2025-06-08 14:12:53 anyway, i guess i see what it's supposed to be doing. if you added go-away in between, is it possible that go-away is connecting to nginx from outside of the ip range you specified in set_real_ip_from? 
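[editor's note] The realip snippet quoted above, assembled into one place for reference (the /12 range is the one pasted in the log; these directives live in the http or server context):

```nginx
# Trust X-Forwarded-For only when the connection comes from the internal
# proxy range (traefik / go-away), and walk the address list recursively
# past trusted hops to recover the real client IP for access logging.
set_real_ip_from  172.16.0.0/12;
real_ip_header    X-Forwarded-For;
real_ip_recursive on;
```

With real_ip_recursive on, nginx strips trusted addresses from the right of the list until it reaches the first untrusted one, which becomes $remote_addr.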
2025-06-08 14:17:38 failing that, maybe go-away is not doing what we expect with x-forwarded-for 2025-06-08 14:20:00 X-Forwarded-For: 140.211.x.y:0, 172.25.0.2 2025-06-08 14:20:06 that's what nginx receives 2025-06-08 14:22:14 :0 looks weird 2025-06-08 14:22:47 yeah 2025-06-08 14:25:06 From a different deployment with go-away pointing to nginx, it does work fine 2025-06-08 14:26:26 is there a traefik in that other deployment? 2025-06-08 14:26:33 yes 2025-06-08 14:26:43 so, no strange interaction there then i guess 2025-06-08 14:26:58 Not the :0 port though 2025-06-08 14:27:44 The go-away version is different though, so maybe due to a change? 2025-06-08 14:27:52 maybe the :0 has always been there in this setup and something(tm) has always normalized it away, but go-away just does a string-copy 2025-06-08 14:27:55 speculating 2025-06-08 14:28:57 the real question is, why is it there anyway? there's not supposed to be any port specification in there anyway 2025-06-08 14:30:23 "The request header field value that contains an optional port is also used to replace the client port (1.11.0). The address and port should be specified according to RFC 3986. " 2025-06-08 14:31:02 can you link the reference? i'm reading https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/X-Forwarded-For and there's no such thing here 2025-06-08 14:31:25 it mentions "de-facto standard" :) 2025-06-08 14:31:29 https://nginx.org/en/docs/http/ngx_http_realip_module.html 2025-06-08 14:31:37 Yeah, it's never standardized 2025-06-08 14:31:51 Hence the X- prefix :P 2025-06-08 14:32:10 Let me test something 2025-06-08 14:32:19 not sure that we care about whether or not realip handles the port, but why is there a port to begin with? where is it added? 
2025-06-08 14:33:23 It must be go-away that adds it 2025-06-08 14:33:31 Just checked, go-away does not receive it with a port 2025-06-08 14:34:11 sounds like we should patch it there then to make it work like it used to :-) 2025-06-08 14:35:27 Just checked if it mattered if I explicitly set --backend-ip-header 2025-06-08 14:35:40 I didn't do that in the first deployment, I did do it here, but no difference 2025-06-08 14:35:56 strange 2025-06-08 14:37:34 checked the code, they just use the same header as the --client-ip-header if you omit it 2025-06-08 14:39:10 in that case it would be traefik appending the :0, no? 2025-06-08 14:43:10 If I check the requests that go-away receives, I don't see it 2025-06-08 14:43:28 I guess it comes from here? https://git.gammaspectra.live/git/go-away/src/branch/master/utils/http.go#L146 2025-06-08 14:43:30 https://git.gammaspectra.live/git/go-away/src/branch/master/utils/http.go#L130 lib/challenge/data.go calls this, i think it parses addresses it sees and serializes them later, giving potential for that :0 thing 2025-06-08 14:43:42 heh :P 2025-06-08 14:43:52 :-D 2025-06-08 14:45:51 yeah, seems like it was introduced in v0.6, and the other deployment runs v0.5.4 2025-06-08 14:46:30 https://git.gammaspectra.live/git/go-away/src/branch/master/lib/challenge/data.go#L356 i think the header values it sends upstream come from here, so might be in the impl of Addr() 2025-06-08 14:47:34 but that's netip.AddrPort, so stdlib stuff 2025-06-08 14:48:27 I see changes between v0.7 and master 2025-06-08 14:48:56 https://git.gammaspectra.live/git/go-away/src/tag/v0.7.0/lib/challenge/data.go 2025-06-08 14:49:26 https://git.gammaspectra.live/git/go-away/commit/aebbfa4eaa8eb6a0e8161ad9686fbc6154f3179c 2025-06-08 14:49:50 what version are we running? 
2025-06-08 14:49:54 v0.7.0 2025-06-08 14:50:50 aebbfa4 definitely looks like it could have the effect of adding that :0 2025-06-08 14:50:59 but it's not in 0.7.0 2025-06-08 14:51:14 My thought was the opposite 2025-06-08 14:51:22 "set client network address without original port on backend-ip-header option" 2025-06-08 14:53:20 true, commit-message wise that tracks, i just read the code and thought that the stringification of the netip.AddrPort might work differently 2025-06-08 14:54:16 https://pkg.go.dev/net/netip#AddrPort.Addr returns the IP only so, yeah 2025-06-08 14:54:31 your thought was right 2025-06-08 15:04:08 i'm thinking just backport aebbfa4 2025-06-08 18:39:09 It seems that Gitlab's "Delete source branch when merge request is accepted" option hasn't been working for some time, since I now have lots of merged branches piling up in my aports fork, despite me having that option enabled 2025-06-08 18:40:14 The oldest branch dating back to 5 April 2025 2025-06-08 19:28:32 Hmm interesting, haven't noticed it yet 2025-06-09 10:21:43 I got a GL sign-in notification where the IP was 172.25.0.2 2025-06-09 10:21:58 is this related to the above nginx issue? 2025-06-09 10:26:13 kunkku: yes 2025-06-09 11:07:15 GL has also sent me occasional notifications on algitbot not having push access to https://github.com/alpinelinux/dmvpn-tools.git 2025-06-09 11:08:03 I wonder if the GH repo settings should be reviewed 2025-06-09 11:11:25 Right 2025-06-09 11:23:46 :-/ 2025-06-09 12:35:52 ^ I ran into this 2025-06-09 12:36:12 https://gitlab.alpinelinux.org/funderscore/aports/-/jobs/1889646 "fatal: unable to create thread: Resource temporarily unavailable" 2025-06-09 12:36:47 or this might be a different runner.. 
sorry for the noise 2025-06-09 19:05:56 lotheac: deployed a new go-away image with the patch and now the client ip is correct again :-) 2025-06-09 19:06:18 https://gitlab.alpinelinux.org/alpine/infra/docker/go-away 2025-06-10 00:44:13 ikke: nice :) 2025-06-10 18:50:26 could the arm package builders be made to prefer IPv4? 2025-06-10 19:07:33 ikke, fyi go-away is in alpine testing now 2025-06-10 19:07:40 thanks to omni for merging the MR ^^ 2025-06-10 19:09:59 funderscore: nice. I'm building it myself right now because I need an unreleased patch on top of 0.7 2025-06-10 19:10:11 which patch? 2025-06-10 19:10:27 Or I suppose it is in the repo you linked earlier 2025-06-10 19:10:56 This one I suppose https://gitlab.alpinelinux.org/alpine/infra/docker/go-away/-/blob/master/patches/set-client-network-address-without-port.patch?ref_type=heads 2025-06-10 19:11:21 Yes 2025-06-10 19:11:27 alright ^^ 2025-06-10 19:13:11 hmm, it fails on our arm* builders, network issues 2025-06-10 19:14:38 ikke: hopefully the openrc init stuff goes upstream as well 2025-06-10 19:14:46 The OpenRC scripting 2025-06-10 19:15:13 and at some point the package moves to community hopefully :) 2025-06-10 19:15:25 I don't see why it can't 2025-06-10 19:15:39 It definitely can 2025-06-10 19:15:57 in the #go-away channel on Libera someone was interested in openrc start scripts (in gentoo) 2025-06-10 19:15:58 It has a huge impact on our traffic 2025-06-10 19:16:10 Ah, nice, wasn't aware there was an irc channel 2025-06-10 19:16:27 and bringing those to the repo was discussed a bit, they're definitely welcome 2025-06-10 19:16:38 yes there's #go-away on Libera.Chat 2025-06-11 11:51:45 ikke: do we dare apply version 93 of vastly config? its still a draft. I would appreciate if you could have a look at it and tell me when it is a good time to apply it. 
2025-06-11 11:52:38 Ok, I'll have a look later 2025-06-11 15:21:17 Additional benefit of go-away, it seems we receive a lot fewer spam account creations 2025-06-11 15:42:13 good job! 2025-06-11 15:44:14 gitlab server load last 7 days: https://imgur.com/a/PwOkFXm 2025-06-11 16:35:38 Running into this regularly, always around 1GiB on that tar download, but I can't reproduce from my network https://gitlab.alpinelinux.org/selfisekai/aports/-/jobs/1892847#L4982 2025-06-11 16:38:51 loongarch64 having any problems? https://gitlab.alpinelinux.org/alpine/aports/-/jobs/1892903 "tests/init.test:save_userdata_compressed -> /usr/include/c++/14.2.0/backward/auto_ptr.h:202: std::auto_ptr< >::element_type* std::auto_ptr< >::operator->() const [with _Tp = {anonymous}::global_state; 2025-06-11 16:38:51 element_type = {anonymous}::global_state]: Assertion '_M_ptr != 0' failed." 2025-06-11 16:40:12 lnl: checking on the host 2025-06-11 16:41:55 lnl: hmm, manually downloading seems to succeed. How often does it happen? 2025-06-11 16:44:26 checking the other attempt for the same ... 2025-06-11 16:44:33 lnl: ipv4 vs ipv6 issue perhaps 2025-06-11 16:44:58 ipv6 is a lot faster 2025-06-11 16:45:15 ...yeah same message on the other attempt 2025-06-11 16:47:44 there's no c++ here so the issue must be with the various compression binaries? 2025-06-11 16:51:05 correction - it's not in the same place -- first one https://gitlab.alpinelinux.org/alpine/aports/-/jobs/1892894 tests/init.test:userdata_type -> /usr/include/c++/14.2.0/backward/auto_ptr.h:202: std::auto_ptr< >::element_type* std::auto_ptr< >::operator->() const [with _Tp = 2025-06-11 16:51:05 {anonymous}::global_state; element_type = {anonymous}::global_state]: Assertion '_M_ptr != 0' failed. 
2025-06-11 16:55:09 lnl: So yes, I can reproduce it when downloading it via IPv4 2025-06-15 07:24:02 lotheac: I'm trying to set up the cplane cluster now as controller+worker and gre1 (dmvpn interface) as privateInterface, but it does not seem to work yet. It has problems communicating with the other nodes over dmvpn, even though on the host everything is reachable 2025-06-15 07:24:10 Have to continue troubleshooting later 2025-06-15 07:24:48 Error from server: Get "https://172.16.250.10:10250/containerLogs/kube-system/coredns-6c77c7d548-xbwgc/coredns": dial tcp 172.16.250.10:10250: i/o timeout 2025-06-15 09:32:13 hmm. that's the kubelet API it's trying to call. what is that log line from? trying to understand if it runs in the host netns or container 2025-06-15 09:32:54 if it's from a container netns, you'll need to have the host perform forwarding 2025-06-15 09:33:35 incidentally i'm at kubecon and related events from today to tuesday, so i might respond a bit slow :) 2025-06-15 09:34:01 i assume you can reach 172.16.250.10:10250 from the host netns? 
2025-06-15 09:38:17 i don't know offhand how it works with k0s, but if the host is performing NAT for container-initiated traffic, i'd just check that it's allowed to forward from the container networks to the dmvpn ip's (and maybe a good idea to double check that the ip ranges don't clash as well) 2025-06-15 09:39:39 if it isn't doing NAT, as in, if the traffic is going from a container IP to kubelet on that dmvpn IP, then we'd need to make sure that the hosts know how to route the return traffic 2025-06-15 10:29:15 lotheac: that log was from k0s kubectl logs pod/coredns-6c77c7d548-xbwgc -n kube-system 2025-06-15 10:29:36 But konnectivity is also logging that when trying to connect to the api 2025-06-15 10:31:06 lotheac: from the host ns, I can do nc -v 172.16.250.10 10250 2025-06-15 10:31:43 from the ns of a pod, I can ping the local gre1 interface, but not through the tunnel 2025-06-15 10:52:25 lotheac: ok, interesting. Checking with tcpdump, I do see the traffic reaching the other node, but the source address is the internal ns ip, so I suppose there's no route back 2025-06-15 10:52:32 So yes, what you mentioned 2025-06-15 11:55:02 if it isn’t doing nat, i think the pod (and maybe service) ip routes have to be added to the hosts manually 2025-06-15 11:56:18 not sure if there is any better way… but maybe the routes can be via localhost? :D 2025-06-15 11:56:34 not sure how this is supposed to work in k0s 2025-06-15 12:04:17 I recall that kuberouter expects to be able to do l2 traffic, but dmvpn only does l3 2025-06-15 12:07:12 i don’t see why l2 would be necessary 2025-06-15 12:08:47 this is a pod (container netns) trying to talk to (some, perhaps other) host netns and i would expect the host needs a route to respond on in all cases 2025-06-15 12:09:31 otherwise it would only work on the interface with the host default route and not the one you designated internal in k0s config, right? 
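[editor's note] The NAT option lotheac sketches above (masquerade pod-originated traffic leaving via the tunnel, so the far side sees the node's gre1 address and replies route back) could look roughly like this in nftables. This is a hedged sketch, not the deployed config: the 10.244.0.0/16 pod network and the "gre1" interface name are assumptions taken from the surrounding discussion.

```nft
# Masquerade pod-sourced traffic that leaves through the DMVPN tunnel,
# so return traffic targets the node's own tunnel address.
table ip k8s_dmvpn {
	chain postrouting {
		type nat hook postrouting priority srcnat; policy accept;
		ip saddr 10.244.0.0/16 oifname "gre1" masquerade
	}
}
```

Without NAT, the alternative discussed later in the log is distributing per-node pod-network routes (which is what kube-router's BGP peering normally does).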
2025-06-15 12:10:05 i mean, even in a case without dmvpn 2025-06-15 12:11:23 If it sends traffic from the pod's internal ip address, you cannot expect it to work over an l3-only network without NAT 2025-06-15 12:11:25 although yeah it’s possible that it’s relying on iptables mangling by kuberouter instead of routing tables, but in either case - why would that require l2 2025-06-15 12:12:03 in kubeadm clusters i _believe_ (not sure) it does nat always 2025-06-15 12:12:21 What CNI does it use with kubeadm? 2025-06-15 12:13:06 https://docs.k0sproject.io/v1.27.1+k0s.0/networking/ 2025-06-15 12:13:07 you get to choose 2025-06-15 12:13:22 i’m most familiar with cilium 2025-06-15 12:13:28 k0s has kuberouter and calico built-in 2025-06-15 12:14:18 right 2025-06-15 12:15:03 i wonder how this works without dmvpn 2025-06-15 12:15:15 I tried calico, and the worker nodes that are external seem to work, but konnectivity on the controller+worker nodes still have issues (didn't dig into it yet) 2025-06-15 12:16:25 I didn't add the pure worker nodes to dmvpn yet 2025-06-15 12:16:43 as in how does a node route traffic to a pod in another node if you have a non-dmvpn l2 network connecting the nodes? 
(even if they are both controllers) 2025-06-15 12:18:18 i mean that l2 network would not be the default route interface even in that case, why would the sender node do arp on that network if it had a dst addr it doesn’t know where to route 2025-06-15 12:19:44 and i would find it weird if they were doing something resembling proxyarp like that on the receiving node too :thinking: 2025-06-15 12:21:19 i could test this on tailscale - which is also l3-only - but it might not happen today 2025-06-15 12:21:43 https://www.kube-router.io/docs/how-it-works/ 2025-06-15 12:22:21 I don't know all the details, I do know that I had issues when I used linode vpc's, which got solved when I used vlan interfaces 2025-06-15 12:23:32 I don't have a lot of time today either 2025-06-15 12:27:25 wait, i guess i already did it on tailscale originally - the cluster worked fine, i just confused things in my head just now because that poc used a public lb 2025-06-15 12:28:42 but the pod->kubelet traffic must have been going over tailscale in that setup because i had no internal net between the nodes and port 10250 was firewalled closed 2025-06-15 12:29:14 i’ll recheck that when i get the chance though 2025-06-15 12:32:24 If I check traffic going over the vlan interface on another cluster, I see traffic going directly from pod network <-> pod network 2025-06-15 12:32:29 on different hosts 2025-06-15 12:34:53 from 10.244.5.x <-> 10.244.4.x. I do see routes for these on each node 2025-06-15 12:46:12 lotheac: So something is preventing these routes to be shared 2025-06-15 12:47:11 I switched back to kube-router. I see the local route on each node, but no routes for the other nodes 2025-06-15 13:41:05 that's weird 2025-06-15 13:43:31 the CNI can of course be doing things in different ways so it is not _necessarily_ a problem that you don't see routes to other nodes for the pod network 2025-06-15 13:44:10 eg. 
on cilium it looks like this with a pod network of 10.128.0.0/9: 2025-06-15 13:44:23 10.128.1.123 dev cilium_host proto kernel scope link 2025-06-15 13:44:23 10.128.3.0/24 via 10.128.1.123 dev cilium_host proto kernel src 10.128.1.123 mtu 1230 2025-06-15 13:44:23 10.128.1.0/24 via 10.128.1.123 dev cilium_host proto kernel src 10.128.1.123 2025-06-15 13:44:23 10.128.0.0/24 via 10.128.1.123 dev cilium_host proto kernel src 10.128.1.123 mtu 1230 2025-06-15 13:44:23 10.128.2.0/24 via 10.128.1.123 dev cilium_host proto kernel src 10.128.1.123 mtu 1230 2025-06-15 13:44:52 cilium may be doing ebpf shenanigans that are not visible to netfilter and least of all routing tables 2025-06-15 13:46:02 but i don't know how calico and kube-router have been implemented in this regard -- it's possible they use nft to do the dynamic part of "which node to send pod X's traffic to" without modifying routing tables 2025-06-15 13:47:25 cni config load failed: failed to load CNI config list file /etc/cni/net.d/10-kuberouter.conflist: error parsing configuration list: unexpected end of JSON input: invalid cni config: failed to load cni config\"" component=containerd stream=stderr 2025-06-15 13:49:11 jq does not have any issues reading those files 2025-06-15 13:49:49 if you just changed the cni, it's possible there are leftover files from the previous cni. 
(they might write to host paths) 2025-06-15 13:50:26 i'm not familiar with how the APIs for CNI itself work 2025-06-15 13:51:03 I rebooted the VMs 2025-06-15 13:51:05 but i suppose since you changed _to_ kube-router, and that's the file being complained about, something is off :) 2025-06-15 13:51:25 host /etc/cni/net.d i would expect to persist over reboots though ;) 2025-06-15 13:51:33 yup 2025-06-15 13:52:21 There were some calico routes left over, so that's why I rebooted 2025-06-15 13:52:28 alright 2025-06-15 13:54:04 Let me try to manually add the missing routes, see if that helps 2025-06-15 14:12:08 ah: Failed to start node BGP server: failed to start BGP server due to: listen tcp4 172.16.250.11:179: bind: address already in use 2025-06-15 14:15:06 I can override the bgp port 2025-06-15 14:22:27 improvement, but not yet healthy 2025-06-15 14:24:58 i recreated the cluster from my `cluster-ci` repo with one change: controller+worker nodes instead of controller, and i guess i did not test it enough originally - sure enough, i can't get pod logs either, timeout to kubelet api (from the apiserver i assume) 2025-06-15 14:25:24 (this is on tailscale) 2025-06-15 14:27:42 clientset.go:234] "cannot connect once" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp: lookup ci-cplane-1.alpinelinux.org: i/o timeout\"" 2025-06-15 14:27:44 although it works for pods on some other nodes *facepalm* 2025-06-15 14:27:52 It seems it cannot resolve dns 2025-06-15 14:31:25 was the bgp port collision because dmvpn also uses bgp? 2025-06-15 14:31:38 yes 2025-06-15 14:33:20 that makes sense. the dns thing doesn't so much - yet :) 2025-06-15 14:34:26 I tried with nsenter, but not sure if that's a good test because it'll probably still use the host's resolv.conf 2025-06-15 14:35:20 right, that could be a problem.
generally kube pods talk to coredns (which is also running as a pod) and have that in resolv.conf 2025-06-15 14:35:49 although generally they also should be able to connect to external ip's, so :) 2025-06-15 14:37:34 yeah, coredns is also failing, so I should probably look at that first 2025-06-15 14:37:45 if you are able to `kubectl exec` from somewhere, that's usually a good way to debug 2025-06-15 14:38:07 it will enter all the necessary kinds of namespaces 2025-06-15 14:38:12 problem is that these containers mostly are FROM scratch containers without any other files 2025-06-15 14:38:18 right 2025-06-15 14:38:41 there is also `kubectl debug` which can create a second container in the target pod with your desired image, which is good enough sometimes 2025-06-15 14:38:52 okay, good to know 2025-06-15 14:42:08 I found where the logs are stored on the host, which makes it easier to debug 2025-06-15 14:42:29 that it does :) 2025-06-15 14:46:13 okay, just for posterity... i messed up above when creating the cluster (forgot to update the lb address in k0sctl.conf before running k0sctl the first time).
recreating the nodes and the cluster, everything came up fine with tailscale https://termbin.com/v0ed 2025-06-15 14:52:25 the routing tables look like this on host https://termbin.com/ffrz -- empirically, pod ip's appear to be in 10.244.0.0/24 (or 10.244.1.0/24 and 10.244.2.0/24 for the builtin kube-system stuff, plus tailscale private ip's for kube-proxy and kube-router pods) 2025-06-15 14:52:46 but crucially the hosts have those routes to the pod network ranges 2025-06-15 14:53:03 lotheac: since changing the bgp port, the routes appear now 2025-06-15 14:53:13 right, i figured :) just wanted to double check 2025-06-15 14:58:33 Ok, I guess what's happening now is that coredns is put on the nodes that are not connected to dmvpn yet 2025-06-15 14:58:56 not necessarily - i'm also having a problem with dns resolution on my test cluster 2025-06-15 14:59:05 ok 2025-06-15 14:59:29 "kubectl get -A pod -o wide" will tell you the pod ip's and which nodes they are scheduled on 2025-06-15 14:59:52 I use k9s, which also shows that 2025-06-15 15:03:21 can you check logs of one of the kube-router pods? 2025-06-15 15:03:41 looks like i'm hitting an iptables issue on this end 2025-06-15 15:05:18 E0615 15:01:44.938646 3343 network_policy_controller.go:287] Aborting sync. Failed to run iptables-save: failed to call iptables-save: exit status 3 (Error: meta sreg is not an immediate 2025-06-15 15:07:52 Don't have that in the logs 2025-06-15 15:08:42 okay. then it's possible that we have dns resolution issues for different reasons :) my pod-to-pod traffic doesn't work at all, apparently 2025-06-15 15:09:23 (which would explain why dns doesn't work -- coredns cannot be reached) 2025-06-15 15:10:43 so - do check if connecting the coredns nodes to dmvpn helps 2025-06-15 15:10:54 i gotta run for now 2025-06-15 15:11:18 back in 7 or 8 hours 2025-06-15 15:11:24 lotheac: o/ 2025-06-15 15:11:31 o/ 2025-06-15 22:57:55 ikke: any progress? 
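The observation above — "the hosts have those routes to the pod network ranges" — amounts to one route per peer node, with that node's address as next hop. A minimal sketch of what an L3 CNI like kube-router effectively programs (the CIDRs and node IPs below are illustrative examples taken from the ranges in this log, not the real cluster inventory):

```shell
# Print the "ip route add" commands an L3 CNI programs on each host:
# every other node's pod subnet is routed via that node's address.
# Input lines: "<podCIDR> <nodeIP>".
gen_pod_routes() {
  while read -r cidr via; do
    echo "ip route add $cidr via $via"
  done
}

printf '%s\n' \
  '10.244.4.0/24 172.16.250.11' \
  '10.244.5.0/24 172.16.250.12' | gen_pod_routes
```

When these routes are missing on a node (as while the BGP port still collided with dmvpn's bgpd), pods on that node can only reach local pods.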
2025-06-16 04:50:30 lotheac: I was kinda thwarted by the fact that the worker nodes run debian (the provider did not support alpine (yet)), which makes dmvpn almost infeasible 2025-06-16 04:53:37 lotheac: backup plan is wg tunnels, which is also connected to dmvpn, but not redundant 2025-06-16 04:57:59 ah I see. well, the tunnels to worker nodes would not really need to be redundant 2025-06-16 04:58:13 assuming different tunnels for different workers 2025-06-16 04:58:17 yes 2025-06-16 04:58:26 But the relay is not redundant 2025-06-16 04:58:33 so if the relay is down, all the worker nodes are disconnected 2025-06-16 04:58:36 right, that just moves the spof 2025-06-16 04:59:26 although maybe we can (eventually) have workers sponsored by more than one provider, in which case not all of them would go down if that relay failed :-) 2025-06-16 04:59:41 (assuming dmvpn on those hypothetical other-provider workers) 2025-06-16 04:59:50 Yes, we do have multiple providers 2025-06-16 05:44:08 should we package dmvpn for debian? doesn’t need to be a publicly available package for this 2025-06-16 05:44:44 or is there some other obstacle that debian presents 2025-06-16 05:48:50 Not sure, but there are multiple components involved 2025-06-16 05:49:29 okay. i can have a look later in the week 2025-06-16 05:49:57 There's things like strongswan and quagga, both of which have patches.
Not sure how critical they are for dmvpn 2025-06-16 05:50:26 And then there's setup-dmvpn, which is very alpine (at least openrc) specific 2025-06-16 05:52:25 right right 2025-06-16 05:53:04 maybe wg makes more sense short term then 2025-06-16 05:53:41 especially if i understood correctly your implication that this provider might support alpine in the future 2025-06-16 05:55:31 I've just finished setting up wg 2025-06-16 05:55:53 The provider didn't make any promises, but would look into it 2025-06-16 06:00:13 I'm AFK now 2025-06-16 06:43:59 ah, got it 2025-06-16 06:44:04 cheers, talk later :) 2025-06-16 08:47:32 morning! I am pretty sure kube-router needs the nodes to be in same L2 net 2025-06-16 08:48:50 So in that case calico would be a better option? 2025-06-16 09:30:10 calico can do vxlan which works over l3 2025-06-16 09:31:08 so that may be a better option if we want a cluster with nodes in different subnets 2025-06-16 09:31:34 I could maybe ask kube-router devs if there are other options 2025-06-16 09:32:03 i was even thinking that it could be an idea to add dmvpn capabilities to kube-router, but not sure it's worth it 2025-06-16 10:36:30 I deployed just the controller+worker nodes with calico, and I think everything is working so far 2025-06-16 10:36:37 at least, everything shows healthy 2025-06-16 10:55:32 Deploying worker nodes, and they have issues (konnectivity not being able to resolve dns, for example) 2025-06-16 13:47:33 gitlab is sending 4x mails again xd 2025-06-16 13:48:57 Fun 2025-06-16 15:05:36 ncopa: do you know what the reason is for kube-router needs L2 connectivity to other nodes? 2025-06-16 15:05:42 needing* 2025-06-16 15:06:22 just trying to understand the constraints :) 2025-06-16 15:19:38 Gitlab appears to be sending me email notifications multiple times. What's interesting is that the "reply to" email is different in each duplicate 2025-06-16 16:28:58 "did you see the memo?
i'll go ahead and make sure you get another copy of that memo" -- lumbergh 2025-06-16 16:30:13 I mean, I get most of the emails sent 4 times... 2025-06-16 16:30:40 Actually, they are all sent 4 times 2025-06-16 16:36:56 trying to figure out why that's happening 2025-06-16 16:49:10 probably an issue with sidekiq 2025-06-16 19:46:50 lotheac: i don't know. I have assumed it was that kube-router alters the routing table instead of doing NAT (this internal address is reached via node address) which will not work if the node address is in a different subnet (in which case you also need to inform the router about the internal address) 2025-06-16 19:47:18 but tbh, it's only me assuming/guessing things. I haven't really studied how it actually works 2025-06-16 20:01:56 I think the firewall on the wg relay is blocking traffic from gre -> wg 2025-06-16 20:04:55 yup 2025-06-16 20:05:08 1 clientset.go:285] change detected in proxy server count (was: 0, now: 3, source: "KNP server response headers") :) 2025-06-16 23:54:29 ncopa: right, i haven't studied it either, but my assumption is the opposite -- ie. that it works on l3. but i don't really use kube-router anywhere so your assumption has maybe more basis than mine :) 2025-06-17 05:41:14 ikke: do you have the k0sctl config for calico somewhere?
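ncopa's guess — plain routing ("internal address is reached via node address") rather than NAT — only works when the next-hop node address is directly reachable. A toy check along those lines, assuming the /24 node subnets seen in this log (an illustration, not a general CIDR matcher):

```shell
# True when both addresses share the first three octets, i.e. the same
# /24 -- the case where a "via <node>" pod route needs no help from an
# intermediate router. For nodes in different subnets, the router in
# between must also carry the pod routes.
same_24() {
  [ "${1%.*}" = "${2%.*}" ]
}

same_24 172.16.250.11 172.16.250.12 && echo "same /24, direct route works"
same_24 172.16.250.11 172.16.252.34 || echo "different subnets, router needs the pod routes too"
```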
2025-06-17 05:44:23 lotheac: I've pushed it to the MR now 2025-06-17 05:44:45 I think it's just enough to change kuberouter for calico as provider 2025-06-17 05:44:58 Most options I set are default, except for mtu 2025-06-17 05:45:50 Still debugging: "could not read stream" err="rpc error: code = Unavailable desc = error reading from server: EOF" serverID="204f42e6e7ec7f04fa04df7316101ab4" agentID="172.16.252.34" 2025-06-17 05:45:56 from konnectivity agent 2025-06-17 06:28:52 there was just now a talk concerning bgp in this context here at kubecon 2025-06-17 06:29:17 so i wanna try things out once i have some time 2025-06-17 06:29:51 the gist was that running two bgpd’s, the one in the cni needs to peer with localhost 2025-06-17 06:30:54 i’m not really familiar with bgp, but seems like that’s what could be going on here as well 2025-06-18 12:53:35 is the loongarch64 ci runner having some trouble? 2025-06-18 12:54:45 i guess it works after a retry but i feel like the failures are more often than usual 2025-06-18 12:57:36 hmmm 2025-06-18 12:57:53 I remember seeing failures on the loongarch64 runner before but didn't think much of it 2025-06-18 13:10:49 achill: I did unpause a runner yesterday 2025-06-18 13:10:57 achill: what kind of failures? 
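The "peer with localhost" gist from the kubecon talk maps onto Calico's global BGPPeer resource, roughly as below. This is a hedged sketch following the pattern in Calico's BGP docs, not a tested config; the AS number is a private-range placeholder and would have to match whatever the dmvpn-side bgpd speaks.

```shell
# Write a global Calico BGPPeer that makes each node's BGP daemon peer
# with a daemon on the same host (127.0.0.1), e.g. the dmvpn routing
# daemon. asNumber 64512 is a placeholder from the private ASN range.
cat > bgppeer.yaml <<'EOF'
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: node-local-bgpd
spec:
  peerIP: 127.0.0.1
  asNumber: 64512
EOF
# Applying it would be: calicoctl apply -f bgppeer.yaml (needs a live cluster)
```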
2025-06-18 13:30:17 i saw a failure on loongarch64 this morning as well https://gitlab.alpinelinux.org/lotheac/aports/-/jobs/1900547 -- retry fixed it 2025-06-18 13:30:33 seems like one of the tests timed out 2025-06-18 13:31:27 https://gitlab.alpinelinux.org/fossdd/aports/-/jobs/1900848 2025-06-18 13:35:01 ah not this kind of error 2025-06-18 13:53:52 Ah ok, the image is missing and not allowed to pull it 2025-06-18 15:36:27 Turned out that was the old builder used to bootstrap loongarch 2025-06-20 05:20:49 I've reduced the sensitivity of these icmp ping loss checks 2025-06-24 10:11:07 ikke: this is that talk i mentioned; video is not yet available but slides are https://static.sched.com/hosted_files/kccncjpn2025/6f/KubeCon%20Japan%202025%20BGP%20Peering%20Patterns%20for%20Kubernetes%20Networking%20at%20Preferred%20Networks.pdf 2025-06-24 10:11:43 ikke: so the idea is... add a 127.0.0.1 peer to calico bgpd to connect the dmvpn network to the cluster network 2025-06-24 10:12:38 i don't have a dmvpn setup myself so i am not totally sure about this though :) 2025-06-24 10:12:57 it'd be something like this https://docs.tigera.io/calico/latest/networking/configuring/bgp#configure-a-global-bgp-peer 2025-06-24 10:19:39 Interesting 2025-06-24 10:19:54 I guess it means that each cluster would need to have a unique subnet? 2025-06-24 10:20:08 yeah, that's right 2025-06-24 10:20:49 does pod-to-pod connectivity work in the cluster with the current configuration? 2025-06-24 10:21:02 I haven't extensively tested that yet 2025-06-24 10:21:04 if the pods are on different nodes i mean 2025-06-24 10:21:17 i have a hunch that it might not work as is 2025-06-24 10:22:56 konnectivity-agent still complains about: "could not read stream" err="rpc error: code = Unavailable desc = error reading from server: EOF" serverID="3c5fa1d2e2bd51ad14e4422662393535" agentID="172.16.250.10" 2025-06-24 10:24:51 not totally sure if that is related. 
i don't think the error would be EOF if some address was not routable 2025-06-24 10:25:03 yeah, true 2025-06-24 10:27:49 ah, okay, we're using vxlan for calico which means it doesn't use bgp anyway 2025-06-24 10:28:12 or at least that's my reading of the docs 2025-06-24 10:33:40 so i guess it might be working already... at least aside from the konnectivity-agent thing 2025-06-24 10:39:03 lotheac: I have something you could help with, if this is something you like to do: https://gitlab.alpinelinux.org/alpine/infra/k8s/ci-cplane-1/-/merge_requests/5 2025-06-24 10:39:16 This deploys gitlab-runner based on the helm chart from gitlab 2025-06-24 10:39:36 The problem is that the entrypoint scripts they use only allow for a single runner 2025-06-24 10:39:56 They inject the entrypoint scripts as configmaps 2025-06-24 10:40:39 It would be nice if we could have some way to deploy multiple runners on the same instance 2025-06-24 10:40:42 could always just do helm template to dump a starting point and massage the manifests as needed 2025-06-24 10:40:56 i'll check it out 2025-06-24 10:41:02 That's what I use kustomize patches for 2025-06-24 10:41:23 in the meantime i was wondering about the git repo -> cluster deployment of the kube resources 2025-06-24 10:41:34 am i correct that you're doing things semi-manually? 2025-06-24 10:42:06 lotheac: The lower-level stuff, I still do with kapp deploy 2025-06-24 10:42:30 Creating namespaces, roles etc 2025-06-24 10:42:50 But deploying the apps is automated in CI 2025-06-24 10:42:57 i warmly recommend fluxcd.
give it credentials to read git, point it to the repo url, and things automatically happen when merged 2025-06-24 10:43:15 and then you can put the secrets (encrypted) in the repo as well with sops+age 2025-06-24 10:43:22 https://gitlab.alpinelinux.org/alpine/infra/k8s/ci-cplane-1/-/jobs/1901167 2025-06-24 10:44:17 lotheac: yeah, I have been looking at fluxcd 2025-06-24 10:45:00 i haven't used kapp, but seems pretty straightforward 2025-06-24 10:45:35 the only thing i'm wondering about is -- if gitlab runners apply the kube manifests, i guess that won't happen if the runners happen to not be working :) 2025-06-24 10:45:55 yeah. There are some challenges sometimes due to kubernetes updating things as well, which kapp then sees as changes 2025-06-24 10:46:08 right, flux is pretty good at that stuff 2025-06-24 10:46:53 gitlab has integration with fluxcd, but it does require an extra agent in gitlab as well, and some configuration, which is poorly documented 2025-06-24 10:48:11 lotheac: the advantage of using kustomize patches is that you can easily update the helm chart without having to manually redo all the changes 2025-06-24 10:49:41 imho it depends on what you're deploying 2025-06-24 10:49:58 and how it's maintained upstream 2025-06-24 10:50:02 Yes, of course 2025-06-24 10:50:19 If it's completely different from what the helm chart provides, then I would not even bother with the helm chart 2025-06-24 10:50:55 i'm using a weird combination of flux Kustomize and HelmRelease resources, sometimes with postRenderer patches, and other times just straight up kustomize of more-or-less manually-maintained manifests 2025-06-24 10:51:04 depending on the thing 2025-06-24 10:51:12 nod 2025-06-24 10:52:02 lotheac: So to give a bit more insight into what my goal is: I need at least two runners, one for x86_64 jobs and one for x86 jobs 2025-06-24 10:52:43 https://gitlab.alpinelinux.org/alpine/infra/k8s/ci-cplane-1/-/blob/master/k0sctl.yaml?ref_type=heads#L25-40 2025-06-24 10:52:50
okay, that makes sense... did you try just having two instances of the helm release? 2025-06-24 10:53:29 Yes, that's always a possibility 2025-06-24 10:53:29 i'm gonna take a closer look later, but I gotta be somewhere in a few minutes so :) 2025-06-24 10:53:34 yeah, no problem 2025-06-24 10:54:03 but anyways, yeah, seems like it won't be a problem anyhow. we just want to end up with two deployments with different nodeselectors 2025-06-24 10:54:09 (or daemonsets) 2025-06-24 10:54:27 Technically we only need a single gitlab-runner instance 2025-06-24 10:55:00 sounds like i don't understand how it works on that level yet, but i will take a look :D 2025-06-24 10:55:57 The node affinity stuff is configured in the kubernetes executor config 2025-06-24 10:56:10 https://docs.gitlab.com/runner/executors/kubernetes/#define-a-list-of-node-affinities 2025-06-24 10:56:26 that basically affects the pods the runner creates for CI jobs 2025-06-24 10:57:39 (toml is really bad for this kind of deeply nested config) 2025-06-24 10:58:55 okay I see, so ”runner” is the term for the thing that manages pods for jobs 2025-06-24 10:59:22 and you want one runner to be able to manage different kinds of pods 2025-06-24 10:59:39 One agent can manage multiple runners 2025-06-24 11:00:13 But the default helm chart only supports one runner per agent 2025-06-24 11:00:58 right, gotcha 2025-06-24 11:01:56 Later I'd also like to incorporate other runners, so having one agent per runner scales quite poorly 2025-06-24 13:20:20 ikke: are the alpinelinux.org/arch=x86 machines actually running a 32-bit OS, or 32-bit hardware? 2025-06-24 13:20:53 No 2025-06-24 13:21:24 We run the containers with linux32 2025-06-24 13:22:25 ok, i see 2025-06-24 13:23:15 i was wondering about gitlab-runner-helper -- which apparently needs to run on the runner nodes, and doesn't have an x86 build available 2025-06-24 13:23:42 kubernetes.io/arch is still amd64 on these nodes?
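The "deeply nested" kubernetes executor TOML under discussion looks roughly like this for one runner pinned to the x86 nodes. This is a sketch only: the name, url and image are placeholders, and `node_selector` is the simpler cousin of the affinity list linked above; only the `alpinelinux.org/arch` label is taken from this conversation.

```shell
# One [[runners]] entry whose job pods are scheduled onto the x86 nodes
# via the alpinelinux.org/arch node label.
cat > runner-x86.toml <<'EOF'
[[runners]]
  name = "ci-x86"
  url = "https://gitlab.alpinelinux.org"
  executor = "kubernetes"
  [runners.kubernetes]
    image = "alpine:latest"
    [runners.kubernetes.node_selector]
      "alpinelinux.org/arch" = "x86"
EOF
```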
2025-06-24 13:23:56 We do have our own runner-helper image 2025-06-24 13:23:59 ah 2025-06-24 13:24:05 lotheac: yes 2025-06-24 13:24:51 how does a job specify it wants to run on an x86 runner? i'm still a bit confused about the logic in the agent 2025-06-24 13:25:02 Through tags 2025-06-24 13:25:34 With the current workflow, we create a runner in advance in gitlab, where we set tags 2025-06-24 13:26:13 That gives a token, which we provide to the runner 2025-06-24 13:26:30 So jobs select they want a runner with the x86 tag, and then only runners that have that tag will pick up the job 2025-06-24 13:26:45 i can't find anything called "tag" in [[runners]] https://docs.gitlab.com/runner/configuration/advanced-configuration/#the-runners-section that would correspond to that 2025-06-24 13:27:34 It's not part of the runner configuration itself, but associated with the token the runner uses (previously, it could be provided through the register command) 2025-06-24 13:27:48 hello, did you have some dns problem?
2025-06-24 13:27:57 fred42: Not that I'm aware of 2025-06-24 13:28:00 lotheac: https://docs.gitlab.com/runner/register/ 2025-06-24 13:28:27 hi, at work we have some trouble with the dns of the cdn if I test the dns with dnslookup it fails with 1.1.1.1 but it's a success with 8.8.8.8 2025-06-24 13:28:27 nslookup alpinelinux.org 1.1.1.1 2025-06-24 13:28:27 Server: 1.1.1.1 2025-06-24 13:28:27 Address: 1.1.1.1#53 2025-06-24 13:28:29 Non-authoritative answer: 2025-06-24 13:28:31 Name: alpinelinux.org 2025-06-24 13:28:33 Address: 213.219.36.190 2025-06-24 13:28:35 ;; Got SERVFAIL reply from 1.1.1.1 2025-06-24 13:28:37 ** server can't find alpinelinux.org: SERVFAIL 2025-06-24 13:28:39 nslookup alpinelinux.org 8.8.8.8 2025-06-24 13:28:41 Server: 8.8.8.8 2025-06-24 13:28:43 Address: 8.8.8.8#53 2025-06-24 13:28:45 Non-authoritative answer: 2025-06-24 13:28:47 Name: alpinelinux.org 2025-06-24 13:28:49 Address: 213.219.36.190 2025-06-24 13:28:51 Name: alpinelinux.org 2025-06-24 13:28:53 Address: 2a01:7e00:e000:2fc::4 2025-06-24 13:28:53 ikke: thanks 2025-06-24 13:28:58 fred42: please use a paste service 2025-06-24 13:29:05 like you've been told before 2025-06-24 13:29:53 sorry I was disconnected and didn't see a reply in the other channel 2025-06-24 13:31:03 linode is hosting our dns 2025-06-24 13:39:51 here is a link to a paste service https://paste.centos.org/view/44885497 I will look at linode status 2025-06-24 13:59:45 this particular helm chart is starting to look like it's better to just hand-manage the resources :p lots of weirdness with startup shell scripts stored in a configmap and assumptions abound 2025-06-24 14:00:24 like you said - there's no way in this chart to register multiple runners from multiple tokens 2025-06-24 14:22:53 ikke: did you consider using the runner operator?
https://docs.gitlab.com/runner/configuration/configuring_runner_operator/ 2025-06-24 14:59:19 ikke: here's one idea https://gitlab.alpinelinux.org/alpine/infra/k8s/ci-cplane-1/-/merge_requests/6 2025-06-24 15:10:27 lotheac: no, I was not aware of the operator 2025-06-24 15:10:43 that one might be a better idea 2025-06-24 15:11:05 without looking at how it's implemented :) 2025-06-24 15:11:28 at least on a theoretical level it would make sense to have the operator manage the lifecycle of the runners 2025-06-24 15:12:36 it should be able to use the reconciliation loop to make sure they're registered etc 2025-06-24 15:12:58 https://gitlab.com/gitlab-org/gl-openshift/gitlab-runner-operator 2025-06-24 15:13:17 Does it work outside of openshift? 2025-06-24 15:14:03 i don't see why it wouldn't 2025-06-24 15:14:15 "The GitLab Runner operator aims to manage the lifecycle of GitLab Runner instances in your Kubernetes or Openshift container platforms" 2025-06-24 15:14:52 "It therefore will presumably run on any container platform that is derived from Kubernetes" 2025-06-24 15:14:54 Ok, I see 2025-06-24 15:17:48 the helm chart, as well as my MR, are quite hacky...
i'm kinda expecting/assuming the operator to bear the responsibility for whatever hacks it needs to do :p on paper the CRD looks pretty good, can even apply tags separately to each runner 2025-06-24 15:18:11 Yeah, managing the runners with CRDs looks nice 2025-06-24 15:18:59 But looking at the install instructions, they want you to install an Operator Lifecycle Manager with curl | bash 2025-06-24 15:19:16 yeah, no, bad instructions :D 2025-06-24 15:22:44 that seems to be some sort of meta-operator, heh 2025-06-24 15:22:50 yeah 2025-06-24 15:22:57 give me the sauce 2025-06-24 15:23:53 https://gitlab.com/gitlab-org/gl-openshift/gitlab-runner-operator/-/releases 2025-06-24 15:24:27 yup, that's it 2025-06-24 15:25:19 though it includes something that won't apply: apiVersion: operators.coreos.com/v1alpha1, kind: ClusterServiceVersion 2025-06-24 15:25:41 could just patch that out with kustomize 2025-06-24 15:26:10 node 2025-06-24 15:26:12 nod 2025-06-24 15:27:23 but... looks like that resource is what is actually responsible for creating the operator deployment 2025-06-24 15:27:33 so it needs some mangling 2025-06-24 15:28:22 ah, no, i guess i looked at a file that was meant for openshift only. operator.k8s.yaml looks better 2025-06-24 15:28:54 ah 2025-06-24 15:39:09 lotheac: thanks, I'll look at the operator when I have time 2025-06-24 15:39:20 cheers, no problem 2025-06-25 04:31:59 ikke: on my own cluster (without a gitlab that i can register my runners on), this creates reasonable looking pods https://gitlab.alpinelinux.org/alpine/infra/k8s/ci-cplane-1/-/merge_requests/7 2025-06-25 04:32:39 https://termbin.com/lxlx looking like this 2025-06-25 05:19:44 lotheac: thanks! Note that the runner pod itself does not necessarily need to run on the builder nodes. They can just run on the control plane nodes. 
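"Could just patch that out with kustomize" might look like the delete patch below — a sketch only: the manifest filename follows the `operator.k8s.yaml` mentioned above, but the resource name is a placeholder rather than one taken from the actual release.

```shell
# Kustomization that pulls in the operator manifest but drops the
# OLM-only ClusterServiceVersion via a strategic-merge delete patch
# matched by the target selector. Names here are placeholders.
cat > kustomization.yaml <<'EOF'
resources:
  - operator.k8s.yaml
patches:
  - target:
      group: operators.coreos.com
      version: v1alpha1
      kind: ClusterServiceVersion
    patch: |-
      $patch: delete
      apiVersion: operators.coreos.com/v1alpha1
      kind: ClusterServiceVersion
      metadata:
        name: placeholder
EOF
```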
2025-06-25 05:22:25 So I think we need https://docs.gitlab.com/runner/configuration/configuring_runner_operator/#customize-configtoml-with-a-configuration-template to specify the node selector / affinity for the build pods 2025-06-25 05:23:23 But I think this looks really nice, much better than trying to do it through the regular helm chart 2025-06-25 05:28:48 sure, i mean you can just add a kustomize patch to the Deployment from operator.k8s.yaml 2025-06-25 05:28:57 if you want to make it run on cplane ndoes 2025-06-25 05:28:58 nodes* 2025-06-25 05:29:36 for the builder pods -- those which are configured in the Runner -- I already added nodeSelectors in the example 2025-06-25 05:29:58 with runner.spec.podSpec 2025-06-25 05:32:08 at least that's how i think it works :-) 2025-06-25 05:35:17 Right, that would make sense 2025-06-25 05:36:34 hm, but i guess i'm mistaken: that podSpec patch was applied to the runner pod (which you can see on my termbin paste) 2025-06-25 05:37:03 not sure if it _also_ applies those patches to the actual builder pods 2025-06-25 05:38:14 Ok, so each runner still gets a separate pod, but I guess that's how the operator works 2025-06-25 05:38:20 yeah, seems so 2025-06-25 05:38:41 Then we would probably need the explicit config.toml template 2025-06-25 05:39:10 unless there is some hook from the operator into the build pods 2025-06-25 05:39:36 i think it's worth trying out as is and seeing what the actuar builder pods look like to verify our assumptions 2025-06-25 05:39:40 actual* 2025-06-25 05:39:48 yup, will try to test it this evening 2025-06-25 05:40:15 i don't think it's necessarily a huge problem to have a separate pod per runner... unless they consume a lot of resources 2025-06-25 05:40:22 No, should not 2025-06-25 05:40:28 but i would assume they are not that hungry 2025-06-25 05:41:44 CPU: 2, MEM: 217 2025-06-25 05:42:19 that's requests or actual usage? 2025-06-25 05:42:40 the runner pod in my example had resources: {} ie.
unlimited 2025-06-25 05:43:01 (but also zero, for scheduling purposes) 2025-06-25 05:43:58 actual usage 2025-06-25 05:44:07 reading a bit more of the docs i think you're right, we probably need the custom config.toml template -- but at least we can provide that separately to each runner 2025-06-25 05:44:14 that's pretty high cpu then! 2025-06-25 05:44:15 42 MEM/R% 2025-06-25 05:45:47 I think that's millicpu 2025-06-25 05:45:52 ah okay :D 2025-06-25 05:46:02 i see an integer and assume a whole cpu 2025-06-25 05:46:20 0 %CPU/R 2025-06-25 05:46:41 yeah, no biggie then 2025-06-25 11:56:04 ikke btw: tiny nit, but the display name in https://gitlab.alpinelinux.org/alpine/infra/k8s/ci-cplane-1 is missing an "l" (cpane instead of cplane) 2025-06-25 11:56:47 can be changed from settings/general 2025-06-25 11:57:00 Ah yes. I fixed my local repo but didn't rename the project yet 2025-06-25 12:39:30 Seems the x86_64 3.22 builder is stuck, it isn't trying to pull from git 2025-06-25 12:40:02 or it's down, alternatively 2025-06-25 13:08:27 I can check later 2025-06-25 13:45:20 thanks 2025-06-25 15:07:58 messagelib was hanging 2025-06-25 20:38:57 is there something wrong with the x86_64 builder again?
I get that it takes a while to build chromium, but it's been at it for multiple hours now 2025-06-25 20:40:02 for over 6 hours apparently 2025-06-26 14:52:45 lotheac: oh in fact, I already did rename the project as well 2025-06-26 15:46:50 lotheac: seems like the permissions that the operator needs are quite broad 2025-06-26 15:47:34 a clusterrole that allows reading all secrets, creating arbitrary role bindings 2025-06-26 15:47:48 I suppose that's the limit of rbac 2025-06-27 01:18:40 ikke: i don’t think it necessarily needs all that, could probably be reduced to a role on the target ns to watch for the Runner objs and to manage pods 2025-06-27 01:19:02 but i haven’t investigated 2025-06-27 01:19:49 any roles/rolebindings it needs for builder pods could theoretically be precreated 2025-06-27 03:46:27 ikke: It's late on my end but sharing before I forget, I have a first working cut of the mirror playbook. https://gitlab.alpinelinux.org/durrendal/deploy-mirror 2025-06-27 03:49:42 there's an example of the playbook's output in the readme, just to make it easier to see what it does. But I can also spin up a VM somewhere and push your ssh pubkey to it so you can poke the resulting system if that would be more helpful. 2025-06-27 04:23:03 ikke: yes, you renamed the project (the url is correct) but the display name is still missing the l 2025-06-27 04:23:35 i guess project name/url is separate from display name 2025-06-27 04:25:10 Right, fixed 2025-06-27 04:25:34 cheers :) 2025-06-27 11:28:45 btw i think the armhf CI has not a lot of disk space left 2025-06-27 12:39:51 is the aports mirror on codeberg in any way legitimate? 2025-06-27 12:40:38 https://codeberg.org/alpinemirrorbot?tab=activity 2025-06-27 13:02:04 oh that's mine lol 2025-06-27 13:02:12 i don't even remember what i was using it for 2025-06-27 13:02:20 i guess i can remove it and stop abusing codeberg's resources 2025-06-27 15:13:53 achill: cleaning it up 2025-06-28 14:44:56 durrendal: thanks.
Reason I asked is https://gitlab.alpinelinux.org/alpine/aports/-/issues/17298 2025-06-28 14:54:04 I'd be more than happy to help with documentation. It doesn't look like I can assign the issue to myself, but if you can assign it to me I'll roll with it! 2025-06-28 14:54:48 done 2025-06-28 14:55:40 Thanks! Do we have a target date for 3.23.0? 2025-06-28 14:55:49 ~november 2025-06-28 14:55:59 releases are always may and november 2025-06-28 15:00:52 Oh perfect I'll remember that going forward :) thanks ikke! 2025-06-28 15:01:25 The milestone also lists the date 2025-06-28 15:12:22 Oh it does! I entirely missed that