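Symptoms:
I run Kubernetes with the Calico network plugin. Over the past two days the cluster DNS service started misbehaving. Investigation showed that, of the two DNS pods, the one on the master node had an IP that could not be pinged, so DNS could not serve requests reliably.
Next I checked the network plugin's pods and found that the calico-node pod on the master node was not healthy.
The pod list and the errors were: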
NAME                                       READY   STATUS    RESTARTS       AGE    IP               NODE            NOMINATED NODE   READINESS GATES
calico-kube-controllers-7cd8b89887-vfzwc   1/1     Running   2 (117d ago)   132d   10.244.118.109   xy-5-server14
calico-node-9qtv5                          1/1     Running   0              132d   192.168.5.19     xy-5-server19
calico-node-lxg9k                          0/1     Running   0              34s    192.168.5.14     xy-5-server14
calico-node-rmscn                          1/1     Running   0              33s    192.168.5.17     xy-5-server17
calico-typha-d4f58c4c9-8nf76               1/1     Running   0              132d   192.168.5.17     xy-5-server17
calico-typha-d4f58c4c9-dbf8g               1/1     Running   0              132d   192.168.5.14     xy-5-server14
csi-node-driver-92rbg                      2/2     Running   0              132d   10.244.116.196   xy-5-server17
csi-node-driver-gpgwd                      2/2     Running   0              132d   10.244.6.82      xy-5-server19
csi-node-driver-h9kbw                      2/2     Running   0              132d   10.244.118.101   xy-5-server14
[root@xy-5-server14 calico]# kubectl -n calico-system describe pod calico-node-lxg9k
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 47s default-scheduler Successfully assigned calico-system/calico-node-lxg9k to xy-5-server14
Normal Pulled 47s kubelet Container image "docker.io/calico/pod2daemon-flexvol:v3.24.5" already present on machine
Normal Created 47s kubelet Created container flexvol-driver
Normal Started 47s kubelet Started container flexvol-driver
Normal Pulled 46s kubelet Container image "docker.io/calico/cni:v3.24.5" already present on machine
Normal Created 45s kubelet Created container install-cni
Normal Started 45s kubelet Started container install-cni
Normal Pulled 42s kubelet Container image "docker.io/calico/node:v3.24.5" already present on machine
Normal Created 42s kubelet Created container calico-node
Normal Started 41s kubelet Started container calico-node
Warning Unhealthy 40s (x2 over 41s) kubelet Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
Warning Unhealthy 37s kubelet Readiness probe failed: 2023-07-18 08:18:19.246 [INFO][379] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.5.17,192.168.5.19
Warning Unhealthy 27s kubelet Readiness probe failed: 2023-07-18 08:18:29.242 [INFO][423] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.5.17,192.168.5.19
Warning Unhealthy 17s kubelet Readiness probe failed: 2023-07-18 08:18:39.246 [INFO][455] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.5.17,192.168.5.19
Warning Unhealthy 7s kubelet Readiness probe failed: 2023-07-18 08:18:49.249 [INFO][486] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.5.17,192.168.5.19
More generally, none of the pod IPs on the master node could be pinged.
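The readiness probe message in the events above already names the peers that the master's calico-node could not establish BGP with. A minimal sketch of pulling those addresses out of such a message, for example when scanning many events; the function name `unestablished_peers` is hypothetical and the message text is copied from the events above:

```python
import re


def unestablished_peers(message):
    """Extract peer IPs from a 'BGP not established with ...' readiness message."""
    match = re.search(r"BGP not established with ([0-9.,]+)", message)
    return match.group(1).split(",") if match else []


# Message text copied from the describe output in this post.
MESSAGE = (
    "calico/node is not ready: BIRD is not ready: "
    "BGP not established with 192.168.5.17,192.168.5.19"
)

print(unestablished_peers(MESSAGE))
```

This only tells you which peerings are down, not why; the cause only shows up when looking at the peering from the other nodes' side.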
Finding the cause
Install the Calico client (calicoctl); reference: www.cnblogs.com/varden/p/15…
On the master:
[root@xy-5-server14 ~]# calicoctl node status
Calico process is running.
IPv4 BGP status
+--------------+-------------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+-------------------+-------+----------+-------------+
| 192.168.5.17 | node-to-node mesh | up | 08:36:51 | Established |
| 192.168.5.19 | node-to-node mesh | up | 08:37:15 | Established |
+--------------+-------------------+-------+----------+-------------+
IPv6 BGP status
No IPv6 peers found.
The peerings look fine from here...
On node1:
[root@xy-5-server17 ~]# calicoctl node status
Calico process is running.
IPv4 BGP status
+--------------+-------------------+-------+------------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+-------------------+-------+------------+-------------+
| 192.168.5.19 | node-to-node mesh | up | 2023-03-07 | Established |
| 10.4.0.1 | node-to-node mesh | start | 2023-07-17 | Connect |
+--------------+-------------------+-------+------------+-------------+
IPv6 BGP status
No IPv6 peers found.
There is the problem: the master should be peering as 192.168.5.14, but node1 lists it as 10.4.0.1.
node2 shows the same problem:
[root@xy-5-server19 ~]# calicoctl node status
Calico process is running.
IPv4 BGP status
+--------------+-------------------+-------+------------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+-------------------+-------+------------+-------------+
| 192.168.5.17 | node-to-node mesh | up | 08:18:24 | Established |
| 10.4.0.1 | node-to-node mesh | start | 2023-07-17 | Connect |
+--------------+-------------------+-------+------------+-------------+
IPv6 BGP status
No IPv6 peers found.
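The check done by eye above (a peer that is not Established, or whose address is outside the node network) can be sketched as a small script. This is only an illustration: the function name `suspicious_peers` is hypothetical, and the node network 192.168.5.0/24 is an assumption inferred from the three node addresses, since the post does not state the netmask.

```python
import ipaddress


def suspicious_peers(status_text, node_cidr):
    """Return (peer, state, info) rows that are not Established or lie outside node_cidr."""
    network = ipaddress.ip_network(node_cidr)
    flagged = []
    for line in status_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 5:
            continue  # border lines and free text
        peer, _peer_type, state, _since, info = cells
        try:
            address = ipaddress.ip_address(peer)
        except ValueError:
            continue  # header row
        if info != "Established" or address not in network:
            flagged.append((peer, state, info))
    return flagged


# Rows copied from the `calicoctl node status` output on node1.
SAMPLE = """\
+--------------+-------------------+-------+------------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+-------------------+-------+------------+-------------+
| 192.168.5.19 | node-to-node mesh | up | 2023-03-07 | Established |
| 10.4.0.1 | node-to-node mesh | start | 2023-07-17 | Connect |
+--------------+-------------------+-------+------------+-------------+
"""

# 192.168.5.0/24 is an assumed node network, not stated in the post.
print(suspicious_peers(SAMPLE, "192.168.5.0/24"))
```

Running it against the output of each node makes the odd peer stand out even when the list of nodes is long.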
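I found posts online describing the same situation: www.jianshu.com/p/4b175e733…
cloud.tencent.com/developer/a…
The fix is to tell Calico which network interface to use. However, I installed Calico with the operator, so editing the calico-node DaemonSet directly has no effect: the operator reverts the change. That differs from what those posts describe.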
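One way to confirm which address Calico picked for a node is to compare it with the node's InternalIP. The sketch below assumes that calico-node records its detected address in the `projectcalico.org/IPv4Address` node annotation (as it does, to my knowledge, when Calico uses the Kubernetes datastore) and works on the parsed output of `kubectl get node <name> -o json`. The function name is hypothetical and the sample node is hand-written to resemble the master in this post, not real cluster output.

```python
import json

# Assumption: calico-node writes its autodetected address to this annotation.
ANNOTATION = "projectcalico.org/IPv4Address"


def calico_address_mismatch(node):
    """Compare the address Calico registered with the node's InternalIP.

    `node` is the parsed output of `kubectl get node <name> -o json`.
    Returns (calico_ip, internal_ip) when they differ, else None.
    """
    annotations = node["metadata"].get("annotations", {})
    calico_ip = annotations.get(ANNOTATION, "").split("/")[0]
    internal_ip = next(
        (a["address"] for a in node["status"]["addresses"] if a["type"] == "InternalIP"),
        "",
    )
    return (calico_ip, internal_ip) if calico_ip != internal_ip else None


# Hand-written sample; the /24 prefix length is an assumption.
SAMPLE_NODE = json.loads("""
{
  "metadata": {"name": "xy-5-server14",
               "annotations": {"projectcalico.org/IPv4Address": "10.4.0.1/24"}},
  "status": {"addresses": [{"type": "InternalIP", "address": "192.168.5.14"},
                           {"type": "Hostname", "address": "xy-5-server14"}]}
}
""")

print(calico_address_mismatch(SAMPLE_NODE))
```

A mismatch like this suggests that address autodetection chose an interface other than the one the cluster actually uses for node traffic.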
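Fixing the problem
The official Calico documentation describes the relevant setting: docs.tigera.io/calico/late…
Then I located the operator's Installation resource in the cluster and edited it: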
[root@xy-5-server17 ~]# kubectl get Installation
NAME AGE
default 155d
[root@xy-5-server17 ~]# kubectl edit Installation default
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  creationTimestamp: "2023-02-13T09:18:18Z"
  finalizers:
  - tigera.io/operator-cleanup
  generation: 3
  name: default
  resourceVersion: "151883088"
  uid: 580c6998-4b1e-4616-8c0b-7a3fc4adf553
spec:
  calicoNetwork:
    bgp: Enabled
    hostPorts: Enabled
    ipPools:
    - blockSize: 26
      cidr: 10.244.0.0/16
      disableBGPExport: false
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()
    linuxDataplane: Iptables
    multiInterfaceMode: None
    nodeAddressAutodetectionV4:
      interface: ens4f1
  cni:
    ipam:
      type: Calico
    type: Calico
  controlPlaneReplicas: 2
  flexVolumePath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
  kubeletVolumePluginPath: /var/lib/kubelet
  nodeUpdateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  nonPrivileged: Disabled
  variant: Calico
status:
  computed:
    calicoNetwork:
      bgp: Enabled
      hostPorts: Enabled
      ipPools:
      - blockSize: 26
        cidr: 10.244.0.0/16
        disableBGPExport: false
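The part to change is this block under spec.calicoNetwork:
    nodeAddressAutodetectionV4:
      interface: ens4f1
Set it as the documentation describes, with interface naming your own network interface (ens4f1 on my nodes).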
After that, the calico-node pod on the master node became healthy, the DNS pod's IP could be pinged, and the DNS service recovered. Problem solved.
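The same change can also be made without an interactive editor. The sketch below builds a JSON merge patch for the Installation resource and the matching `kubectl patch` argument list. It only prints the command and executes nothing; the function names are hypothetical, and `ens4f1` and the resource name `default` are the values from this post.

```python
import json
import shlex


def installation_patch(interface):
    """Build a JSON merge patch that pins Calico's IPv4 autodetection to one interface."""
    return {
        "spec": {
            "calicoNetwork": {
                "nodeAddressAutodetectionV4": {"interface": interface}
            }
        }
    }


def patch_command(interface, name="default"):
    """Return the kubectl argv that would apply the patch; nothing is executed here."""
    return [
        "kubectl", "patch", "installation", name,
        "--type=merge", "-p", json.dumps(installation_patch(interface)),
    ]


# Print the command for review before running it by hand.
print(shlex.join(patch_command("ens4f1")))
```

If the resource already sets a different autodetection method under nodeAddressAutodetectionV4 (for example firstFound), that key would probably need to be set to null in the same patch so that only one method remains; check the current spec first.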