Version Information
- KubeSphere: v3.1.1
- k8s: v1.20.6
- OS: CentOS Linux release 7.9.2009 (Core)
Symptoms
- Unable to log in to the KubeSphere console.
- The ks-apiserver pods are in CrashLoopBackOff:
kubesphere-system ks-apiserver-64f5ffb787-5jpxx 0/1 CrashLoopBackOff 7 12m
kubesphere-system ks-apiserver-64f5ffb787-6kp2m 0/1 CrashLoopBackOff 7 12m
kubesphere-system ks-apiserver-64f5ffb787-vg5h9 0/1 CrashLoopBackOff 7 12m
ks-apiserver events & logs
$ kubectl describe -n kubesphere-system po ks-apiserver-64f5ffb787-vg5h9
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m4s default-scheduler Successfully assigned kubesphere-system/ks-apiserver-64f5ffb787-vg5h9 to master1
Normal Pulling 2m3s kubelet Pulling image "registry.cn-beijing.aliyuncs.com/kubesphereio/ks-apiserver:v3.1.1"
Normal Pulled 99s kubelet Successfully pulled image "registry.cn-beijing.aliyuncs.com/kubesphereio/ks-apiserver:v3.1.1" in 23.839356408s
Normal Created 44s (x4 over 99s) kubelet Created container ks-apiserver
Normal Started 44s (x4 over 99s) kubelet Started container ks-apiserver
Warning BackOff 15s (x11 over 97s) kubelet Back-off restarting failed container
Normal Pulled 1s (x4 over 98s) kubelet Container image "registry.cn-beijing.aliyuncs.com/kubesphereio/ks-apiserver:v3.1.1" already present on machine
$ kubectl logs ks-apiserver-64f5ffb787-vg5h9 -n kubesphere-system
W1013 09:42:09.264572 1 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
W1013 09:42:09.266720 1 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
Error: failed to connect to redis service, please check redis status, error: EOF
2023/10/13 09:42:09 failed to connect to redis service, please check redis status, error: EOF
Root Cause Investigation
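Before poking at redis itself, it is worth confirming which endpoint ks-apiserver is actually told to dial. A hedged check, assuming the v3.1.x layout where these settings live in the kubesphere-config ConfigMap (key names may differ between releases):
$ kubectl -n kubesphere-system get cm kubesphere-config -o yaml | grep -i -A 3 redis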
redis-ha status
kubesphere-system redis-ha-haproxy-7cdc76dff9-gbh5b 1/1 Running 85 171d
kubesphere-system redis-ha-haproxy-7cdc76dff9-wc86v 1/1 Running 0 34h
kubesphere-system redis-ha-haproxy-7cdc76dff9-xw28x 1/1 Running 18 158d
kubesphere-system redis-ha-server-0 2/2 Running 29 171d
kubesphere-system redis-ha-server-1 2/2 Running 0 9m58s
kubesphere-system redis-ha-server-2 2/2 Running 0 9m53s
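The restart counts (85, 29, 18) hint at earlier instability, so the Running status alone proves little. It is worth asking sentinel directly which address it currently believes is the master. A sketch, assuming the upstream redis-ha chart's default master-group name mymaster (verify it against your chart values; the sentinel container name comes from the error output further below):
$ for i in 0 1 2; do kubectl -n kubesphere-system exec redis-ha-server-$i -c sentinel -- redis-cli -p 26379 sentinel get-master-addr-by-name mymaster; done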
redis-ha logs
$ kubectl -n kubesphere-system logs -l app=redis-ha-haproxy
[WARNING] 285/000146 (8) : Server check_if_redis_is_master_2/R2 is DOWN, reason: Layer4 connection problem, info: "Connection refused at step 1 of tcp-check (connect)", check duration: 2ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 285/000146 (8) : Server bk_redis_master/R2 is DOWN, reason: Layer4 connection problem, info: "Connection refused at step 1 of tcp-check (connect)", check duration: 2ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 285/000147 (8) : Server check_if_redis_is_master_0/R0 is DOWN, reason: Layer7 timeout, info: " at step 5 of tcp-check (expect string '10.233.62.231')", check duration: 1000ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 285/000147 (8) : backend 'check_if_redis_is_master_0' has no server available!
[WARNING] 285/000147 (8) : Server check_if_redis_is_master_1/R0 is DOWN, reason: Layer7 timeout, info: " at step 5 of tcp-check (expect string '10.233.30.237')", check duration: 1000ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 285/000147 (8) : backend 'check_if_redis_is_master_1' has no server available!
[WARNING] 285/000147 (8) : Server bk_redis_master/R0 is DOWN, reason: Layer7 timeout, info: " at step 5 of tcp-check (expect string 'role:master')", check duration: 1000ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 285/000147 (8) : backend 'bk_redis_master' has no server available!
[WARNING] 285/013414 (8) : Server check_if_redis_is_master_0/R1 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 1ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 285/013423 (8) : Server check_if_redis_is_master_0/R2 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 1ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[ALERT] 284/235728 (8) : backend 'check_if_redis_is_master_2' has no server available!
[WARNING] 284/235905 (8) : Server check_if_redis_is_master_2/R0 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 1ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 284/235915 (8) : Server check_if_redis_is_master_2/R0 is DOWN, reason: Layer7 timeout, info: " at step 2 of tcp-check (send)", check duration: 1001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 284/235915 (8) : backend 'check_if_redis_is_master_2' has no server available!
[WARNING] 285/000013 (8) : Server check_if_redis_is_master_2/R0 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 1ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 285/000023 (8) : Server check_if_redis_is_master_2/R0 is DOWN, reason: Layer7 timeout, info: " at step 2 of tcp-check (send)", check duration: 1000ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 285/000023 (8) : backend 'check_if_redis_is_master_2' has no server available!
[WARNING] 285/000133 (8) : Server check_if_redis_is_master_2/R0 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 1ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 285/013414 (8) : Server check_if_redis_is_master_0/R1 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 1ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 285/013422 (8) : Server check_if_redis_is_master_0/R2 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 2ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[ALERT] 284/235728 (8) : backend 'check_if_redis_is_master_2' has no server available!
[WARNING] 284/235905 (8) : Server check_if_redis_is_master_2/R0 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 2ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 284/235915 (8) : Server check_if_redis_is_master_2/R0 is DOWN, reason: Layer7 timeout, info: " at step 2 of tcp-check (send)", check duration: 1001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 284/235915 (8) : backend 'check_if_redis_is_master_2' has no server available!
[WARNING] 285/000013 (8) : Server check_if_redis_is_master_2/R0 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 1ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 285/000023 (8) : Server check_if_redis_is_master_2/R0 is DOWN, reason: Layer7 timeout, info: " at step 2 of tcp-check (send)", check duration: 1001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 285/000023 (8) : backend 'check_if_redis_is_master_2' has no server available!
[WARNING] 285/000133 (8) : Server check_if_redis_is_master_2/R0 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 2ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 285/013414 (8) : Server check_if_redis_is_master_0/R1 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 1ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 285/013423 (8) : Server check_if_redis_is_master_0/R2 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 1ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
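These logs are telling: haproxy's tcp-check probes each redis for the string 'role:master', and once every backend fails that check, bk_redis_master is left with no server. haproxy will still accept TCP connections on its frontend and then close them, which is exactly the EOF that ks-apiserver reported. The check rules can be read straight out of the running pod; a sketch, assuming the haproxy image's default config path:
$ kubectl -n kubesphere-system exec `kubectl -n kubesphere-system get po -l app=redis-ha-haproxy -o jsonpath="{.items[0].metadata.name}"` -- cat /usr/local/etc/haproxy/haproxy.cfg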
$ kubectl -n kubesphere-system exec -it redis-ha-server-0 redis-cli info replication
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulting container name to redis.
Use 'kubectl describe pod/redis-ha-server-0 -n kubesphere-system' to see all of the containers in this pod.
# Replication
role:slave
master_host:10.233.60.98
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
slave_repl_offset:1698241
master_link_down_since_seconds:1697161808
slave_priority:100
slave_read_only:1
connected_slaves:0
min_slaves_good_slaves:0
master_replid:8321092349f590a2cc6603e90bb214fb2cfdc74f
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:1698241
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
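role:slave together with master_link_status:down is the real red flag: this replica is still chasing a master at 10.233.60.98, an address that (as the nc probe below shows) belongs to none of the current redis-ha-server pods. And if all three members report role:slave, there is no master at all, which matches haproxy's "backend 'bk_redis_master' has no server available". A quick loop to compare all three members (the redis container name is taken from the warning above):
$ for i in 0 1 2; do kubectl -n kubesphere-system exec redis-ha-server-$i -c redis -- redis-cli info replication | grep -E 'role:|master_host|master_link_status'; done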
$ kubectl -n kubesphere-system exec -it redis-ha-server-0 -- sh -c 'for i in `seq 0 2`; do nc -vz redis-ha-server-$i.redis-ha.kubesphere-system.svc 6379; done'
Defaulting container name to redis.
Use 'kubectl describe pod/redis-ha-server-0 -n kubesphere-system' to see all of the containers in this pod.
redis-ha-server-0.redis-ha.kubesphere-system.svc (10.233.99.40:6379) open
redis-ha-server-1.redis-ha.kubesphere-system.svc (10.233.98.59:6379) open
redis-ha-server-2.redis-ha.kubesphere-system.svc (10.233.97.45:6379) open
[root@master1 ~]# kubectl -n kubesphere-system logs -l app=redis-ha
error: a container name must be specified for pod redis-ha-server-0, choose one of: [redis sentinel] or one of the init containers: [config-init]
[root@master1 ~]# kubectl -n kubesphere-system logs -l app=redis-ha-server-0
[root@master1 ~]# kubectl -n kubesphere-controls-system exec -it `kubectl -n kubesphere-controls-system get po -l kubesphere.io/username=admin -o jsonpath="{.items[0].metadata.name}"` -- sh -c 'nc -vz redis.kubesphere-system.svc:6379'
redis.kubesphere-system.svc:6379 (10.233.6.110:6379) open
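Note that an open TCP port proves very little here: haproxy's frontend accepts the connection even while bk_redis_master is empty, then drops it, which a redis client sees as EOF. A protocol-level probe tells the two cases apart; a healthy chain answers PONG, an empty backend closes the connection:
$ kubectl -n kubesphere-system exec redis-ha-server-0 -c redis -- redis-cli -h redis.kubesphere-system.svc -p 6379 ping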
redis-ha intervention
Up to this point redis-ha looked perfectly normal, yet ks-apiserver still seemed unable to reach it, so I figured I'd just restart redis-ha and see:
$ kubectl delete pods -n kubesphere-system -l app=redis-ha
Turned out that was no use at all.
ks-apiserver intervention
$ kubectl delete pods -n kubesphere-system -l app=ks-apiserver
Again, no use at all.
kubelet intervention
At this point I was ready to dig into the kubelet logs, but with the boss pressing hard I simply went straight for a restart first:
# Run on every master node
$ systemctl restart kubelet.service
After that, two of the three ks-apiserver pods had recovered and the third was still on its way, so I restarted ks-apiserver once more, and everything came back completely.
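For reference, the whole recovery condensed into commands (a sketch; a rollout restart of the Deployment is equivalent to deleting the pods by label):
# Run on every master node
$ systemctl restart kubelet.service
# Then bounce ks-apiserver once more and watch it come back
$ kubectl -n kubesphere-system rollout restart deployment ks-apiserver
$ kubectl -n kubesphere-system get po -l app=ks-apiserver -w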
ks-apiserver in newer versions
A look at the newer KubeSphere v3.4.0 shows it no longer depends on redis-ha at all, so presumably this was a bug.
$ kubectl -n kubesphere-system get po
NAME READY STATUS RESTARTS AGE
ks-apiserver-bc897dfdf-l59k7 1/1 Running 0 14d
ks-console-6c64b9cb79-m7n2z 1/1 Running 1 (18d ago) 27d
ks-controller-manager-8f5855c98-kd7kg 1/1 Running 4 (18d ago) 27d
ks-installer-5b59ccbf49-6d745 1/1 Running 1 (18d ago) 27d
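The absence of redis pods above can be cross-checked against the configuration. A hedged check (the cache backend became selectable in later releases, and the exact key names vary):
$ kubectl -n kubesphere-system get cm kubesphere-config -o jsonpath='{.data.kubesphere\.yaml}' | grep -i -A 3 cache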
Reviewing the kubelet logs
Here I just fed the logs to a VS Code plugin for analysis.
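For the record, kubelet runs under systemd on CentOS 7, so the raw log file fed to the plugin can be exported with journalctl (adjust the time window to the incident):
$ journalctl -u kubelet.service --since "2023-10-13 09:00:00" --no-pager > kubelet.logs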
GitHub Copilot: Based on the entries in kubelet.logs, the following events can be seen:
1. Reflector ListAndWatch failed: the Secret kubesphere-token-gbwkf could not be fetched, with the error "an error on the server ("") has prevented the request from succeeding (get secrets)".
2. Events could not be posted, with the error "Post "https://lb.kubesphere.local:6443/api/v1/namespaces/kube-system/events": read tcp xxxxxx:58772->xxxxxx:6443: use of closed network connection".
3. Lease renewal failed, with the error "Operation cannot be fulfilled on leases.coordination.k8s.io "master1": the object has been modified; please apply your changes to the latest version and try again".
4. The kubelet service stopped.
5. The kubelet service started, but with some warnings, e.g. that certain GA features will be removed in future releases.
None of this points to a definite root cause, but the kubelet clearly ran into problems around its stop and start, possibly due to network or configuration issues. Further inspection of the kubelet configuration and the network environment is advised.
Final Summary
1. No clear cause was identified. Most likely a cluster network failure broke redis-ha; after redis-ha recovered on its own, ks-apiserver could not recover automatically, and a manual restart of kubelet and ks-apiserver was needed to fix it.
2. No obvious anomaly showed up in the logs, or maybe I just wasn't looking the right way; this was solved purely on experience and luck. If anyone has a better approach, please share it so we can all learn from it.
3. As for why such an old version is still in use: legacy baggage, if you know, you know. A migration should happen eventually, but when? Who knows.