Stateless applications usually run under a Deployment controller. When the manifest changes, the Deployment rolls the new configuration out to the Pods.
Assume everything is fine: the image is pulled successfully and the rollout proceeds with the default 25% rolling-update strategy until it completes.
kubectl apply -f test1.yaml
But does all of this really go as planned, with no problems at all?
kubectl only pushes the manifest to the Kubernetes API server; as long as the manifest has no syntax errors or conflicts, the command returns exit code 0 and the apply is reported as successful.
The rollout itself, however, is full of uncertainty: a non-existent image, insufficient resources for scheduling, or a broken configuration can all derail it, and catching these failures is one of the key concerns.
This is not just a simple observability problem. A Pod is not serving-ready merely because its image has been pulled and the container is Running; once the Pod is considered ready it starts receiving traffic, and an application needs time to warm up, so if traffic arrives before it is ready, errors are inevitable. To solve this, a readiness probe or a startup probe has to be configured.
Before a Pod is actually marked Ready, it normally goes through a readiness probe or a startup probe. In earlier posts I wrote about why readiness and liveness probes matter, and the readiness probe itself introduces an initialization delay.
Now suppose the manifest changes and the controller starts rolling out the update. Assume the readiness initialization time is 100 seconds, there are 30 Pods, at least 75% of them must stay available at any time, and the rollout proceeds in the default 25% steps (see Updating a Deployment). With 25% of the Pods replaced per wave (about 7 or 8 of the 30), the rollout takes roughly 1 / 25% = 4 waves, so the total wait is at least
4 × 100 s (readiness probe time) ≈ 400 seconds.
The more Pods there are, the longer it takes for all of them to become ready, and if this runs inside a CD pipeline the feedback loop stretches accordingly. In a heavyweight cluster, every full watch or list against the API server is expensive and the whole cluster can suffer noticeable latency from it; polling the API from outside at an interval costs far less than watching in real time.
If there are only a handful of Pods, none of this is worth worrying about: kubectl rollout is enough, as in the example below.
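For example, a deploy step in a pipeline can simply chain the two commands, so the step only succeeds once the rollout itself has finished (this assumes test1.yaml defines deployment/testv1 in the default namespace, as in the examples further down):
kubectl apply -f test1.yaml && kubectl -n default rollout status deployment/testv1 --watch --timeout=10m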
There are several ways to observe whether a manifest pushed to the cluster has actually rolled out.
Typically you watch the rollout with kubectl rollout, helm's --wait, or a control plane such as Argo CD;
alternatively, a bit of third-party code can poll the status at an interval instead of holding a watch open.
rollout
The rollout page in the Kubernetes documentation describes how to check rollout status.
rollout can manage resource types such as Deployments, DaemonSets, and StatefulSets.
Following the status watch described in rolling-back-a-deployment, the command looks like this:
kubectl -n NAMESPACE rollout status deployment NAME --watch --timeout=Xm
Other rollout subcommands:
history
List the rollout history of deployment/testv1:
kubectl -n default rollout history deployment/testv1
Show the details of revision 1:
kubectl -n default rollout history deployment/testv1 --revision=1
pause
Pause the rollout; while it is paused, changes to the Deployment will not trigger a new rollout:
kubectl rollout pause deployment/testv1
To continue, the rollout has to be resumed (or restarted).
resume
Resume; once resumed, the previously paused rollout continues:
kubectl rollout resume deployment/testv1
status
Use status to watch the state of the rollout as it progresses:
kubectl rollout status deployment/testv1
status provides a few flags, the most commonly used being the timeout.
For example, --timeout=10m waits for at most 10 minutes; after that the command fails with a timeout.
kubectl rollout status deployment/testv1 --watch --timeout=10m
The other flags are listed below; an example follows the table.
Name | Shorthand | Default | Usage |
---|---|---|---|
filename | f | [] | Filename, directory, or URL to files identifying the resource to get from a server. |
kustomize | k |  | Process the kustomization directory. This flag can't be used together with -f or -R. |
recursive | R | false | Process the directory used in -f, --filename recursively. Useful when you want to manage related manifests organized within the same directory. |
revision |  | 0 | Pin to a specific revision for showing its status. Defaults to 0 (last revision). |
timeout |  | 0s | The length of time to wait before ending watch, zero means never. Any other values should contain a corresponding time unit (e.g. 1s, 2m, 3h). |
watch | w | true | Watch the status of the rollout until it's done. |
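For instance, to check the status of one specific revision rather than the latest (the revision number here is illustrative):
kubectl rollout status deployment/testv1 --revision=3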
undo
Roll back to the previous revision:
kubectl rollout undo deployment/testv1
Roll back to a specific revision:
1. List the existing revisions:
# kubectl -n default rollout history deployment/testv1
deployment.apps/testv1
REVISION CHANGE-CAUSE
1 <none>
2 <none>
3 <none>
View the revision information:
# kubectl rollout history deployment/testv1
deployment.apps/testv1
REVISION CHANGE-CAUSE
2 <none>
7 <none>
8 <none>
2. Roll back to revision 2:
# kubectl rollout undo deployment/testv1 --to-revision=2
deployment.apps/testv1 rolled back
helm
--wait
if set, will wait until all Pods, PVCs, Services, and minimum number of Pods of a Deployment, StatefulSet, or ReplicaSet are in a ready state before marking the release as successful. It will wait for as long as --timeout
The command only returns once the controller's Pods are ready, and it waits for at most --timeout, which defaults to 5m0s.
It looks something like this:
helm upgrade --install --namespace NAMESPACE --create-namespace --wait APP FILE
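If the default window is too short, --timeout can be raised explicitly, and --atomic additionally rolls the release back when the wait fails (values illustrative):
helm upgrade --install --namespace NAMESPACE --create-namespace --wait --timeout 10m --atomic APP FILE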
API
The two approaches above cover most scenarios, but watch is resource-intensive. If you would rather drive the logic yourself from a script, you can use the client-go package and check the status manually in a polling loop.
- client-go
A hand-rolled polling loop:
package main

import (
    "context"
    "flag"
    "fmt"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    typev1 "k8s.io/client-go/kubernetes/typed/apps/v1"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/util/retry"
    "os"
    "time"
)

type args struct {
    namespace  string
    image      string
    deployment string
}
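// Polling budget: numberOfPoll attempts, pollInterval seconds apart (200 * 3s = 600s in total).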
const (
    numberOfPoll = 200
    pollInterval = 3
)
func parseArgs() *args {
    namespace := flag.String("n", "", "namespace")
    deployment := flag.String("deploy", "", "deployment name")
    image := flag.String("image", "", "image for update")
    flag.Parse()
    var _args args
    if *namespace == "" {
        fmt.Fprintln(os.Stderr, "namespace must be specified")
        os.Exit(1)
    }
    _args.namespace = *namespace
    if *deployment == "" {
        fmt.Fprintln(os.Stderr, "deployment must be specified")
        os.Exit(1)
    }
    _args.deployment = *deployment
    if *image == "" {
        fmt.Fprintln(os.Stderr, "image must be specified")
        os.Exit(1)
    }
    _args.image = *image
    return &_args
}
func main() {
    _args := parseArgs()
    // creates the in-cluster config
    config, err := rest.InClusterConfig()
    if err != nil {
        panic(err.Error())
    }
    // creates the clientset
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err.Error())
    }
    deploymentsClient := clientset.AppsV1().Deployments(_args.namespace)
    ctx := context.Background()
    retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
        // Retrieve the latest version of Deployment before attempting update
        // RetryOnConflict uses exponential backoff to avoid exhausting the apiserver
        result, getErr := deploymentsClient.Get(ctx, _args.deployment, metav1.GetOptions{})
        if getErr != nil {
            fmt.Fprintf(os.Stderr, "Failed to get latest version of Deployment %s: %v\n", _args.deployment, getErr)
            os.Exit(1)
        }
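        // Patch the image of the first container in the Pod template; this assumes the
        // application container is at index 0.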
        result.Spec.Template.Spec.Containers[0].Image = _args.image
        _, updateErr := deploymentsClient.Update(ctx, result, metav1.UpdateOptions{})
        return updateErr
    })
    if retryErr != nil {
        fmt.Fprintf(os.Stderr, "Failed to update image version of %s/%s to %s: %v\n", _args.namespace,
            _args.deployment, _args.image, retryErr)
        os.Exit(1)
    }
    _args.pollDeploy(deploymentsClient)
    fmt.Println("Updated deployment")
}
// Watching wastes resources and can take too long, so poll the status instead.
func (p *args) pollDeploy(deploymentsClient typev1.DeploymentInterface) {
    ctx := context.Background()
    for i := 0; i <= numberOfPoll; i++ {
        time.Sleep(pollInterval * time.Second)
        result, getErr := deploymentsClient.Get(ctx, p.deployment, metav1.GetOptions{})
        if getErr != nil {
            fmt.Fprintf(os.Stderr, "Failed to get latest version of Deployment %s: %v\n", p.deployment, getErr)
            os.Exit(1)
        }
        resourceStatus := result.Status
        fmt.Printf("%s -> replicas: %d, ReadyReplicas: %d, AvailableReplicas: %d, UpdatedReplicas: %d, UnavailableReplicas: %d\n",
            result.Name,
            resourceStatus.Replicas,
            resourceStatus.ReadyReplicas,
            resourceStatus.AvailableReplicas,
            resourceStatus.UpdatedReplicas,
            resourceStatus.UnavailableReplicas)
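        // Treat the rollout as complete once every replica is updated, ready, and available.
        // Note: Status.ObservedGeneration is not checked here, so right after the update the
        // loop may briefly read a status the controller has not refreshed yet.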
        if resourceStatus.Replicas == resourceStatus.ReadyReplicas &&
            resourceStatus.ReadyReplicas == resourceStatus.AvailableReplicas &&
            resourceStatus.AvailableReplicas == resourceStatus.UpdatedReplicas {
            return
        }
    }
    fmt.Fprintf(os.Stderr, "The application did not become ready within %d seconds; treating the rollout as failed, please check the logs.\n", numberOfPoll*pollInterval)
    os.Exit(1)
}
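The sketch above assumes it runs inside the cluster (rest.InClusterConfig). To run the same polling logic from a workstation or a CI runner outside the cluster, the client could be built from a kubeconfig file instead; a minimal sketch, with the kubeconfig path as an illustrative value:

package main

import (
    "fmt"

    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Build the client configuration from a kubeconfig file instead of the in-cluster config.
    config, err := clientcmd.BuildConfigFromFlags("", "/home/user/.kube/config")
    if err != nil {
        panic(err.Error())
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err.Error())
    }
    fmt.Println("connected to", config.Host)
    _ = clientset // the image update and pollDeploy logic above would follow from here
}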
Other references
- kubernetes-deployment-status-in-jenkins
- Kubernetes探针补充
- Kubernetes Liveness 和 Readiness 探测避免给自己挖坑
- 续集:重新审视kubernetes活跃探针和就绪探针 如何避免给自己挖坑2
- Kubernetes Startup Probes避免给自己挖坑3