Since the earlier articles on monitoring Kubernetes with Prometheus and on Grafana were written, parts of them have gone out of date. In addition, the earlier setup used NFS for pod persistence; this article switches to Ceph for persistence, and the component versions have been brought up to date as well.
Environment
The environment for this article is as follows:
| Component  | Version          |
|------------|------------------|
| Kubernetes | 1.23.5           |
| Ceph       | nautilus 14.2.22 |
[root@k8s-01 ~]# kubectl get pod -n ceph
NAME                                         READY   STATUS    RESTARTS       AGE
csi-rbdplugin-dzmdk                          3/3     Running   3 (15h ago)    3d1h
csi-rbdplugin-h8j2v                          3/3     Running   3 (111s ago)   3d1h
csi-rbdplugin-h9k95                          3/3     Running   3 (15h ago)    3d1h
csi-rbdplugin-hvq58                          3/3     Running   3 (22h ago)    3d1h
csi-rbdplugin-ngktg                          3/3     Running   3 (15h ago)    3d1h
csi-rbdplugin-provisioner-7557ccdb7c-4qczh   7/7     Running   7 (22h ago)    3d1h
csi-rbdplugin-provisioner-7557ccdb7c-5tx4r   7/7     Running   9 (15h ago)    3d1h
csi-rbdplugin-provisioner-7557ccdb7c-t8877   7/7     Running   0              15h

[root@k8s-01 ~]# kubectl get sc
NAME         PROVISIONER        RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
csi-rbd-sc   rbd.csi.ceph.com   Delete          Immediate           true                   3d1h

[root@ceph-01 ~]# ceph -s
  cluster:
    id:     c8ae7537-8693-40df-8943-733f82049642
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03 (age 15h)
    mgr: ceph-02(active, since 15h), standbys: ceph-03, ceph-01
    mds: cephfs-abcdocker:1 cephfs:1 {cephfs-abcdocker:0=ceph-02=up:active,cephfs:0=ceph-03=up:active} 1 up:standby
    osd: 4 osds: 4 up (since 15h), 4 in (since 3w)
    rgw: 2 daemons active (ceph-01, ceph-02)

  task status:

  data:
    pools:   13 pools, 656 pgs
    objects: 3.25k objects, 11 GiB
    usage:   36 GiB used, 144 GiB / 180 GiB avail
    pgs:     656 active+clean
Prometheus
Installing Prometheus on Kubernetes was covered in earlier articles, so it is not described in detail here. I use Ceph RBD for persistent storage, and the StorageClass has already been created.
Earlier posts for reference:

- Prometheus 持久化安装 (Persistent Prometheus installation)
- Prometheus监控Kubernetes 集群节点及应用 (Monitoring Kubernetes cluster nodes and applications with Prometheus)
Create the PVC for persistent storage
[root@k8s-01 prometheus]# kubectl create ns prometheus
namespace/prometheus created

cat >prometheus-pvc.yaml<<EOF
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: prometheus-data-db     # PVC name; changing it is not recommended
  namespace: prometheus        # must match the namespace created above
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi            # requested PVC size
  storageClassName: csi-rbd-sc # my StorageClass; adjust to your environment
EOF

# Create the PVC
[root@k8s-01 prometheus]# kubectl apply -f prometheus-pvc.yaml
persistentvolumeclaim/prometheus-data-db created

# Check the PVC status
[root@k8s-01 prometheus]# kubectl get pvc -n prometheus
NAME                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
prometheus-data-db   Bound    pvc-4c5cb159-2f59-44d8-840d-6ee6ceeb1f4c   50Gi       RWO            csi-rbd-sc     8s
Install Prometheus
Prometheus itself was introduced in earlier articles; here I only swap the NFS storage backend for Ceph, and the Deployment manifest has been updated over time.
First, configure the ConfigMap:
[root@k8s-01 prometheus]# cat prometheus.configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: prometheus
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s

    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']

    - job_name: 'kubernetes-node'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace

    - job_name: 'kubernetes-cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name

    - job_name: kubernetes-nodes-cadvisor
      scrape_interval: 10s
      scrape_timeout: 10s
      scheme: https   # remove if you want to scrape metrics on the insecure port
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      metric_relabel_configs:
      - action: replace
        source_labels: [id]
        regex: '^/machine\.slice/machine-rkt\\x2d([^\\]+)\.+/([^/]+)\.service$'
        target_label: rkt_container_name
        replacement: '${2}-${1}'
      - action: replace
        source_labels: [id]
        regex: '^/system\.slice/(.+)\.service$'
        target_label: systemd_service_name
        replacement: '${1}'

    - job_name: kube-state-metrics
      static_configs:
      - targets: ['kube-state-metrics.prometheus.svc.cluster.local:8080']   # Service address; if kube-state-metrics runs in a different namespace, replace "prometheus" accordingly
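Before applying the ConfigMap, the embedded prometheus.yml can be sanity-checked with promtool, which ships inside the prom/prometheus image. A minimal sketch, assuming a local Docker engine and yq for extracting the data key (the yq invocation below is the Python kislyuk variant; adjust for mikefarah yq, or simply copy the block out by hand):

# Extract the embedded prometheus.yml to a plain file
yq -r '.data."prometheus.yml"' prometheus.configmap.yaml > /tmp/prometheus.yml
# Validate it with promtool from the official image
docker run --rm -v /tmp/prometheus.yml:/tmp/prometheus.yml \
  --entrypoint promtool prom/prometheus:latest check config /tmp/prometheus.yml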
Next, configure the RBAC objects; without them the Prometheus pod has no permission to discover and scrape cluster resources through the API server.
[root@k8s-01 prometheus]# cat prometheus-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: prometheus
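Once applied, kubectl auth can-i can impersonate the ServiceAccount to confirm the ClusterRoleBinding took effect. This check is an extra I add here, not part of the original flow:

# Both should print "yes" once the RBAC objects are applied
kubectl auth can-i list nodes --as=system:serviceaccount:prometheus:prometheus
kubectl auth can-i list pods  --as=system:serviceaccount:prometheus:prometheus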
Create the Prometheus container, running it as a Deployment:
[root@k8s-01 prometheus]# cat prometheus.deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: prometheus
  labels:
    app: prometheus
spec:
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - image: prom/prometheus:latest
        name: prometheus
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=30d"
        - "--web.enable-admin-api"   # enable the admin HTTP API, which includes features such as deleting time series
        - "--web.enable-lifecycle"   # enable hot reload: a POST to localhost:9090/-/reload takes effect immediately
        ports:
        - containerPort: 9090
          protocol: TCP
          name: http
        volumeMounts:
        - mountPath: "/prometheus"
          subPath: prometheus
          name: data
        - mountPath: "/etc/prometheus"
          name: config-volume
        resources:
          requests:
            cpu: 100m
            memory: 512Mi
          limits:
            cpu: 100m
            memory: 512Mi
      securityContext:
        runAsUser: 0
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: prometheus-data-db
      - configMap:
          name: prometheus-config
        name: config-volume
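Because --web.enable-lifecycle is set, configuration changes can be applied without restarting the pod. A sketch using kubectl port-forward (the kubelet syncs ConfigMap volumes with some delay, so wait a moment between the apply and the reload):

# Re-apply the changed ConfigMap
kubectl apply -f prometheus.configmap.yaml
# Forward the Prometheus port locally, then trigger the reload endpoint
kubectl -n prometheus port-forward deploy/prometheus 9090:9090 &
curl -X POST http://127.0.0.1:9090/-/reload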
Next, we also need to create a Service for Prometheus. Because Grafana and Prometheus will be exposed separately later on, I use NodePort here; with an Ingress in place, a ClusterIP Service works as well.
[root@k8s-01 prometheus]# cat prometheus-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: prometheus
  labels:
    app: prometheus
spec:
  selector:
    app: prometheus
  type: NodePort
  ports:
  - name: web
    port: 9090
    targetPort: http
Next, deploy node-exporter on the Kubernetes nodes:
[root@k8s-01 prometheus]# cat prometheus-node.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: prometheus
  labels:
    name: node-exporter
spec:
  selector:
    matchLabels:
      name: node-exporter
  template:
    metadata:
      labels:
        name: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
      - name: node-exporter
        image: prom/node-exporter:v0.16.0
        ports:
        - containerPort: 9100
        resources:
          requests:
            cpu: 0.15
        securityContext:
          privileged: true
        args:
        - --path.procfs
        - /host/proc
        - --path.sysfs
        - /host/sys
        - --collector.filesystem.ignored-mount-points
        - '^/(sys|proc|dev|host|etc)($|/)'
        volumeMounts:
        - name: dev
          mountPath: /host/dev
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
        - name: rootfs
          mountPath: /rootfs
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: dev
        hostPath:
          path: /dev
      - name: sys
        hostPath:
          path: /sys
      - name: rootfs
        hostPath:
          path: /
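Because the DaemonSet runs with hostNetwork: true, every node exposes node-exporter directly on port 9100, which allows a quick sanity check from any node once the pods are up:

# Run on any node (or point at a node IP)
curl -s http://localhost:9100/metrics | head -n 5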
Once all the manifests are in place, apply them in one go with kubectl apply -f .:
[root@k8s-01 prometheus]# kubectl apply -f .
daemonset.apps/node-exporter created
persistentvolumeclaim/prometheus-data-db created
serviceaccount/prometheus created
clusterrole.rbac.authorization.k8s.io/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created
service/prometheus created
configmap/prometheus-config created
deployment.apps/prometheus created
Check that the pods are running:
[root@k8s-01 prometheus]# kubectl get all -n prometheus
NAME                              READY   STATUS    RESTARTS   AGE
pod/node-exporter-5jdhv           1/1     Running   0          16s
pod/node-exporter-5pqth           1/1     Running   0          16s
pod/node-exporter-fhj49           1/1     Running   0          17s
pod/node-exporter-g7kc9           1/1     Running   0          16s
pod/node-exporter-v5znf           1/1     Running   0          16s
pod/prometheus-79f49d55b5-6qgc8   1/1     Running   0          13s

NAME                 TYPE       CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
service/prometheus   NodePort   10.99.37.99   <none>        9090:30353/TCP   15s

NAME                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/node-exporter   5         5         5       5            5           <none>          17s

NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/prometheus   1/1     1            1           13s

NAME                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/prometheus-79f49d55b5   1         1         1       13s
Access Prometheus to verify the deployment.
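The Service output above shows NodePort 30353, so the web UI is reachable at http://&lt;node-ip&gt;:30353 (substitute one of your node addresses). The targets API also gives a quick scripted check of scrape health:

# <node-ip> is a placeholder for any node address; 30353 is the NodePort above
curl -s http://<node-ip>:30353/api/v1/targets | head -c 300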
kube-state-metrics
Introduction to kube-state-metrics
kube-state-metrics listens to the API server and generates state metrics for resource objects such as Deployments, Nodes, and Pods. Note that kube-state-metrics only exposes these metrics; it does not store them, so we use Prometheus to scrape and store the data. Its focus is business-level metadata about Deployments, Pods, replica status, and so on: how many replicas were scheduled? How many are currently available? How many Pods are running/stopped/terminated? How many times has a Pod restarted? How many jobs are running? A few example queries are sketched below.
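To make this concrete, here are a few queries against metrics that kube-state-metrics exposes, issued through the Prometheus HTTP API once the Deployment below is running (the NodePort 30353 is taken from the earlier Service output; substitute your own address and port):

# How many replicas of each Deployment are available?
curl -s -G 'http://<node-ip>:30353/api/v1/query' \
  --data-urlencode 'query=kube_deployment_status_replicas_available'
# How many pods are in each phase (Running/Pending/Failed/...)?
curl -s -G 'http://<node-ip>:30353/api/v1/query' \
  --data-urlencode 'query=sum(kube_pod_status_phase) by (phase)'
# Which containers restart the most?
curl -s -G 'http://<node-ip>:30353/api/v1/query' \
  --data-urlencode 'query=topk(10, kube_pod_container_status_restarts_total)'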
The YAML is as follows (the namespace used here is prometheus):
[root@k8s-01 kube-metrics]# cat kube-metrics-deployment.yaml
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.4.2
  name: kube-state-metrics
  namespace: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.4.2
  name: kube-state-metrics
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - create
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - create
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - list
  - watch
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - list
  - watch
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  - volumeattachments
  verbs:
  - list
  - watch
- apiGroups:
  - admissionregistration.k8s.io
  resources:
  - mutatingwebhookconfigurations
  - validatingwebhookconfigurations
  verbs:
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  - ingresses
  verbs:
  - list
  - watch
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.4.2
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: prometheus
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: "true"   # let Prometheus scrape this Service; without this annotation the kubernetes-service-endpoints job will not pick up these metrics
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.4.2
  name: kube-state-metrics
  namespace: prometheus
spec:
  # clusterIP: None   # left unset so the Service is reachable through its ClusterIP
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
  - name: telemetry
    port: 8081
    targetPort: telemetry
  selector:
    app.kubernetes.io/name: kube-state-metrics
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.4.2
  name: kube-state-metrics
  namespace: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: 2.4.2
    spec:
      nodeName: k8s-01   # pin the pod to node k8s-01
      tolerations:       # tolerate the master taint so the pod can run on a master node
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
      automountServiceAccountToken: true
      containers:
      # - image: k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.4.2
      - image: anjia0532/google-containers.kube-state-metrics.kube-state-metrics:v2.4.2
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
        - containerPort: 8081
          name: telemetry
        readinessProbe:
          httpGet:
            path: /
            port: 8081
          initialDelaySeconds: 5
          timeoutSeconds: 5
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsUser: 65534
      serviceAccountName: kube-state-metrics
Once everything is created, check the pod status:
[root@k8s-01 kube-metrics]# kubectl get pod -n prometheus
NAME                                  READY   STATUS    RESTARTS   AGE
grafana-5964564844-497ks              1/1     Running   0          53m
kube-state-metrics-777847887b-4d49s   1/1     Running   0          4m33s
node-exporter-5jdhv                   1/1     Running   0          179m
node-exporter-5pqth                   1/1     Running   0          179m
node-exporter-fhj49                   1/1     Running   0          179m
node-exporter-g7kc9                   1/1     Running   0          179m
node-exporter-v5znf                   1/1     Running   0          179m
prometheus-79f49d55b5-dx9kv           1/1     Running   0          3m23s
Check that the Service returns metrics data.
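A quick way to verify, sketched with kubectl port-forward (any in-cluster pod with curl works just as well):

[root@k8s-01 kube-metrics]# kubectl -n prometheus port-forward svc/kube-state-metrics 8080:8080 &
[root@k8s-01 kube-metrics]# curl -s http://127.0.0.1:8080/metrics | head -n 5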
Grafana
Next, install Grafana. This again follows the earlier article, except that Grafana is now backed by persistent storage.
Earlier post for reference: Grafana 安装并监控k8s集群 (Installing Grafana and monitoring a Kubernetes cluster)
[root@k8s-01 grafana]# cat grafana-pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: grafana-data
  namespace: prometheus
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: csi-rbd-sc
With the PVC created, create the Deployment and Service:
[root@k8s-01 grafana]# cat grafana-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: prometheus
  labels:
    app: grafana
spec:
  type: NodePort
  ports:
  - port: 3000
  selector:
    app: grafana

# Deployment
[root@k8s-01 grafana]# cat grafana-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: prometheus
  labels:
    app: grafana
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 3000
          name: grafana
        env:
        - name: GF_SECURITY_ADMIN_USER
          value: admin
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: abcdocker
        readinessProbe:
          failureThreshold: 10
          httpGet:
            path: /api/health
            port: 3000
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/health
            port: 3000
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 300m
            memory: 1024Mi
          requests:
            cpu: 300m
            memory: 1024Mi
        volumeMounts:
        - mountPath: /var/lib/grafana
          subPath: grafana
          name: storage
      securityContext:
        fsGroup: 472
        runAsUser: 472
      volumes:
      - name: storage
        persistentVolumeClaim:
          claimName: grafana-data
Apply all the Grafana manifests:
[root@k8s-01 grafana]# kubectl apply -f .
Check Grafana:
[root@k8s-01 grafana]# kubectl get all -n prometheus |grep grafana
pod/grafana-5964564844-497ks              1/1     Running   0          85m
service/grafana      NodePort   10.110.65.126   <none>   3000:31923/TCP   3h5m
deployment.apps/grafana              1/1     1     1      3h12m
replicaset.apps/grafana-5964564844   1       1     1      3h12m
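Grafana is served on the NodePort shown above (31923 here); log in with the admin user and password set through the GF_SECURITY_ADMIN_* environment variables in the Deployment. The health endpoint makes for a quick scripted check:

# <node-ip> is any node address; 31923 is the NodePort from the output above
curl -s http://<node-ip>:31923/api/health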
Configure Grafana
Only a few key parts of the Grafana configuration are covered here.
When adding the Prometheus data source, use http://prometheus.prometheus.svc.cluster.local:9090. The middle prometheus is the namespace, so adjust it if yours differs; the general in-cluster form is &lt;service&gt;.&lt;namespace&gt;.svc.cluster.local:&lt;port&gt;. If this is unfamiliar, review how Kubernetes Service DNS works.
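If in doubt, the Service DNS name can be verified from inside the cluster; a sketch using a throwaway busybox pod (image availability is assumed):

kubectl run dns-test --rm -it --image=busybox:1.35 --restart=Never -- \
  nslookup prometheus.prometheus.svc.cluster.local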
Dashboard: https://grafana.com/grafana/dashboards/15661 (ID: 15661)
Result:
Dashboard: https://grafana.com/grafana/dashboards/6417 (ID: 6417)
For reference, a live copy runs on my site: https://grafana.frps.cn/d/PwMJtdvnz/k8s-for-prometheus-dashboard?orgId=1&from=1656171971730&to=1656173771730 (the data there is not refreshed).
Related articles:
- Kubernetes 1.14 binary cluster installation
- Kubernetes 1.13.5 binary cluster installation
- Kubernetes 1.11 binary cluster installation
- Monitoring Kubernetes components with Prometheus Operator