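The scenario we ran into: the CPUThrottlingHigh alert fires as designed, yet the CPU usage of the container it fires on is not high, and the container may even be idle. This made us question whether the alert is necessary at all.
In many cases this alert ends up modified or silenced, because the application is not latency-sensitive and keeps working even when throttled, and because the alert is cause-based rather than symptom-based. That is why its severity is info. None of this means the alert is a false positive, though, and silencing it only hides the real problem behind it.
The issue is still under discussion. The debate is especially heated in kubernetes-mixin issue 108, with further discussion in kubernetes issue 67577.
The expression is as follows: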
sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace) / sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace) > ( 25 / 100 )
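To see how a container with low average CPU usage can still cross this ratio, here is a minimal Python sketch of our own. It is a simplified model, not the kernel's actual accounting. It assumes the default 100 ms CFS period, a hypothetical limit of 200m (20 ms of quota per period), and a hypothetical workload that needs 50 ms of CPU once per second. The names CfsStats, run_burst, simulate and throttled_ratio are made up for illustration.

```python
from dataclasses import dataclass

PERIOD_MS = 100  # assumed default CFS period (cpu.cfs_period_us = 100000)


@dataclass
class CfsStats:
    periods: int = 0            # stands in for container_cpu_cfs_periods_total
    throttled_periods: int = 0  # stands in for container_cpu_cfs_throttled_periods_total
    cpu_used_ms: int = 0        # CPU time actually consumed


def run_burst(stats: CfsStats, demand_ms: int, quota_ms: int) -> None:
    """Consume one burst of CPU demand under a per-period quota.

    Simplification: only periods in which the container wants to run are
    counted, and a period counts as throttled when the remaining demand
    exceeds the quota for that period.
    """
    remaining = demand_ms
    while remaining > 0:
        stats.periods += 1
        if remaining > quota_ms:
            stats.throttled_periods += 1
        used = min(remaining, quota_ms)
        stats.cpu_used_ms += used
        remaining -= used


def simulate(window_s: int, burst_ms: int, quota_ms: int) -> CfsStats:
    """Run one burst per second for window_s seconds."""
    stats = CfsStats()
    for _ in range(window_s):
        run_burst(stats, burst_ms, quota_ms)
    return stats


def throttled_ratio(stats: CfsStats) -> float:
    """Same shape as the alert expression: throttled periods / all periods."""
    return stats.throttled_periods / stats.periods if stats.periods else 0.0


if __name__ == "__main__":
    # 5-minute window, 50 ms burst per second, 20 ms quota per 100 ms period
    stats = simulate(window_s=300, burst_ms=50, quota_ms=20)
    avg_cores = stats.cpu_used_ms / (300 * 1000)
    print(f"throttled ratio: {throttled_ratio(stats):.3f}")
    print(f"average CPU usage: {avg_cores:.3f} cores")
```

Under these assumptions the ratio comes out at about 0.67, well above 25 / 100, while average usage is 0.05 cores, a quarter of the 200m limit. Each burst needs three periods to finish and is throttled in two of them, which matches what we saw: high throttling on a container that looks mostly idle.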
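So far we have collected several practical ways to handle it:
1. Change the alert threshold ratio, or disable the alert
2. Remove or change the limits on the affected pods
3. Use kernel 4.18 or later
4. Disable Kubernetes CFS quota entirely (kubelet option --cpu-cfs-quota=false)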
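For the kernel option, a quick way to check a node is to compare its kernel release against 4.18. The following is a minimal Python sketch. The helper names parse_kernel_release and kernel_at_least are hypothetical, and it has to run on the node itself. Note that the version number is only a rough indicator, since distributions may backport scheduler fixes to older kernels.

```python
import platform
import re


def parse_kernel_release(release: str) -> tuple:
    """Extract (major, minor) from a release string such as '4.15.0-112-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_at_least(release: str, minimum: tuple = (4, 18)) -> bool:
    """Return True if the release is at or above the minimum (major, minor)."""
    return parse_kernel_release(release) >= minimum


if __name__ == "__main__":
    # platform.release() reports the running kernel, e.g. the output of `uname -r`
    release = platform.release()
    print(release, kernel_at_least(release))
```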
We tried changing the threshold:
kubectl -n monitoring edit PrometheusRule prometheus-k8s-rules
Change the rule to:
- alert: CPUThrottlingHigh
  annotations:
    description: '{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
    runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/cputhrottlinghigh
    summary: Processes experience elevated CPU throttling.
  expr: |
    sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace) / sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace) > ( 75 / 100 )
  for: 15m
  labels:
    severity: info
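To get a feel for what raising the threshold from 25 / 100 to 75 / 100 changes, here is a small Python sketch. It only approximates the rule: it assumes roughly one evaluation per minute, so that 15 consecutive samples above the threshold stand in for for: 15m, and the ratio values are hypothetical. The function name alert_fires is made up for illustration.

```python
def alert_fires(ratios, threshold, for_evaluations=15):
    """Return True if ratio > threshold holds for for_evaluations samples in a row.

    Approximation of the rule's `for` clause: any sample at or below the
    threshold resets the streak.
    """
    streak = 0
    for ratio in ratios:
        streak = streak + 1 if ratio > threshold else 0
        if streak >= for_evaluations:
            return True
    return False


if __name__ == "__main__":
    moderate = [0.67] * 15  # hypothetical: steadily throttled in about 2/3 of periods
    severe = [0.90] * 15    # hypothetical: throttled in 90% of periods

    for name, ratios in (("moderate", moderate), ("severe", severe)):
        print(name,
              "25/100:", alert_fires(ratios, 25 / 100),
              "75/100:", alert_fires(ratios, 75 / 100))
```

With these sample values, the moderately throttled container fires under the old threshold but not under the new one, while the severely throttled container fires under both. The higher threshold reduces noise but does not remove the underlying throttling.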
Other related references:
https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/108
https://github.com/prometheus-operator/prometheus-operator/issues/2063
https://github.com/kubernetes/kubernetes/issues/67577
https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/
https://bugzilla.kernel.org/show_bug.cgi?id=198197
https://github.com/torvalds/linux/commit/512ac999d2755d2b7109e996a76b6fb8b888631d
https://gist.github.com/bobrik/2030ff040fad360327a5fab7a09c4ff1
https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
https://github.com/prometheus-operator/kube-prometheus/issues/214
https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/453
https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/b71dd35c6a1d509a1ee902eebe7afe943d8ee4b0/alerts/resource_alerts.libsonnet#L13
https://www.youtube.com/watch?v=UE7QX98-kO0
https://github.com/prometheus-operator/kube-prometheus/issues/861
https://github.com/prometheus-operator/kube-prometheus/blob/main/jsonnet/kube-prometheus/components/alertmanager.libsonnet#L26-L42
https://devops.stackexchange.com/questions/6494/prometheus-alert-cputhrottlinghigh-raised-but-monitoring-does-not-show-it