About CPUThrottlingHigh in kube-prometheus

July 15, 2023

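The scenario we ran into: the CPUThrottlingHigh alert fires as designed, yet the CPU usage of the affected container is not high, and the container may even be mostly idle. That made us question whether this alert is really necessary.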

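In many cases this alert gets modified or silenced, because the application is not latency sensitive and works fine even when throttled, and because the alert is cause-based rather than symptom-based. That is also why its severity is info. None of this makes the alert a false positive, though, and silencing it only hides the real problem behind it.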

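The issue is still under discussion. The debate is especially heated in kubernetes-mixin issue 108, and it continues in kubernetes issue 67577 (both linked in the references at the end).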

The alert expression is:

sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace) / sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace) > ( 25 / 100 )
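To make the expression concrete, here is a minimal Python sketch of the same arithmetic for a single container. The function name and the counter samples are made up for illustration. Prometheus's `increase()` also extrapolates and handles counter resets, which this sketch ignores.

```python
def throttle_ratio(periods_start, periods_end, throttled_start, throttled_end):
    """Fraction of CFS periods in the window during which the container was throttled."""
    periods = periods_end - periods_start
    throttled = throttled_end - throttled_start
    if periods <= 0:
        return None  # no CFS periods elapsed, so the ratio is undefined
    return throttled / periods


THRESHOLD = 25 / 100  # same threshold as the alert expression

# Hypothetical counter samples taken 5 minutes apart
ratio = throttle_ratio(periods_start=12000, periods_end=12400,
                       throttled_start=3000, throttled_end=3120)
print(f"{ratio:.0%}")     # 120 throttled periods out of 400 -> 30%
print(ratio > THRESHOLD)  # True: the alert condition holds
```

Note that the ratio counts throttled periods, not CPU time, so it says nothing directly about how busy the container is on average.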

So far, we have collected a few practical ways to deal with it:

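1. Change the alert's threshold ratio, or disable the alert

2. Remove or adjust the CPU limits on the affected pods

3. Upgrade to kernel 4.18 or newer (see the kernel commit in the references)

4. Disable Kubernetes CFS quota entirely (kubelet flag --cpu-cfs-quota=false)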
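Before picking one of these options, it helps to see why a mostly idle container can trip the alert at all. The Python sketch below is a toy model, not the kernel scheduler. It assumes the default 100 ms CFS period, a hypothetical workload that needs 50 ms of CPU once per second, and that only periods in which the container had work to run are counted. All names and numbers are illustrative.

```python
PERIOD_MS = 100  # default CFS period


def simulate(limit_millicores, burst_cpu_ms, bursts_per_second, seconds=60):
    """Toy model: each burst needs burst_cpu_ms of CPU and starts at a period boundary.

    Returns (throttled period ratio, average CPU usage in cores).
    """
    quota_ms = limit_millicores * PERIOD_MS / 1000  # CPU time allowed per period
    periods = 0
    throttled = 0
    cpu_used_ms = 0.0
    for _ in range(seconds * bursts_per_second):
        remaining = burst_cpu_ms
        while remaining > 0:
            periods += 1
            run = min(remaining, quota_ms)
            cpu_used_ms += run
            remaining -= run
            if remaining > 0:  # quota exhausted with work left: throttled this period
                throttled += 1
    avg_cores = cpu_used_ms / (seconds * 1000)
    return throttled / periods, avg_cores


# Hypothetical container: 200m limit, one 50 ms burst per second
ratio, cores = simulate(limit_millicores=200, burst_cpu_ms=50, bursts_per_second=1)
print(f"{ratio:.0%} throttled at {cores * 1000:.0f}m average CPU")  # 67% throttled at 50m average CPU

# Same workload with a 500m limit: each burst fits into a single period
print(simulate(limit_millicores=500, burst_cpu_ms=50, bursts_per_second=1)[0])  # 0.0
```

In this model the container averages only a quarter of its limit, yet two out of every three counted periods are throttled, which is well above both the 25% and the 75% threshold. Raising the limit, as in option 2, removes the throttling for this workload. Raising the threshold alone would not have silenced it.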

We tried changing the threshold:

kubectl -n monitoring edit PrometheusRule prometheus-k8s-rules

Change the rule to:

    - alert: CPUThrottlingHigh
      annotations:
        description: '{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/cputhrottlinghigh
        summary: Processes experience elevated CPU throttling.
      # threshold raised from 25% to 75%
      expr: |
        sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace) / sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace) > ( 75 / 100 )
      for: 15m
      labels:
        severity: info

Other related references:

https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/108
https://github.com/prometheus-operator/prometheus-operator/issues/2063
https://github.com/kubernetes/kubernetes/issues/67577
https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/
https://bugzilla.kernel.org/show_bug.cgi?id=198197
https://github.com/torvalds/linux/commit/512ac999d2755d2b7109e996a76b6fb8b888631d
https://gist.github.com/bobrik/2030ff040fad360327a5fab7a09c4ff1
https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
https://github.com/prometheus-operator/kube-prometheus/issues/214
https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/453
https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/b71dd35c6a1d509a1ee902eebe7afe943d8ee4b0/alerts/resource_alerts.libsonnet#L13
https://www.youtube.com/watch?v=UE7QX98-kO0
https://github.com/prometheus-operator/kube-prometheus/issues/861
https://github.com/prometheus-operator/kube-prometheus/blob/main/jsonnet/kube-prometheus/components/alertmanager.libsonnet#L26-L42
https://devops.stackexchange.com/questions/6494/prometheus-alert-cputhrottlinghigh-raised-but-monitoring-does-not-show-it
