Monitoring a Ceph Cluster with Prometheus and Configuring AlertManager Alerts

Cloud Computing 2023-05-04 剑圣无痕

I have written before about setting up Ceph. A Ceph cluster needs monitoring like anything else, so here I use Prometheus to monitor Ceph and configure AlertManager alerts.

Contents

MGR

Ceph

Ceph Dashboard

Prometheus

AlertManager Alerts

Grafana

Alert Test

MGR

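Manager: the Ceph manager daemon (ceph-mgr) tracks runtime metrics and the current state of the Ceph cluster, including storage utilization, current performance metrics, and system load. It also hosts Python-based modules that manage and expose cluster information, including the web-based Ceph Dashboard and a RESTful API. High availability usually requires at least two managers. For installing Ceph, see the articles below.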

Ceph-deploy: Quickly Deploying a Ceph Cluster

Ceph Fundamentals and Basic Architecture

Check the cluster status:

[root@ceph-01 ~]# ceph -s
  cluster:
    id:     c8ae7537-8693-40df-8943-733f82049642
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03 (age 2h)
    mgr: ceph-03(active, since 17h), standbys: ceph-01, ceph-02
    mds: cephfs-abcdocker:1 cephfs:1 {cephfs-abcdocker:0=ceph-01=up:active,cephfs:0=ceph-02=up:active} 1 up:standby
    osd: 4 osds: 4 up (since 17h), 4 in (since 3w)
    rgw: 2 daemons active (ceph-01, ceph-02)

  task status:

  data:
    pools:   13 pools, 656 pgs
    objects: 3.77k objects, 12 GiB
    usage:   41 GiB used, 139 GiB / 180 GiB avail
    pgs:     656 active+clean

  io:
    client:   2.0 KiB/s wr, 0 op/s rd, 0 op/s wr

# ceph-03 is currently the active mgr; ceph-01 and ceph-02 are standbys
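
To pick out the active mgr from a script instead of reading `ceph -s` by eye, one option is to parse the output of `ceph mgr dump --format json`. The following is a minimal sketch, assuming the JSON carries the `active_name` and `standbys` fields; the sample document is hand-written to match the cluster above, not captured from it.

```python
import json


def mgr_roles(mgr_dump_json: str) -> dict:
    """Return the active mgr name and the sorted standby names."""
    data = json.loads(mgr_dump_json)
    standbys = sorted(entry["name"] for entry in data.get("standbys", []))
    return {"active": data.get("active_name"), "standbys": standbys}


# Hand-written sample (an assumption), shaped like `ceph mgr dump --format json`.
SAMPLE = json.dumps({
    "active_name": "ceph-03",
    "standbys": [{"name": "ceph-02"}, {"name": "ceph-01"}],
})
```

In practice the string would come from running the command, for example through `subprocess.run(..., capture_output=True)`.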

Ceph

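ceph-03 is the active mgr, so the dashboard has to be enabled on ceph-03 first.

First, install the Ceph Dashboard package.

# Install the dashboard package on the ceph-mgr node
yum install -y ceph-mgr-dashboard

Check whether the dashboard module is enabled.

It is not enabled by default.

# If dashboard already appears among the enabled modules, skip the enable step
ceph mgr module ls | grep dashboard

# Enable the dashboard
ceph mgr module enable dashboard
# If that reports an error, force it
ceph mgr module enable dashboard --force

Configure where the mgr dashboard listens.

# Listen address
ceph config set mgr mgr/dashboard/server_addr 0.0.0.0

# Port
ceph config set mgr mgr/dashboard/server_port 7000

# Disable SSL so the dashboard is served over plain HTTP
ceph config set mgr mgr/dashboard/ssl false

Create an administrator user

Recent Ceph Dashboard releases no longer accept the user's password directly on the command line, so first create a file that contains the password.

# Create the password file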
cat >/opt/secretkey<
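
The rest of the command above did not survive in this copy of the article. As a rough sketch of the two steps it belongs to (write the password file, then create the user from it), the following assumes the `ceph dashboard ac-user-create <username> -i <file> <role>` form of the command. The username `admin`, the password, and both function names are placeholders of mine, not the original values.

```python
import os


def write_password_file(path: str, password: str) -> None:
    """Write the password to `path`; a newly created file gets mode 0600."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(password)


def user_create_command(username: str, password_file: str,
                        role: str = "administrator") -> list:
    """Build the argv for creating a dashboard user from a password file."""
    return ["ceph", "dashboard", "ac-user-create",
            username, "-i", password_file, role]


# Example (hypothetical values), to be run on a node with an admin keyring:
#   write_password_file("/opt/secretkey", "change-me")
#   subprocess.run(user_create_command("admin", "/opt/secretkey"), check=True)
```

Keeping the password in a file rather than on the command line keeps it out of the shell history and the process list.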
Ceph Dashboard
Once the user and password exist, we can test access on port 7000 of the mgr.
The mgr dashboard can be enabled on several nodes and fronted by an nginx upstream.
[root@ceph-03 ~]# lsof -i:7000
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
ceph-mgr 9568 ceph   36u  IPv4 353517      0t0  TCP *:afs3-fileserver (LISTEN)
The ceph command can also show which node serves the dashboard:
[root@ceph-03 ~]# ceph mgr services
{
    "dashboard": "http://ceph-03:7000/"
}
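
When the active mgr fails over, the dashboard URL moves with it, so a script or a proxy health check may want to read it from `ceph mgr services` instead of hard-coding ceph-03. A small sketch, using the JSON shown above as sample input (the function name is mine):

```python
import json
from urllib.parse import urlparse


def dashboard_endpoint(services_json: str):
    """Return (host, port) of the dashboard URL, or None if none is published."""
    url = json.loads(services_json).get("dashboard")
    if not url:
        return None
    parsed = urlparse(url)
    return parsed.hostname, parsed.port


# Sample copied from the `ceph mgr services` output above.
SAMPLE = '{"dashboard": "http://ceph-03:7000/"}'
```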
Open port 7000 on the ceph mgr IP:
http://<ceph-mgr-node>:7000/#/login?returnUrl=%2Fdashboard
Enter the username and password.

After a successful login, the dashboard home page is shown.

Prometheus
For background on Prometheus and a Docker-based installation, see:

Monitoring a MySQL Database with Prometheus

Enable the Prometheus metrics module:
ceph mgr module enable prometheus
By default, the ceph mgr serves metrics on port 9283:
[root@ceph-03 ~]# lsof -i:9283
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
ceph-mgr 9568 ceph   33u  IPv6 360155      0t0  TCP *:callwaveiam (LISTEN)

# Test the metrics endpoint
[root@ceph-03 ~]# curl 127.0.0.1:9283/metrics|head
# HELP ceph_mds_mem_dir_minus Directories closed
# TYPE ceph_mds_mem_dir_minus counter
ceph_mds_mem_dir_minus{ceph_daemon="mds.ceph-01"} 0.0
# HELP ceph_mds_mem_dir_plus Directories opened
# TYPE ceph_mds_mem_dir_plus counter
ceph_mds_mem_dir_plus{ceph_daemon="mds.ceph-01"} 12.0
# HELP ceph_osd_flag_norebalance OSD Flag norebalance
# TYPE ceph_osd_flag_norebalance untyped
ceph_osd_flag_norebalance 0.0
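
The endpoint returns the Prometheus text exposition format: `# HELP` and `# TYPE` comment lines followed by `name{labels} value` samples. To make that layout concrete, here is a minimal parsing sketch. It is deliberately simplified (it assumes no timestamps after the value) and the sample text is an excerpt of the output above.

```python
def parse_samples(text: str) -> dict:
    """Map each series (metric name plus labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


SAMPLE = """\
# HELP ceph_mds_mem_dir_plus Directories opened
# TYPE ceph_mds_mem_dir_plus counter
ceph_mds_mem_dir_plus{ceph_daemon="mds.ceph-01"} 12.0
# HELP ceph_osd_flag_norebalance OSD Flag norebalance
# TYPE ceph_osd_flag_norebalance untyped
ceph_osd_flag_norebalance 0.0
"""
```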
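Add the mgr metrics endpoint to the Prometheus scrape configuration. The target below is the address I scrape in my environment; use your own mgr host and metrics port (9283 by default).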
  - job_name: 'ceph-mgr'
    static_configs:
    - targets: ['82.157.142.150:7002']

Reload or restart Prometheus so that the new job takes effect.
Check the result.
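
Besides the Targets page in the web UI, the scrape state can be read from the Prometheus HTTP API (`GET /api/v1/targets`). The sketch below only interprets a response body. The sample JSON is hand-written in the shape of that API (`data.activeTargets[].labels`, `scrapeUrl`, `health`) and is not output from my server.

```python
import json


def job_target_health(targets_json: str, job: str) -> dict:
    """Map each scrape URL of the given job to its health ('up', 'down', 'unknown')."""
    payload = json.loads(targets_json)
    health = {}
    for target in payload["data"]["activeTargets"]:
        if target["labels"].get("job") == job:
            health[target["scrapeUrl"]] = target["health"]
    return health


# Hand-written sample (an assumption), shaped like /api/v1/targets.
SAMPLE = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "ceph-mgr", "instance": "82.157.142.150:7002"},
         "scrapeUrl": "http://82.157.142.150:7002/metrics",
         "health": "up"},
        {"labels": {"job": "prometheus", "instance": "localhost:9090"},
         "scrapeUrl": "http://localhost:9090/metrics",
         "health": "up"},
    ]},
})
```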

AlertManager Alerts
With the metrics being collected, we can now set up alert rules for Ceph.
For setting up AlertManager, see:

Configuring WeChat Alerts with AlertManager

[root@prometheus ~]# vim /etc/prometheus/rules/ceph_exporter.yaml

groups:
    - name: Ceph status
      rules:
      - alert: Ceph instance unhealthy
        expr: ceph_health_status != 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Ceph instance unhealthy (instance {{ $labels.instance }})
          description: "Ceph instance unhealthy\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: Ceph monitor clock skew detected
        expr: abs(ceph_monitor_clock_skew_seconds) > 0.2
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Ceph monitor clock skew (instance {{ $labels.instance }})
          description: "Ceph monitor clock skew detected. Please check ntp and hardware clock settings\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: Ceph monitor low on storage space
        expr: ceph_monitor_avail_percent < 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Ceph monitor low space (instance {{ $labels.instance }})
          description: "Ceph monitor storage is low.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: Ceph OSD down
        expr: ceph_osd_up == 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Ceph OSD Down (instance {{ $labels.instance }})
          description: "Ceph Object Storage Daemon Down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: Ceph high OSD latency
        expr: ceph_osd_perf_apply_latency_seconds > 5
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: Ceph high OSD latency (instance {{ $labels.instance }})
          description: "Ceph Object Storage Daemon latency is high. Please check that it is not stuck in a weird state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: Ceph OSD low on space
        expr: ceph_osd_utilization > 90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Ceph OSD low space (instance {{ $labels.instance }})
          description: "Ceph Object Storage Daemon is running out of space. Please add more disks.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: Ceph OSD reweighted
        expr: ceph_osd_weight < 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Ceph OSD reweighted (instance {{ $labels.instance }})
          description: "Ceph Object Storage Daemon takes too much time to resize.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: Ceph PG down
        expr: ceph_pg_down > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Ceph PG down (instance {{ $labels.instance }})
          description: "Some Ceph placement groups are down. Please ensure that all the data are available.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: Ceph PG incomplete
        expr: ceph_pg_incomplete > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Ceph PG incomplete (instance {{ $labels.instance }})
          description: "Some Ceph placement groups are incomplete. Please ensure that all the data are available.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: Ceph PG inconsistent
        expr: ceph_pg_inconsistent > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Ceph PG inconsistent (instance {{ $labels.instance }})
          description: "Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: Ceph PG activation taking long
        expr: ceph_pg_activating > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Ceph PG activation long (instance {{ $labels.instance }})
          description: "Some Ceph placement groups are taking too long to activate.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: Ceph PG backfill full
        expr: ceph_pg_backfill_toofull > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Ceph PG backfill full (instance {{ $labels.instance }})
          description: "Some Ceph placement groups are located on full Object Storage Daemon on cluster. Those PGs can be unavailable shortly. Please check OSDs, change weight or reconfigure CRUSH rules.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: Ceph PG unavailable
        expr: ceph_pg_total - ceph_pg_active > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Ceph PG unavailable (instance {{ $labels.instance }})
          description: "Some Ceph placement groups are unavailable.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
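
Metric names differ between Ceph exporters and Ceph releases, and a rule whose metric is never exported simply never fires. It may therefore be worth checking every metric used in the rules against what the `/metrics` endpoint actually exposes. The sketch below is a rough check under a simplifying assumption: it only handles plain expressions like the ones above (identifiers, comparisons, `abs()`), not full PromQL. The `exposed` set in the usage note is made up for illustration.

```python
import re

# PromQL function names to ignore; only abs() is used in the rules above.
PROMQL_FUNCTIONS = {"abs"}


def metric_names(expr: str) -> set:
    """Collect the identifiers in a simple PromQL expression that are not functions."""
    tokens = re.findall(r"[a-zA-Z_:][a-zA-Z0-9_:]*", expr)
    return {token for token in tokens if token not in PROMQL_FUNCTIONS}


def missing_metrics(exprs, exposed_names) -> set:
    """Return metric names used in the expressions but absent from the exporter."""
    wanted = set()
    for expr in exprs:
        wanted |= metric_names(expr)
    return wanted - set(exposed_names)
```

For example, with `exprs = ["ceph_health_status != 0", "abs(ceph_monitor_clock_skew_seconds) > 0.2"]` and a hypothetical `exposed = {"ceph_health_status"}`, `missing_metrics(exprs, exposed)` returns `{"ceph_monitor_clock_skew_seconds"}`.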
Restart Prometheus and AlertManager:
[root@prometheus ~]# docker restart prometheus_new
prometheus_new
[root@prometheus ~]# docker restart alertmanager
alertmanager
The Ceph alert rules now show up on the Alerts page in Prometheus.

Grafana
For installing Grafana with Docker, see:

Monitoring a MySQL Database with Prometheus

Importing the Grafana dashboard
# Dashboard JSON that I host for download
wget https://d.frps.cn/file/tools/grafana/ceph/ceph-cluster_rev1.json

# Download link for Windows
https://d.frps.cn/?tools/grafana/ceph

Upload the JSON manually in Grafana,
or import dashboard ID 2842 (https://grafana.com/grafana/dashboards/2842).
Dashboard ID 9966 also works.
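
If you would rather script the import than upload it by hand, the JSON usually needs a little preparation. Dashboards exported for sharing tend to carry an `__inputs` section and `${DS_...}` placeholders where the data source name belongs. The sketch below rests on two assumptions: that the downloaded file follows this convention, and that the result is sent to Grafana's `POST /api/dashboards/db` endpoint. The data source name `Prometheus` and the sample dashboard in the usage note are hypothetical.

```python
def resolve_inputs(dashboard: dict, datasources: dict) -> dict:
    """Replace ${INPUT} placeholders and drop the export-only metadata keys."""
    def substitute(node):
        if isinstance(node, dict):
            return {key: substitute(value) for key, value in node.items()}
        if isinstance(node, list):
            return [substitute(value) for value in node]
        if isinstance(node, str):
            for name, value in datasources.items():
                node = node.replace("${" + name + "}", value)
        return node

    kept = {key: value for key, value in dashboard.items()
            if key not in ("__inputs", "__requires")}
    resolved = substitute(kept)
    resolved["id"] = None  # let Grafana assign a new id
    return resolved


def import_payload(dashboard: dict, datasources: dict) -> dict:
    """Build the request body for POST /api/dashboards/db."""
    return {"dashboard": resolve_inputs(dashboard, datasources), "overwrite": True}
```

Calling `import_payload(json.load(open("ceph-cluster_rev1.json")), {"DS_PROMETHEUS": "Prometheus"})` would give a body that can be posted with an API token. Whether the placeholder is actually named `DS_PROMETHEUS` has to be checked in the file's `__inputs` list.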

Alert Test
We reboot ceph-02 and check whether alerts are raised.
The Ceph status is now abnormal:
[root@ceph-01 ~]# ceph -s
  cluster:
    id:     c8ae7537-8693-40df-8943-733f82049642
    health: HEALTH_WARN
            1 filesystem is degraded
            insufficient standby MDS daemons available
            1 osds down
            1 host (1 osds) down
            Degraded data redundancy: 3803/11409 objects degraded (33.333%), 330 pgs degraded
            1/3 mons down, quorum ceph-01,ceph-03

  services:
    mon: 3 daemons, quorum ceph-01,ceph-03 (age 16s), out of quorum: ceph-02
    mgr: ceph-03(active, since 40m), standbys: ceph-01, ceph-02
    mds: cephfs-abcdocker:1 cephfs:1/1 {cephfs-abcdocker:0=ceph-01=up:active,cephfs:0=ceph-03=up:replay}
    osd: 4 osds: 3 up (since 15s), 4 in (since 3w)
    rgw: 1 daemon active (ceph-01)

  task status:

  data:
    pools:   13 pools, 656 pgs
    objects: 3.80k objects, 13 GiB
    usage:   41 GiB used, 139 GiB / 180 GiB avail
    pgs:     3803/11409 objects degraded (33.333%)
             330 active+undersized+degraded
             326 active+undersized

  io:
    client:   115 KiB/s wr, 0 op/s rd, 3 op/s wr
Prometheus already shows the alerts firing.
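
To see why some rules fire in this state and others do not, the conditions can be evaluated by hand. The sketch below mirrors three of the rule expressions in plain Python. It assumes the mgr module's encoding of `ceph_health_status` (0 for HEALTH_OK, 1 for HEALTH_WARN, 2 for HEALTH_ERR). The per-OSD values are hypothetical, since the output above does not say which OSD lives on ceph-02, and the English alert labels are only illustrative.

```python
HEALTH_CODES = {"HEALTH_OK": 0, "HEALTH_WARN": 1, "HEALTH_ERR": 2}


def firing_alerts(health: str, osd_up: dict, pg_total: int, pg_active: int) -> list:
    """Evaluate three of the rule conditions against a snapshot of values."""
    alerts = []
    if HEALTH_CODES[health] != 0:            # ceph_health_status != 0
        alerts.append("Ceph instance unhealthy")
    for osd, up in sorted(osd_up.items()):   # ceph_osd_up == 0
        if up == 0:
            alerts.append(f"Ceph OSD down ({osd})")
    if pg_total - pg_active > 0:             # ceph_pg_total - ceph_pg_active > 0
        alerts.append("Ceph PG unavailable")
    return alerts


# Snapshot modelled on the `ceph -s` output above; which OSD is down is a guess.
SNAPSHOT = dict(
    health="HEALTH_WARN",
    osd_up={"osd.0": 1, "osd.1": 0, "osd.2": 1, "osd.3": 1},
    pg_total=656,
    pg_active=656,  # 330 + 326 PGs are still active, just undersized/degraded
)
```

With this snapshot the health and OSD rules fire, while the PG-unavailable rule stays quiet because every placement group is still active.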

Then wait for AlertManager to send the notification.
The content of the notification can be tuned to your own needs.

Related articles:
Ceph-deploy: Quickly Deploying a Ceph Cluster
Scaling Ceph OSDs Out and In
Everyday Commands for Ceph Clusters
Ceph RBD Backup and Restore