Prometheus监控Ceph集群并设置AlertManager告警

之前写过ceph的搭建,那么我们ceph同样也需要prometheus进行监控数据。我这里使用prometheus监控ceph并配置alertmanager告警

文章目录

  • MGR
  • Ceph-deploy 快速部署Ceph集群
  • Ceph基础知识和基础架构
  • Ceph
  • Ceph Dashboard
  • Prometheus
  • Prometheus 监控MySQL数据库
  • AlertManager告警
  • AlertManager 微信告警配置
  • Grafana
  • Prometheus 监控MySQL数据库
  • 告警测试
  • 相关文章:
  • MGR

    Manager:管理器 Ceph管理器守护程序(Cephmgr)负责跟踪运行时指标和Ceph群集的当前状态,包括存储利用率、当前性能指标和系统负载。Ceph管理器守护进程还托管基于python的模块,以管理和公开Ceph集群信息,包括基于web的Ceph仪表板和Restful APT。高可用性通常需要至少两个管理器。
     
    CEPH安装可以参考下面文章

    Ceph-deploy 快速部署Ceph集群

    新闻联播老司机

  • 22年2月18日
  • 喜欢:1
  • 浏览:1.6k
  • Ceph基础知识和基础架构

    新闻联播老司机

  • 22年2月18日
  • 喜欢:0
  • 浏览:1.1k
  • 查看集群状态

    [root@ceph-01 ~]# ceph -s
      cluster:
        id:     c8ae7537-8693-40df-8943-733f82049642
        health: HEALTH_OK
    
      services:
        mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03 (age 2h)
        mgr: ceph-03(active, since 17h), standbys: ceph-01, ceph-02
        mds: cephfs-abcdocker:1 cephfs:1 {cephfs-abcdocker:0=ceph-01=up:active,cephfs:0=ceph-02=up:active} 1 up:standby
        osd: 4 osds: 4 up (since 17h), 4 in (since 3w)
        rgw: 2 daemons active (ceph-01, ceph-02)
    
      task status:
    
      data:
        pools:   13 pools, 656 pgs
        objects: 3.77k objects, 12 GiB
        usage:   41 GiB used, 139 GiB / 180 GiB avail
        pgs:     656 active+clean
    
      io:
        client:   2.0 KiB/s wr, 0 op/s rd, 0 op/s wr
    
    #目前我们集群是ceph-03提供服务,ceph-02,ceph-01为从节点

    Ceph

    ceph-03为提供服务的节点,所以先需要在ceph-03开启dashboard

    我们首先需要安装ceph dashboard

    #在ceph-mgr节点开启dashboard
    
    yum install -y ceph-mgr-dashboard

    检查是否开启dashboard模块

    默认没有开启

    #如果可以过滤出来,则不需要开启
    ceph mgr module ls | grep dashboard
    
    #开启dashboard
    ceph mgr module enable dashboard
    #提示报错可以使用
    ceph mgr module enable dashboard --force

    设置mgr-dashboard监听信息

    #设置监听地址
    ceph config set mgr mgr/dashboard/server_addr 0.0.0.0
    
    #设置端口号
    ceph config set mgr mgr/dashboard/server_port 7000
    
    #关闭ssl验证
    ceph config set mgr mgr/dashboard/ssl false

    创建管理员用户

    最新的ceph dashboard不支持直接在命令行里面创建用户的密码,所以需要.先创建一个包含用户密码的文件

    # 创建密码文本
    cat >/opt/secretkey<<EOF
    123123
    EOF
    # 使用该文本作为密钥 
    ceph dashboard ac-user-create admin administrator -i /opt/secretkey 
    #administrator为管理员权限

    Ceph Dashboard

    创建完用户名密码,我们就可以访问mgr:7000 端口测试

    mgr-dashboard可以设置多个节点,然后nginx upstream代理

    [root@ceph-03 ~]# lsof -i:7000
    COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
    ceph-mgr 9568 ceph   36u  IPv4 353517      0t0  TCP *:afs3-fileserver (LISTEN)

    还可以使用ceph命令查看dashboard节点

    [root@ceph-03 ~]# ceph mgr services
    {
        "dashboard": "http://ceph-03:7000/"
    }

    访问ceph mgr ip:7000 端口

    http://ceph-mgr节点:7000/#/login?returnUrl=%2Fdashboard

    输入用户名密码
    Prometheus监控Ceph集群并设置AlertManager告警-每日运维
    登陆成功后界面
    1656256592835.png

    Prometheus

    Promethues相关知识可以参考,Docker版本安装

    Prometheus 监控MySQL数据库

    新闻联播老司机

  • 20年6月9日
  • 喜欢:0
  • 浏览:4k
  • 开启Prometheus metric模块

    ceph mgr module enable prometheus

    默认ceph mgr metric端口为9283

    [root@ceph-03 ~]# lsof -i:9283
    COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
    ceph-mgr 9568 ceph   33u  IPv6 360155      0t0  TCP *:callwaveiam (LISTEN)
    
    #测试metric
    [root@ceph-03 ~]# curl 127.0.0.1:9283/metrics|head
    # HELP ceph_mds_mem_dir_minus Directories closed
    # TYPE ceph_mds_mem_dir_minus counter
    ceph_mds_mem_dir_minus{ceph_daemon="mds.ceph-01"} 0.0
    # HELP ceph_mds_mem_dir_plus Directories opened
    # TYPE ceph_mds_mem_dir_plus counter
    ceph_mds_mem_dir_plus{ceph_daemon="mds.ceph-01"} 12.0
    # HELP ceph_osd_flag_norebalance OSD Flag norebalance
    # TYPE ceph_osd_flag_norebalance untyped
    ceph_osd_flag_norebalance 0.0

    加入Prometheus监控

      - job_name: 'ceph-mgr'
        static_configs:
        - targets: ['82.157.142.150:7002']

    1656257342485.png

    reload Prometheus,自行重启Prometheus

    查看效果
    1656257273021.png

    AlertManager告警

    监控项设置完了,我们设置一下ceph的告警规则
    alertmanager搭建可以看下面的文章

    AlertManager 微信告警配置

    新闻联播老司机

  • 22年2月14日
  • 喜欢:2
  • 浏览:2.3k
  • [root@prometheus ~]# vim /etc/prometheus/rules/ceph_exporter.yaml
    
    groups:
        - name: Ceph status
          rules:
          - alert: Ceph 实例不健康
            expr: ceph_health_status != 0
            for: 0m
            labels:
              severity: critical
            annotations:
              summary: Ceph 实例不健康{{ $labels.instance }})
              description: "Ceph instance unhealthyn  VALUE = {{ $value }}n  LABELS = {{ $labels }}"
          - alert: 检测到Ceph监视器时钟偏差
            expr: abs(ceph_monitor_clock_skew_seconds) > 0.2
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: Ceph monitor clock skew (instance {{ $labels.instance }})
              description: "Ceph monitor clock skew detected. Please check ntp and hardware clock settingsn  VALUE = {{ $value     }}n  LABELS = {{ $labels }}"
          - alert: Ceph监视器存储空间不足
            expr: ceph_monitor_avail_percent < 10
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: Ceph monitor low space (instance {{ $labels.instance }})
              description: "Ceph monitor storage is low.n  VALUE = {{ $value }}n  LABELS = {{ $labels }}"
          - alert: Ceph对象存储守护进程关闭
            expr: ceph_osd_up == 0
            for: 0m
            labels:
              severity: critical
            annotations:
              summary: Ceph OSD Down (instance {{ $labels.instance }})
              description: "Ceph Object Storage Daemon Downn  VALUE = {{ $value }}n  LABELS = {{ $labels }}"
    
          - alert: Ceph高OSD延迟
            expr: ceph_osd_perf_apply_latency_seconds > 5
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: Ceph high OSD latency (instance {{ $labels.instance }})
              description: "Ceph Object Storage Daemon latency is high. Please check if it doesn't stuck in weird state.n  VALUE     = {{ $value }}n  LABELS = {{ $labels }}"
    
          - alert: CephOSD空间不足
            expr: ceph_osd_utilization > 90
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: Ceph OSD low space (instance {{ $labels.instance }})
              description: "Ceph Object Storage Daemon is going out of space. Please add more disks.n  VALUE = {{ $value }}n      LABELS = {{ $labels }}"
    
          - alert: CephOSD重新加权
            expr: ceph_osd_weight < 1
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: Ceph OSD reweighted (instance {{ $labels.instance }})
              description: "Ceph Object Storage Daemon takes too much time to resize.n  VALUE = {{ $value }}n  LABELS = {{     $labels }}"
          - alert: CephPG下降
            expr: ceph_pg_down > 0
            for: 0m
            labels:
              severity: critical
            annotations:
              summary: Ceph PG down (instance {{ $labels.instance }})
              description: "Some Ceph placement groups are down. Please ensure that all the data are available.n  VALUE = {{     $value }}n  LABELS = {{ $labels }}"
          - alert: CephPG不完整
            expr: ceph_pg_incomplete > 0
            for: 0m
            labels:
              severity: critical
            annotations:
              summary: Ceph PG incomplete (instance {{ $labels.instance }})
              description: "Some Ceph placement groups are incomplete. Please ensure that all the data are available.n  VALUE =     {{ $value }}n  LABELS = {{ $labels }}"
          - alert: CephPG不一致
            expr: ceph_pg_inconsistent > 0
            for: 0m
            labels:
              severity: warning
            annotations:
              summary: Ceph PG inconsistent (instance {{ $labels.instance }})
              description: "Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.n  VALUE     = {{ $value }}n  LABELS = {{ $labels }}"
          - alert: CephPG激活时间长
            expr: ceph_pg_activating > 0
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: Ceph PG activation long (instance {{ $labels.instance }})
              description: "Some Ceph placement groups are too long to activate.n  VALUE = {{ $value }}n  LABELS = {{ $labels }}"
          - alert: Ceph PG回填已满
            expr: ceph_pg_backfill_toofull > 0
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: Ceph PG backfill full (instance {{ $labels.instance }})
              description: "Some Ceph placement groups are located on full Object Storage Daemon on cluster. Those PGs can be     unavailable shortly. Please check OSDs, change weight or reconfigure CRUSH rules.n  VALUE = {{ $value }}n  LABELS     = {{ $labels }}"
          - alert: Ceph PG不可用
            expr: ceph_pg_total - ceph_pg_active > 0
            for: 0m
            labels:
              severity: critical
            annotations:
              summary: Ceph PG unavailable (instance {{ $labels.instance }})
              description: "Some Ceph placement groups are unavailable.n  VALUE = {{ $value }}n  LABELS = {{ $labels }}"

    重启Prometheus和alertmanager

    [root@prometheus ~]# docker restart prometheus_new
    prometheus_new
    [root@prometheus ~]# docker restart alertmanager
    alertmanager

    进入Prometheus就可以看到alert,看到ceph 告警信息
    1656257984391.png

    Grafana

    Grafana Docker安装

    Prometheus 监控MySQL数据库

    新闻联播老司机

  • 20年6月9日
  • 喜欢:0
  • 浏览:4k
  • Grafana上传模板

    #我提供的grafana下载地址
    wget https://d.frps.cn/file/tools/grafana/ceph/ceph-cluster_rev1.json
    #将这个json下载下来
    
    #windows下载
    https://d.frps.cn/?tools/grafana/ceph

    Prometheus监控Ceph集群并设置AlertManager告警-每日运维
    手动上传jsonPrometheus监控Ceph集群并设置AlertManager告警-每日运维
    或者使用这个id2842https://grafana.com/grafana/dashboards/2842
    9966也可以使用
    1656259187007.png1656259223850.png

    告警测试

    我们将ceph-02重启,检查是否可以告警Prometheus监控Ceph集群并设置AlertManager告警-每日运维
    ceph状态已经异常

    [root@ceph-01 ~]# ceph -s
      cluster:
        id:     c8ae7537-8693-40df-8943-733f82049642
        health: HEALTH_WARN
                1 filesystem is degraded
                insufficient standby MDS daemons available
                1 osds down
                1 host (1 osds) down
                Degraded data redundancy: 3803/11409 objects degraded (33.333%), 330 pgs degraded
                1/3 mons down, quorum ceph-01,ceph-03
    
      services:
        mon: 3 daemons, quorum ceph-01,ceph-03 (age 16s), out of quorum: ceph-02
        mgr: ceph-03(active, since 40m), standbys: ceph-01, ceph-02
        mds: cephfs-abcdocker:1 cephfs:1/1 {cephfs-abcdocker:0=ceph-01=up:active,cephfs:0=ceph-03=up:replay}
        osd: 4 osds: 3 up (since 15s), 4 in (since 3w)
        rgw: 1 daemon active (ceph-01)
    
      task status:
    
      data:
        pools:   13 pools, 656 pgs
        objects: 3.80k objects, 13 GiB
        usage:   41 GiB used, 139 GiB / 180 GiB avail
        pgs:     3803/11409 objects degraded (33.333%)
                 330 active+undersized+degraded
                 326 active+undersized
    
      io:
        client:   115 KiB/s wr, 0 op/s rd, 3 op/s wr

    Prometheus 已经提示告警
    Prometheus监控Ceph集群并设置AlertManager告警-每日运维
    等待alertmanager发送告警内容

    具体发送内容,自行优化即可

    1656259510614.png

    相关文章:

    1. Ceph-deploy 快速部署Ceph集群
    2. Ceph OSD扩容与缩容
    3. Ceph集群日常使用命令
    4. Ceph RBD 备份与恢复