Debugging log alerts with vector and alertmanager

August 10, 2023

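Log alerting is a problem you cannot avoid. At any stage, knowing what errors a program is logging helps you find and locate problems early.

In the past, the common approaches were regex matching inside logstash if conditionals, third-party tools that read from ES, or alerts triggered from Grafana.

Alibaba Cloud and Tencent Cloud also offer log filtering, with multi-level processing built in.

In a traditional ELK stack, fluentd can also take on this job. Among newer open-source projects, these older tools are gradually being replaced by Alibaba's ilogtail, NetEase's loggie-io, and Datadog's vector.

vector is written in Rust and is faster than logstash at processing and consuming logs. Here I will share how to debug a vector pipeline that raises an alert when a keyword appears in a log line. logstash can count repeated logs and silence them on its own, whereas vector only filters and forwards, so alertmanager takes over that role.

Before we start, we need to understand how alertmanager receives alerts: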

alertmanager

Installing alertmanager

Here is a sample config.yml:

mkdir /data/alertmanager -p
cat > /data/alertmanager/config.yml << EOF
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 24h
  receiver: 'webhookw'  # default receiver; it must be one of the receivers defined below
  routes:
  - receiver: 'webhooke'
    group_by: ['alertname', 'instance']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 24h      
    match:
      severity: 'critical'
  - receiver: 'webhookw'
    group_by: ['alertname', 'instance']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 24h
    match:
      severity: 'warning'
receivers:
- name: 'webhookw'
  webhook_configs:
  - send_resolved: true
    url: 'http://webhook-dingtalk:8060/dingtalk/webhookw/send'
- name: 'webhooke'
  webhook_configs:
  - send_resolved: true
    url: 'http://webhook-dingtalk:8060/dingtalk/webhooke/send'
inhibit_rules:
  - source_match_re:
      alertname: 'node_host_lost|PodMemoryUsage'
    source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['ltype']
EOF
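The routing tree in this config can be read as a first-match lookup on the severity label. The Python sketch below is a simplified model of that idea, not alertmanager's actual implementation. It assumes the second child route is meant to match severity warning exactly, it ignores grouping and timing, and the name default-receiver is a hypothetical stand-in for the top-level receiver.

```python
from typing import Dict, List, Tuple

# Child routes as (required labels, receiver), in the order they appear in config.yml.
CHILD_ROUTES: List[Tuple[Dict[str, str], str]] = [
    ({"severity": "critical"}, "webhooke"),
    ({"severity": "warning"}, "webhookw"),
]


def pick_receiver(labels: Dict[str, str], default: str = "default-receiver") -> str:
    """Return the receiver of the first child route whose labels all match exactly."""
    for required, receiver in CHILD_ROUTES:
        if all(labels.get(key) == value for key, value in required.items()):
            return receiver
    # No child route matched, so the top-level route's receiver is used.
    return default
```

Under these assumptions, an alert labelled severity critical goes to webhooke, one labelled warning goes to webhookw, and anything else falls back to the top-level receiver.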

docker-compose

version: "2.2"

services:
  alertmanager:
    container_name: alertmanager
    restart: always
    image: registry.cn-zhangjiakou.aliyuncs.com/marksugar-k8s/alertmanager:v0.24.0
    ports:
      - "9093:9093"
    volumes:
    - /etc/localtime:/etc/localtime:ro  # use the host time zone
    - /data/alertmanager/config.yml:/etc/alertmanager/config.yml
    logging:
      driver: "json-file"
      options:
        max-size: "100M"
    mem_limit: 4096m

To send an alert to alertmanager, the payload has to follow this format:

[
    {
        "labels": {
            "alertname": "name1",
            "dev": "sda1",
            "instance": "example3",
            "severity": "warning"
        }
    }
]

For example:

alerts1='[
    {
        "labels": {
            "alertname": "name1",
            "dev": "sda1",
            "instance": "example3",
            "severity": "warning"
        }
    }
]'
curl -XPOST -d"$alerts1" http://172.16.100.151:9093/api/v1/alerts

The request returns success:

[root@master-01 /var/log]# curl -XPOST -d"$alerts1" http://172.16.100.151:9093/api/v1/alerts
{"status":"success"}
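The same request can be scripted. The following is a minimal Python sketch using only the standard library. The function names build_alerts and send_alerts are my own, and the URL in the usage comment is the one from the curl call above. Treat it as an illustration of the payload shape rather than a finished client.

```python
import json
import urllib.request
from typing import Dict


def build_alerts(labels: Dict[str, str]) -> bytes:
    """Encode one alert as the JSON array that alertmanager expects."""
    return json.dumps([{"labels": labels}]).encode("utf-8")


def send_alerts(url: str, body: bytes) -> str:
    """POST the encoded alerts and return the raw response text."""
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.read().decode("utf-8")


# Usage (needs a reachable alertmanager):
#   body = build_alerts({"alertname": "name1", "dev": "sda1",
#                        "instance": "example3", "severity": "warning"})
#   print(send_alerts("http://172.16.100.151:9093/api/v1/alerts", body))
```

Keeping the payload builder separate from the network call makes it easy to inspect the JSON before anything is sent.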

The alert is now visible in the web UI:

image-20230808214028620.png

vector

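Now that we understand alertmanager, we follow the official vector documentation to put together the configuration below and debug it:

  • Configuration notes

[sources.filetest]: the data source

[transforms.ftest]: filters the data

[transforms.remap_alert_udev]: remaps the data. It plays the role that grok played in logstash, but is more powerful.

condition = "match!(.message, r'.*WebApplicationContext*.')" keeps only the log lines that contain the keyword WebApplicationContext

The log line is then parsed as JSON and reassembled into the alertmanager payload format: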

source = """
. = parse_json!(.message) 
. = [
  {
      "labels": {
          "alertname": .fields.podname,
          "namespace": .fields.namespace,
          "environment": .fields.environment,
          "podname": .fields.podname,
          "nodename": .fields.nodename,
          "topic": .fields.topic,
          "body": .body,
          "severity": "critical"
      }
  }
]
"""

A console sink prints the result for debugging:

[sinks.sink0]
inputs = ["remap_alert_*"]
target = "stdout"
type = "console"
[sinks.sink0.encoding]
codec = "json"

An http sink sends the result to alertmanager:

[sinks.alertmanager]
type = "http"
inputs = ["remap_alert_*"]
uri = "http://172.16.100.151:9093/api/v1/alerts"
compression = "none"
encoding.codec = "json"
acknowledgements.enabled = true
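One detail worth checking while debugging: alertmanager expects a JSON array, while the console sink prints each event as a single object. My understanding, which you should verify against the http sink documentation for your vector version, is that with the json codec the http sink joins the events of a batch into one JSON array before posting. The Python sketch below only illustrates that assumed framing. The function encode_batch is hypothetical and is not part of vector.

```python
import json
from typing import Any, Dict, List


def encode_batch(events: List[Dict[str, Any]]) -> str:
    """Join individually encoded events into one JSON array body."""
    return "[" + ",".join(json.dumps(event) for event in events) + "]"
```

If that assumption holds, a batch containing one remapped event produces a body of the form [{"labels": {...}}], which is the same shape as the hand-written curl payload.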

The final vector.toml looks like this:

[api]
enabled = true
address = "0.0.0.0:8686"

[sources.filetest]
type = "file"
include = ["/var/log/test.log"]

[transforms.ftest]
type = "filter"
inputs = ["filetest"]
condition = "match!(.message, r'.*WebApplicationContext*.')"


[transforms.remap_alert_udev]
type = "remap"
inputs = ["ftest"]
source = """
. = parse_json!(.message) 
. = [
  {
      "labels": {
          "alertname": .fields.podname,
          "namespace": .fields.namespace,
          "environment": .fields.environment,
          "podname": .fields.podname,
          "nodename": .fields.nodename,
          "topic": .fields.topic,
          "body": .body,
          "severity": "critical"
      }
  }
]
"""


[sinks.sink0]
inputs = ["remap_alert_*"]
target = "stdout"
type = "console"
[sinks.sink0.encoding]
codec = "json"

[sinks.alertmanager]
type = "http"
inputs = ["remap_alert_*"]
uri = "http://172.16.100.151:9093/api/v1/alerts"
compression = "none"
encoding.codec = "json"
acknowledgements.enabled = true

Start vector:

[root@master-01 ~/vector]#  vector -c vector.toml
2023-08-05T06:42:30.336918Z  INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=info,rdkafka=info,buffers=info,lapin=info,kube=info"
2023-08-05T06:42:30.337720Z  INFO vector::app: Loading configs. paths=["vector.toml"]
2023-08-05T06:42:30.355841Z  INFO vector::topology::running: Running healthchecks.
2023-08-05T06:42:30.355886Z  INFO vector::topology::builder: Healthcheck passed.
2023-08-05T06:42:30.355907Z  INFO vector::topology::builder: Healthcheck passed.
2023-08-05T06:42:30.355930Z  INFO vector: Vector has started. debug="false" version="0.31.0" arch="x86_64" revision="0f13b22 2023-07-06 13:52:34.591204470"
2023-08-05T06:42:30.355940Z  INFO source{component_kind="source" component_id=filetest component_type=file component_name=filetest}: vector::sources::file: Starting file server. include=["/var/log/test.log"] exclude=[]
2023-08-05T06:42:30.356284Z  INFO source{component_kind="source" component_id=filetest component_type=file component_name=filetest}:file_server: file_source::checkpointer: Loaded checkpoint data.
2023-08-05T06:42:30.356411Z  INFO source{component_kind="source" component_id=filetest component_type=file component_name=filetest}:file_server: vector::internal_events::file::source: Resuming to watch file. file=/var/log/test.log file_position=4068
2023-08-05T06:42:30.356959Z  INFO vector::internal_events::api: API server running. address=0.0.0.0:8686 playground=http://0.0.0.0:8686/playground

Manually append a log line:

[root@master-01 ~]# echo '{"body":"2023-08-02T00:18:34.866228161+08:00 stdouts.b.w.embedded.tomcat.TomcatWebServer WebApplicationContext","fields":{"containername":"java-demo","environment":"dev","logconfig":"java-demo","namespace":"linuxea-dev","nodename":"172.16.100.83","podname":"production-java-demo-5cf5b97645-tsmxx","topic":"java-demo"}}' >> /var/log/test.log

If everything is working, the log line is printed to the console and also sent to alertmanager:

{"labels":{"alertname":"production-java-demo-5cf5b97645-tsmxx","body":"2023-08-02T00:18:34.866228161+08:00 stdouts.b.w.embedded.tomcat.TomcatWebServer WebApplicationContext","environment":"dev","namespace":"linuxea-dev","nodename":"172.16.100.83","podname":"production-java-demo-5cf5b97645-tsmxx","severity":"critical","topic":"java-demo"}}
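To check the filter and remap logic outside of vector, the two steps can be mirrored in a few lines of Python. This is only a sketch of the same idea, not VRL. The function name to_alert is hypothetical, missing fields become None instead of VRL's null, and invalid JSON raises an exception where parse_json! would abort the event.

```python
import json
import re
from typing import Any, Dict, Optional

# Same pattern as the filter condition in vector.toml.
KEYWORD = re.compile(r".*WebApplicationContext*.")


def to_alert(line: str) -> Optional[Dict[str, Any]]:
    """Mimic the filter and remap steps for one raw log line."""
    if not KEYWORD.search(line):
        return None  # the filter transform would drop this line
    event = json.loads(line)
    fields = event.get("fields", {})
    return {
        "labels": {
            "alertname": fields.get("podname"),
            "namespace": fields.get("namespace"),
            "environment": fields.get("environment"),
            "podname": fields.get("podname"),
            "nodename": fields.get("nodename"),
            "topic": fields.get("topic"),
            "body": event.get("body"),
            "severity": "critical",
        }
    }
```

Feeding it the test line appended to /var/log/test.log should give the same labels as the console output above, which makes it a quick way to try a new keyword or field mapping before reloading vector.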

image-20230805144825228.png

alertmanager has received an alert for the matched log line:

image-20230805144421730.png

The alert has been delivered to alertmanager, and from there you can route it anywhere you like.
