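Log-based alerting is a problem you cannot really avoid. Whatever the situation, being able to see the errors in your application logs helps you find and locate problems early.
In the past, the usual approaches were regex matching inside Logstash if conditionals, a third-party tool that reads from Elasticsearch, or alert rules in Grafana.
Alibaba Cloud and Tencent Cloud also provide log filtering, with multi-stage processing built in.
In a traditional ELK-style stack, Fluentd can take on this job as well. Among newer open-source tools, these are gradually being displaced by Alibaba's iLogtail, NetEase's loggie-io, and Datadog's Vector.
Vector is written in Rust and processes and consumes logs faster than Logstash. In this post I share how to configure and debug Vector so that a keyword in a log line triggers an alert. Logstash can count repeated log lines and silence them, whereas Vector only filters and forwards, so Alertmanager takes over that part.
Before we start, we need to understand how Alertmanager receives alerts: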
Alertmanager
Installing Alertmanager
Here is a sample config.yml:
mkdir -p /data/alertmanager
cat > /data/alertmanager/config.yml << 'EOF'
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 24h
  # Default receiver; it must be one of the names defined under receivers
  receiver: 'webhookw'
  routes:
    - receiver: 'webhooke'
      group_by: ['alertname', 'instance']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 24h
      match:
        severity: 'critical'
    - receiver: 'webhookw'
      group_by: ['alertname', 'instance']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 24h
      # match compares label values exactly; use match_re for a regex
      match:
        severity: 'warning'
receivers:
  - name: 'webhookw'
    webhook_configs:
      - send_resolved: true
        url: 'http://webhook-dingtalk:8060/dingtalk/webhookw/send'
  - name: 'webhooke'
    webhook_configs:
      - send_resolved: true
        url: 'http://webhook-dingtalk:8060/dingtalk/webhooke/send'
inhibit_rules:
  - source_match:
      severity: 'critical'
    # Several alert names need the regex form; source_match is an exact comparison
    source_match_re:
      alertname: 'node_host_lost|PodMemoryUsage'
    target_match:
      severity: 'warning'
    equal: ['ltype']
EOF
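To make the routing part of this config easier to reason about, here is a small Python model of it. It is only a sketch of the intended behaviour (critical goes to webhooke, warning goes to webhookw, everything else falls back to the root receiver), not Alertmanager's own code. The names CHILD_ROUTES and pick_receiver are made up for illustration, and the default receiver is passed in as a parameter because it depends on how you define the root route.

```python
from typing import Dict, List, Tuple

# Hypothetical mirror of the two child routes: (exact-match labels, receiver name).
CHILD_ROUTES: List[Tuple[Dict[str, str], str]] = [
    ({"severity": "critical"}, "webhooke"),
    ({"severity": "warning"}, "webhookw"),
]


def pick_receiver(labels: Dict[str, str], default_receiver: str) -> str:
    """Return the receiver of the first child route whose labels all match exactly.

    Child routes are checked in order; if none matches, the alert stays with
    the root route and therefore goes to the default receiver.
    """
    for matchers, receiver in CHILD_ROUTES:
        if all(labels.get(name) == value for name, value in matchers.items()):
            return receiver
    return default_receiver
```

With this model, an alert labelled severity=critical lands on webhooke, while an alert with no severity label falls through to whatever the root receiver is.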
docker-compose
version: "2.2"
services:
  alertmanager:
    container_name: alertmanager
    restart: always
    image: registry.cn-zhangjiakou.aliyuncs.com/marksugar-k8s/alertmanager:v0.24.0
    ports:
      - "9093:9093"
    volumes:
      - /etc/localtime:/etc/localtime:ro # use the host time zone
      # The upstream prom/alertmanager image reads /etc/alertmanager/alertmanager.yml
      # by default; if this image does the same, mount to that path or pass --config.file.
      - /data/alertmanager/config.yml:/etc/alertmanager/config.yml
    logging:
      driver: "json-file"
      options:
        max-size: "100M"
    mem_limit: 4096m
To send an alert to Alertmanager, the payload has to follow this format:
[
  {
    "labels": {
      "alertname": "name1",
      "dev": "sda1",
      "instance": "example3",
      "severity": "warning"
    }
  }
]
For example:
alerts1='[
  {
    "labels": {
      "alertname": "name1",
      "dev": "sda1",
      "instance": "example3",
      "severity": "warning"
    }
  }
]'
curl -XPOST -d"$alerts1" http://172.16.100.151:9093/api/v1/alerts
The call returns success:
[root@master-01 /var/log]# curl -XPOST -d"$alerts1" http://172.16.100.151:9093/api/v1/alerts
{"status":"success"}
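The same request can be sketched in Python using only the standard library. The function names build_alerts and post_alerts are hypothetical, and the path /api/v1/alerts is the one used in this post with Alertmanager v0.24.0. Newer Alertmanager releases (0.27 and later) removed API v1, so there you would most likely need /api/v2/alerts instead.

```python
import json
import urllib.request
from typing import Dict


def build_alerts(labels: Dict[str, str]) -> bytes:
    """Wrap one label set in the list-of-alerts JSON body shown above."""
    if not labels:
        # Sanity check only: an alert without any labels carries no information.
        raise ValueError("at least one label is required")
    return json.dumps([{"labels": labels}]).encode("utf-8")


def post_alerts(base_url: str, body: bytes, path: str = "/api/v1/alerts") -> str:
    """POST the body to Alertmanager and return the raw response text."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.read().decode("utf-8")
```

Calling post_alerts("http://172.16.100.151:9093", build_alerts({...})) with the four labels from the curl example is meant to be equivalent to the curl command above.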
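The alert can now be seen in the Alertmanager web UI.
Vector
Now that we know how Alertmanager receives alerts, we take the following pieces from the official Vector configuration and debug them: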
- Configuration notes
[sources.filetest]: where the data comes from
[transforms.ftest]: filters the data
[transforms.remap_alert_udev]: remaps the data; this plays the role that grok played in Logstash, and is more powerful than grok
condition = "match!(.message, r'WebApplicationContext')"
This keeps only the log lines that contain the keyword WebApplicationContext. The remap step then parses the line as JSON and reassembles it into the Alertmanager data format:
source = """
. = parse_json!(.message)
. = [
  {
    "labels": {
      "alertname": .fields.podname,
      "namespace": .fields.namespace,
      "environment": .fields.environment,
      "podname": .fields.podname,
      "nodename": .fields.nodename,
      "topic": .fields.topic,
      "body": .body,
      "severity": "critical"
    }
  }
]
"""
For printing debug output to the console:
[sinks.sink0]
inputs = ["remap_alert_*"]
target = "stdout"
type = "console"
[sinks.sink0.encoding]
codec = "json"
For sending to Alertmanager:
[sinks.alertmanager]
type = "http"
inputs = ["remap_alert_*"]
uri = "http://172.16.100.151:9093/api/v1/alerts"
compression = "none"
encoding.codec = "json"
acknowledgements.enabled = true
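To check the filter and remap logic before running Vector, the two steps can be modelled in a few lines of Python. This is not VRL and not how Vector runs it internally; it is a rough equivalent under the assumption that each log line is a JSON object with a body field and a fields object, as in the test line used later in this post. The names KEYWORD and to_alert are hypothetical.

```python
import json
import re
from typing import Any, Dict, Optional

# Same keyword the filter transform looks for.
KEYWORD = re.compile(r"WebApplicationContext")


def to_alert(line: str) -> Optional[Dict[str, Any]]:
    """Model of the filter + remap steps for a single raw log line.

    Returns None when the line does not contain the keyword (the filter drops
    it); otherwise returns one alert object in the Alertmanager label format.
    """
    if not KEYWORD.search(line):
        return None
    event = json.loads(line)  # corresponds to parse_json!(.message)
    fields = event.get("fields", {})
    return {
        "labels": {
            "alertname": fields.get("podname"),
            "namespace": fields.get("namespace"),
            "environment": fields.get("environment"),
            "podname": fields.get("podname"),
            "nodename": fields.get("nodename"),
            "topic": fields.get("topic"),
            "body": event.get("body"),
            "severity": "critical",  # fixed value, as in the remap source
        }
    }
```

Feeding a sample log line through to_alert shows which labels the alert will carry, which makes it easier to spot a wrong field path before it reaches Alertmanager.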
The final vector.toml looks like this:
[api]
enabled = true
address = "0.0.0.0:8686"
[sources.filetest]
type = "file"
include = ["/var/log/test.log"]
[transforms.ftest]
type = "filter"
inputs = ["filetest"]
condition = "match!(.message, r'WebApplicationContext')"
[transforms.remap_alert_udev]
type = "remap"
inputs = ["ftest"]
source = """
. = parse_json!(.message)
. = [
  {
    "labels": {
      "alertname": .fields.podname,
      "namespace": .fields.namespace,
      "environment": .fields.environment,
      "podname": .fields.podname,
      "nodename": .fields.nodename,
      "topic": .fields.topic,
      "body": .body,
      "severity": "critical"
    }
  }
]
"""
[sinks.sink0]
inputs = ["remap_alert_*"]
target = "stdout"
type = "console"
[sinks.sink0.encoding]
codec = "json"
[sinks.alertmanager]
type = "http"
inputs = ["remap_alert_*"]
uri = "http://172.16.100.151:9093/api/v1/alerts"
compression = "none"
encoding.codec = "json"
acknowledgements.enabled = true
Start Vector:
[root@master-01 ~/vector]# vector -c vector.toml
2023-08-05T06:42:30.336918Z INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=info,rdkafka=info,buffers=info,lapin=info,kube=info"
2023-08-05T06:42:30.337720Z INFO vector::app: Loading configs. paths=["vector.toml"]
2023-08-05T06:42:30.355841Z INFO vector::topology::running: Running healthchecks.
2023-08-05T06:42:30.355886Z INFO vector::topology::builder: Healthcheck passed.
2023-08-05T06:42:30.355907Z INFO vector::topology::builder: Healthcheck passed.
2023-08-05T06:42:30.355930Z INFO vector: Vector has started. debug="false" version="0.31.0" arch="x86_64" revision="0f13b22 2023-07-06 13:52:34.591204470"
2023-08-05T06:42:30.355940Z INFO source{component_kind="source" component_id=filetest component_type=file component_name=filetest}: vector::sources::file: Starting file server. include=["/var/log/test.log"] exclude=[]
2023-08-05T06:42:30.356284Z INFO source{component_kind="source" component_id=filetest component_type=file component_name=filetest}:file_server: file_source::checkpointer: Loaded checkpoint data.
2023-08-05T06:42:30.356411Z INFO source{component_kind="source" component_id=filetest component_type=file component_name=filetest}:file_server: vector::internal_events::file::source: Resuming to watch file. file=/var/log/test.log file_position=4068
2023-08-05T06:42:30.356959Z INFO vector::internal_events::api: API server running. address=0.0.0.0:8686 playground=http://0.0.0.0:8686/playground
Manually append a log line:
[root@master-01 ~]# echo '{"body":"2023-08-02T00:18:34.866228161+08:00 stdouts.b.w.embedded.tomcat.TomcatWebServer WebApplicationContext","fields":{"containername":"java-demo","environment":"dev","logconfig":"java-demo","namespace":"linuxea-dev","nodename":"172.16.100.83","podname":"production-java-demo-5cf5b97645-tsmxx","topic":"java-demo"}}' >> /var/log/test.log
If everything is working, the event is printed to the console after the startup log and is also sent to Alertmanager:
{"labels":{"alertname":"production-java-demo-5cf5b97645-tsmxx","body":"2023-08-02T00:18:34.866228161+08:00 stdouts.b.w.embedded.tomcat.TomcatWebServer WebApplicationContext","environment":"dev","namespace":"linuxea-dev","nodename":"172.16.100.83","podname":"production-java-demo-5cf5b97645-tsmxx","severity":"critical","topic":"java-demo"}}
Alertmanager has now received an alert for the matching log line.
With the alert delivered to Alertmanager, you can route it on to wherever you need.
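If you prefer to confirm this without opening the UI, one option is to list the active alerts over HTTP and look for the expected alertname. The sketch below assumes the v2 API path /api/v2/alerts is available on your Alertmanager and that each returned alert has a labels object. The helper names fetch_alerts and with_alertname are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, List


def fetch_alerts(base_url: str) -> List[Dict[str, Any]]:
    """GET the list of alerts from Alertmanager (assumes the v2 API)."""
    url = base_url.rstrip("/") + "/api/v2/alerts"
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def with_alertname(alerts: List[Dict[str, Any]], name: str) -> List[Dict[str, Any]]:
    """Keep only the alerts whose alertname label equals the given name."""
    return [a for a in alerts if a.get("labels", {}).get("alertname") == name]
```

For the test line used here, with_alertname(fetch_alerts("http://172.16.100.151:9093"), "production-java-demo-5cf5b97645-tsmxx") should be non-empty once Vector has delivered the event.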