I have set up Prometheus, Alertmanager (a single cluster), Pushgateway, and a webhook-service in Kubernetes.
I have included the configuration details and logs below; apologies for the long post.
I am new to the Prometheus toolset. Could you help me understand why the alertmanager_cluster_messages_queued metric keeps increasing even though Alertmanager successfully notifies the webhook-service of the alerts?
Webhook service URL: http://webhook-svc:8085/event/webhook
The services service1, service2, and service3 push their metrics to Prometheus through the Pushgateway every 1 minute, because their working cycles are very short.
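For illustration, one push cycle looks roughly like the sketch below. This is only a minimal sketch, assuming the Python prometheus_client library (the real services may be implemented differently); the counter name, the service_metrics_job job label, and the instance grouping key are copied from the alert rules and logs further down.

# Minimal sketch of one push cycle (assumed, not the real service code).
import time
from prometheus_client import CollectorRegistry, Counter, push_to_gateway

registry = CollectorRegistry()
# Note: prometheus_client appends a _total suffix on exposition, so with this
# exact client the scraped series would be
# service_forecasts_published_counter_total; the real services apparently
# expose service_forecasts_published_counter directly.
published = Counter(
    'service_forecasts_published_counter',
    'Number of forecasts published',
    ['module_name'],
    registry=registry,
)

while True:
    # ... do one short unit of work, then record it ...
    published.labels(module_name='service1').inc()

    # Push to the Pushgateway that Prometheus scrapes (honor_labels: true),
    # grouped by job and instance as the alert rules expect.
    push_to_gateway(
        'prometheus-pushgateway:9091',
        job='service_metrics_job',
        grouping_key={'instance': 'service1'},
        registry=registry,
    )
    time.sleep(60)  # push every 1 minute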
There are alert rules; alerts fire based on the following conditions:
- If there is no metric value in the past 5 minutes, an alert with status firing is triggered.
- If the service recovers and pushes metric values within the past 5 minutes, an alert with status resolved is triggered.
The alert rules have multiple conditions because we want a stability filter; the rules are included in the configuration below.
Below are the Prometheus and Alertmanager configurations.
Prometheus ConfigMap:
{
"alerting_rules.yml": "groups:
  # All events ids are mapped to event name in webhook service
  # 30 is event id of ESET_FORECASTS_NOT_PUBLISHED
  - alert: 30
    expr: rate(service_forecasts_published_counter{job=\"service_metrics_job\", module_name=\"service1\"}[5m]) <= 0 or changes(service_forecasts_published_counter{job=\"service_metrics_job\"}[5m]) < 4 and on(instance) max_over_time(ALERTS{alertname=\"30\",alertstate=\"firing\",job=\"service_metrics_job\"}[5m]) == 1
    for: 1m
  # 60 is event id of ESET_SCHEDULES_NOT_PUBLISHED
  - alert: 60
    expr: rate(service_opt_schedules_published_counter{job=\"service_metrics_job\", module_name=\"service2\"}[5m]) <= 0 or changes(service_opt_schedules_published_counter{job=\"service_metrics_job\"}[5m]) < 4 and on(instance) max_over_time(ALERTS{alertname=\"60\",alertstate=\"firing\",job=\"service_metrics_job\"}[5m]) == 1
    for: 1m
  # 80 is event id of ESET_NO_SCHEDULES_RECEIVED
  - alert: 80
    expr: rate(service_opt_schedules_received_counter{job=\"service_metrics_job\", module_name=\"service3\"}[5m]) <= 0 or changes(service_opt_schedules_received_counter{job=\"service_metrics_job\"}[5m]) < 4 and on(instance) max_over_time(ALERTS{alertname=\"80\",alertstate=\"firing\",job=\"service_metrics_job\"}[5m]) == 1
    for: 1m
  # 90 is event id of ESET_COMMANDS_NOT_PUBLISHED
  - alert: 90
    expr: rate(service_commands_issued_counter{job=\"service_metrics_job\", module_name=\"service3\"}[5m]) <= 0 or changes(service_commands_issued_counter{job=\"service_metrics_job\"}[5m]) < 4 and on(instance) max_over_time(ALERTS{alertname=\"90\",alertstate=\"firing\",job=\"service_metrics_job\"}[5m]) == 1
    for: 1m
",
"prometheus.yml": "global:
  evaluation_interval: 1m
  scrape_interval: 1m
  scrape_timeout: 10s
rule_files:
  - /etc/config/recording_rules.yml
  - /etc/config/alerting_rules.yml
  - /etc/config/rules
  - /etc/config/alerts
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090
  - honor_labels: true
    job_name: pushgateway
    static_configs:
      - targets:
          - prometheus-pushgateway:9091
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
            - prometheus-alertmanager:9093
"
}
Alertmanager ConfigMap:
{
"alertmanager.yml": "global: {}
route:
  receiver: slack # Fallback.
  group_wait: 30s
  group_interval: 5m
  group_by: ['job', 'instance']
  routes:
    - match:
        severity: page
      receiver: slack
      continue: true
    - match_re:
        job: service_metrics_job
      receiver: webhook
receivers:
  - name: webhook
    webhook_configs:
      - send_resolved: true
        url: 'http://webhook-svc:8085/event/webhook'
  - name: pagerduty
    pagerduty_configs:
      - service_key: d1a80e7400dd432ca3b4bab6d46c306d
"
}
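For reference, Alertmanager POSTs a JSON payload to the webhook on every notification. The snippet below is not the actual webhook-service (which, as the logs further down show, is a Java/Spring application); it is only a minimal Python stand-in, using the port 8085 and path /event/webhook from the URL above, to illustrate the fields the payload carries.

# Minimal stand-in for the webhook receiver, only to illustrate the payload
# Alertmanager POSTs; the real webhook-service is a Java/Spring application.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != '/event/webhook':
            self.send_response(404)
            self.end_headers()
            return
        length = int(self.headers.get('Content-Length', 0))
        payload = json.loads(self.rfile.read(length))
        # One POST per alert group: "status" is firing or resolved,
        # "alerts" lists the individual alerts with their labels.
        for alert in payload.get('alerts', []):
            print('EventId: %s, Status: %s, Instance: %s' % (
                alert['labels'].get('alertname'),   # e.g. 30, 60, 80, 90
                alert['status'],                    # firing / resolved
                alert['labels'].get('instance'),    # e.g. service1
            ))
        self.send_response(200)
        self.end_headers()

if __name__ == '__main__':
    HTTPServer(('', 8085), WebhookHandler).serve_forever()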
Alertmanager receives the alerts from Prometheus and notifies the webhook service.
Alertmanager logs showing the alerts received from Prometheus:
level=debug ts=2022-02-09T05:50:58.193Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=130[0b8a234][active]
level=debug ts=2022-02-09T05:51:28.193Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{job=~\"^(?:service_metrics_job)$\"}:{instance=\"service1\", job=\"service_metrics_job\"}" msg=flushing alerts=[130[0b8a234][active]]
level=debug ts=2022-02-09T05:51:30.165Z caller=notify.go:685 component=dispatcher receiver=webhook integration=webhook[0] msg="Notify success" attempts=1
level=debug ts=2022-02-09T05:51:58.191Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=80[deb2ec8][active]
level=debug ts=2022-02-09T05:52:28.191Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{job=~\"^(?:service_metrics_job)$\"}:{instance=\"service3\", job=\"service_metrics_job\"}" msg=flushing alerts=[80[deb2ec8][active]]
level=debug ts=2022-02-09T05:52:29.069Z caller=notify.go:685 component=dispatcher receiver=webhook integration=webhook[0] msg="Notify success" attempts=1
Webhook service logs showing the received alerts:
2022-02-09 05:51:28.973 INFO 1 --- [ttp@4564e94b-13] n.g.l.web.EventsController : EventId: 30, Status: firing, Resource: service1
2022-02-09 05:51:28.979 INFO 1 --- [ttp@4564e94b-13] n.g.l.processor.StatusProcessor : Received event from: service1, contents: {"@dto":"Event","timestamp":1644385858183,"resource":"m/service1/s","category":"ESET_FORECASTS_NOT_PUBLISHED","message":"","reconstructed":false}
2022-02-09 05:51:29.575 INFO 1 --- [ttp@4564e94b-13] n.g.l.processor.StatusProcessor : Generating a start alert with start 1644385858183, end 9223372036854775807, resource m/service1 and category ESET_FORECASTS_NOT_PUBLISHED
2022-02-09 05:52:28.564 INFO 1 --- [ttp@4564e94b-19] n.g.l.web.EventsController : EventId: 80, Status: firing, Resource: service3
2022-02-09 05:52:28.564 INFO 1 --- [ttp@4564e94b-19] n.g.l.processor.StatusProcessor : Received event from:service3, contents: {"@dto":"Event","timestamp":1644385918183,"resource":"m/service3/s","category":"ESET_NO_SCHEDULES_RECEIVED","message":"","reconstructed":false}
2022-02-09 05:52:28.982 INFO 1 --- [ttp@4564e94b-19] n.g.l.processor.StatusProcessor : Generating a start alert with start 1644385918183, end 9223372036854775807, resource service3 and category ESET_NO_SCHEDULES_RECEIVED
Prometheus server logs:
level=info ts=2022-02-09T07:00:23.729Z caller=compact.go:507 component=tsdb msg="write block" mint=1644379200000 maxt=1644386400000 ulid=01FVEMH0FXW6FMT82A0TZNSNT2 duration=51.478298ms
level=info ts=2022-02-09T07:00:23.732Z caller=head.go:880 component=tsdb msg="Head GC completed" duration=2.173969ms
level=info ts=2022-02-09T07:00:23.740Z caller=checkpoint.go:95 component=tsdb msg="Creating checkpoint" from_segment=16 to_segment=17 mint=1644386400000
level=info ts=2022-02-09T07:00:23.757Z caller=head.go:977 component=tsdb msg="WAL checkpoint complete" first=16 last=17 duration=16.724111ms
level=info ts=2022-02-09T09:00:33.483Z caller=compact.go:507 component=tsdb msg="write block" mint=1644386423669 maxt=1644393600000 ulid=01FVEVD18EE5NJA055ZC1MQV17 duration=61.065659ms
level=info ts=2022-02-09T09:00:33.486Z caller=head.go:880 component=tsdb msg="Head GC completed" duration=1.699773ms
However, the **alertmanager_cluster_messages_queued** count keeps increasing every time an alert is notified to the webhook. The growth of alertmanager_cluster_messages_queued eventually exceeds the maximum queue size (4096), and after the maximum queue size is reached some alerts are no longer delivered to the webhook-service.
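For completeness, a small polling sketch like the one below can be used to watch the value over time. It is only an assumption-laden helper, not part of the setup: it polls Alertmanager's standard /metrics endpoint (the address prometheus-alertmanager:9093 is taken from prometheus.yml above and must be reachable from wherever the script runs) and prints the alertmanager_cluster_messages_queued sample.

# Watch alertmanager_cluster_messages_queued by polling Alertmanager's
# /metrics endpoint (address from prometheus.yml; adjust if exposed differently).
import time
import urllib.request

METRICS_URL = 'http://prometheus-alertmanager:9093/metrics'

def queued_messages():
    with urllib.request.urlopen(METRICS_URL) as resp:
        for line in resp.read().decode().splitlines():
            # Sample lines look like: "alertmanager_cluster_messages_queued 42"
            if line.startswith('alertmanager_cluster_messages_queued'):
                return float(line.split()[1])
    return None

while True:
    print(time.strftime('%H:%M:%S'), queued_messages())
    time.sleep(60)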