我正在尝试创建一个观察者警报,当节点上的某个进程在过去一小时内使用超过 0.95% 的 CPU 时将触发该警报。
这是我的配置示例:
{
"trigger": {
"schedule": {
"interval": "10m"
}
},
"input": {
"search": {
"request": {
"search_type": "query_then_fetch",
"indices": [
"metricbeat*"
],
"types": [],
"body": {
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"system.process.cpu.total.norm.pct": {
"gte": 0.95
}
}
},
{
"range": {
"system.process.cpu.start_time": {
"gte": "now-1h"
}
}
},
{
"match": {
"environment": "test"
}
}
]
}
}
}
}
}
},
"condition": {
"compare": {
"ctx.payload.hits.total": {
"gt": 0
}
}
},
"actions": {
"send-to-slack": {
"throttle_period_in_millis": 1800000,
"webhook": {
"scheme": "https",
"host": "hooks.slack.com",
"port": 443,
"method": "post",
"path": "{{ctx.metadata.onovozhylov-test}}",
"params": {},
"headers": {
"Content-Type": "application/json"
},
"body": "{ \"text\": \" ==========\nTest parameters:\n\tthrottle_period_in_millis: 60000\n\tInterval: 1m\n\tcpu.total.norm.pct: 0.5\n\tcpu.start_time: now-1m\n\nThe watcher:*{{ctx.watch_id}}* in env:*{{ctx.metadata.env}}* found that the process *{{ctx.system.process.name}}* has been utilizing CPU over 95% for the past 1 hr on node:\n{{#ctx.payload.nodes}}\t{{.}}\n\n{{/ctx.payload.nodes}}\n\nThe runbook entry is here: *{{ctx.metadata.runbook}}* \"}"
}
}
},
"metadata": {
"onovozhylov-test": "/services/T0U0CFMT4/BBK1A2AAH/MlHAF2QuPjGZV95dvO11111111",
"env": "{{ grains.get('environment') }}",
"runbook": "http://mytest.com"
}
}
当我设置 metric 时,这个 Watcher 不起作用system.process.cpu.start_time
。也许这个指标不正确......不幸的是,我没有与 Watcher 相关的经验来自己解决这个问题。
另一个问题是我不知道如何将其添加system.process.name
到消息正文中。
提前感谢您的帮助!