3

I am trying to create an alert in Datadog that notifies us when disk performance slows down our machines.

As a business requirement, I would say that the alert should trigger if I/O is almost saturated (over 90%) for more than 30 minutes.

Here is the current set of metrics that are recorded:

sys.cpu.iowait
system.io.avg_q_sz
system.io.avg_rq_sz
system.io.await
system.io.r_await
system.io.r_s
system.io.rkb_s
system.io.rrqm_s
system.io.svctm
system.io.util
system.io.w_await
system.io.w_s
system.io.wkb_s
system.io.wrqm_s

Any formula can be used to combine these metrics, including SUM and AVG values.

4

1 Answer

10

These system.io metrics are reported by the Agent's system check, which uses iostat in the background.

According to the iostat man page, one of these metrics, %util (reported as system.io.util in Datadog), seems to do the job:

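%util: Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.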
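To make the definition concrete, here is a minimal sketch of how a %util figure can be derived from two samples of a device's busy-time counter. It assumes a Linux host where the "time spent doing I/Os" counter (in milliseconds) is read from /proc/diskstats at the start and end of an interval; the function and parameter names are hypothetical and are not part of iostat or Datadog.

```python
def percent_util(io_ticks_start_ms, io_ticks_end_ms, interval_ms):
    """Share of the interval during which the device was busy with I/O.

    io_ticks_*_ms: cumulative busy-time counter for the device, in ms
                   (assumed to come from /proc/diskstats).
    interval_ms:   wall-clock time between the two samples, in ms.
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    busy_ms = io_ticks_end_ms - io_ticks_start_ms
    # Cap at 100 because timer jitter can push the raw ratio slightly over.
    return min(100.0, 100.0 * busy_ms / interval_ms)


# A device that was busy for 900 ms out of a 1000 ms interval is 90% utilized.
print(percent_util(1000, 1900, 1000))
```

A value that stays near 100 over many consecutive intervals is what the man page describes as saturation.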

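You can create a monitor, set up as a multi-alert over host/device, that triggers when this metric averages above 90 over the last 30 minutes. Here is a current screenshot of such an example: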

[Screenshot: example monitor in Datadog]
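For readers who cannot see the screenshot, the same monitor can be sketched as a definition in Datadog's monitor query syntax. This is only an illustration under assumptions: the helper name, the monitor name, the message text, and the `{*}` scope are made up for the example, and the payload is only built here, not sent to any API.

```python
import json


def build_io_util_monitor(threshold=90, window="last_30m", scope="*"):
    """Build a metric-alert definition for sustained high disk utilization.

    The query averages system.io.util over the window and is grouped by
    host and device, so each host/device pair alerts on its own (multi-alert).
    """
    query = (
        f"avg({window}):avg:system.io.util{{{scope}}} "
        f"by {{host,device}} > {threshold}"
    )
    return {
        "name": "Disk I/O utilization is high",  # hypothetical name
        "type": "metric alert",
        "query": query,
        "message": "Disk I/O utilization averaged above the threshold.",
        "options": {"thresholds": {"critical": threshold}},
    }


monitor = build_io_util_monitor()
print(monitor["query"])
print(json.dumps(monitor, indent=2))
```

Grouping by host and device matters here: without it, the average is taken across all disks, and a single saturated device can be hidden by idle ones.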

Of course, other iostat metrics can also be monitored to identify other failure modes of I/O performance.

Answered 2016-02-19T15:47:27.350