I am trying to create an alert in Datadog that notifies us when disk performance slows down our machines.
The business requirement is roughly this: if I/O is almost saturated (over 90%) for more than 30 minutes, the alert should be triggered.
Here is the current set of metrics that are recorded:
sys.cpu.iowait
system.io.avg_q_sz
system.io.avg_rq_sz
system.io.await
system.io.r_await
system.io.r_s
system.io.rkb_s
system.io.rrqm_s
system.io.svctm
system.io.util
system.io.w_await
system.io.w_s
system.io.wkb_s
system.io.wrqm_s
Any formula can be used to combine these metrics, including SUM and AVG values.
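
To make the requirement concrete, here is a minimal sketch in plain Python of the rule I have in mind. It is not Datadog code. It assumes that system.io.util is reported as a percentage from 0 to 100 and that samples arrive as (timestamp in seconds, value) pairs; the function name io_saturated and its parameters are hypothetical and only for illustration.

```python
from typing import Iterable, Tuple

THRESHOLD_PCT = 90.0      # "almost saturated" per the requirement above
WINDOW_SECONDS = 30 * 60  # 30 minutes


def io_saturated(
    samples: Iterable[Tuple[float, float]],
    now: float,
    threshold: float = THRESHOLD_PCT,
    window: float = WINDOW_SECONDS,
) -> bool:
    """Return True if every sample in the last `window` seconds is above `threshold`.

    `samples` is an iterable of (timestamp_seconds, util_percent) pairs
    (an assumed format). Sparse or missing data is not handled here:
    an empty window simply returns False.
    """
    recent = [value for ts, value in samples if now - window <= ts <= now]
    if not recent:
        return False
    # A single dip to or below the threshold means the disk was not
    # saturated for the whole window, so the alert should not fire.
    return all(value > threshold for value in recent)


# One sample per minute for 30 minutes, all at 95% utilisation.
busy = [(t, 95.0) for t in range(0, 1801, 60)]
# Same series, but with one dip to 40% in the middle.
dipped = [(t, 40.0 if t == 900 else 95.0) for t in range(0, 1801, 60)]

print(io_saturated(busy, now=1800))
print(io_saturated(dipped, now=1800))
```

In other words, I want the minimum value over the 30-minute window to stay above 90, rather than the average, so that short bursts do not trigger the alert. What I do not know is which of the metrics above, or which combination of them, is the right input for this check.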