CloudWatch alarm monitoring the API Gateway error rate does not treat missing data points as notBreaching
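I want to trigger an alarm when the error rate is > 25% over a 5-minute interval.
Alarm details:
Period: 1 minute
Datapoints to alarm: 3 out of 5
Missing data treatment: Treat missing data as good (not breaching threshold)
I noticed that the CloudWatch alarm was triggered for the following reason: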
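Threshold Crossed: 3 out of the last 5 datapoints [100.0 (27/05/21 21:56:00), 100.0 (27/05/21 21:54:00), 100.0 (27/05/21 21:49:00)] were greater than or equal to the threshold (25.0) and 2 missing datapoints were treated as [NonBreaching] (minimum 3 datapoints for OK -> ALARM transition).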
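I expected a data point to be evaluated for every minute, i.e. 27/05/21 21:50:00, 27/05/21 21:51:00, 27/05/21 21:52:00, 27/05/21 21:53:00 and 27/05/21 21:55:00 should be marked as good. So the most recent 5 data points should be:
27/05/21 21:56:00 : ALARM
27/05/21 21:55:00 : OK (missing data treated as notBreaching)
27/05/21 21:54:00 : ALARM
27/05/21 21:53:00 : OK (missing data treated as notBreaching)
27/05/21 21:52:00 : OK (missing data treated as notBreaching)
Of the most recent 5 data points, only 2 should be in the ALARM state, so the final result should not trigger the alarm.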
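To make that expectation concrete, here is a minimal Python sketch of the evaluation I had in mind. It only models my assumption (one slot per minute, missing slots counted as not breaching) and is not a description of how CloudWatch actually evaluates alarms. The names `reported` and `expected_state` are made up for illustration, and the timestamps assume the dates above are in dd/mm/yy format.

```python
from datetime import datetime, timedelta

THRESHOLD = 25.0          # alarm threshold in percent
EVALUATION_PERIODS = 5    # "5" in "3 out of 5"
DATAPOINTS_TO_ALARM = 3   # "3" in "3 out of 5"

# The only data points that exist, taken from the alarm reason text.
# Assumption: 27/05/21 means 27 May 2021.
reported = {
    datetime(2021, 5, 27, 21, 49): 100.0,
    datetime(2021, 5, 27, 21, 54): 100.0,
    datetime(2021, 5, 27, 21, 56): 100.0,
}


def expected_state(latest, datapoints):
    """Check the last EVALUATION_PERIODS one-minute slots ending at `latest`.

    A slot with no data point is counted as not breaching.
    Returns the resulting state and the number of breaching slots.
    """
    breaching = 0
    for i in range(EVALUATION_PERIODS):
        slot = latest - timedelta(minutes=i)
        value = datapoints.get(slot)  # None means the slot has no data
        if value is not None and value >= THRESHOLD:
            breaching += 1
    state = "ALARM" if breaching >= DATAPOINTS_TO_ALARM else "OK"
    return state, breaching


state, breaching = expected_state(datetime(2021, 5, 27, 21, 56), reported)
print(state, breaching)
```

Under this model only 21:56 and 21:54 breach within the 21:52 to 21:56 window, so the alarm would stay in OK, which is not what CloudWatch reported.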
What am I missing?
Terraform code snippet:
resource "aws_cloudwatch_metric_alarm" "api_error_spike" {
  alarm_name          = "API error rate exceeding threshold"
  alarm_description   = "API error rate has exceeded allowed 25% threshold over 5 minutes"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "5"
  datapoints_to_alarm = "3" // 3 out of 5 data points should be in ALARM state to trigger alarm
  treat_missing_data  = "notBreaching"
  threshold           = 25

  metric_query {
    id          = "e1"
    expression  = "(m1+m2)*100"
    label       = "API Error Rate"
    return_data = "true"
  }

  metric_query {
    id = "m1"
    metric {
      metric_name = "5XXError"
      period      = "60" // 60 seconds is the lowest precision for standard (in AWS/ namespace) metrics
      stat        = "Average" // Average represents Error rate. Sum represents total errors
      unit        = "Count"
      namespace   = "AWS/ApiGateway"
      dimensions = {
        ApiName = "foo"
      }
    }
  }

  metric_query {
    id = "m2"
    metric {
      metric_name = "4XXError"
      period      = "60"
      stat        = "Average" // Average represents Error rate. Sum represents total errors
      unit        = "Count"
      namespace   = "AWS/ApiGateway"
      dimensions = {
        ApiName = "foo"
      }
    }
  }
}
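
For reference, this is roughly how I understand the `e1` expression, sketched in Python. It assumes that the Average of `5XXError` and `4XXError` is the number of such errors divided by the number of requests in the period, and that a minute with no requests produces no data point at all. `error_rate_percent` is a hypothetical helper written for this post, not part of any AWS API.

```python
def error_rate_percent(requests, count_4xx, count_5xx):
    """Approximate the alarm expression (m1+m2)*100 for one 60-second period.

    Assumption: m1 and m2 are per-period averages, i.e. error count
    divided by request count. With no requests there is nothing to
    average, so no data point is produced (modelled here as None).
    """
    if requests == 0:
        return None
    m1 = count_5xx / requests  # 5XXError Average
    m2 = count_4xx / requests  # 4XXError Average
    return (m1 + m2) * 100


# A minute with a single failing request already yields 100%.
print(error_rate_percent(requests=1, count_4xx=1, count_5xx=0))
```

If this reading is right, it would explain why every reported data point is exactly 100.0 and why most minutes have no data: a low-traffic API with an occasional failing request.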