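CloudWatch alarm on API Gateway error rate does not treat missing data points as notBreaching
I want to trigger an alarm when the error rate is > 25% over a 5-minute interval.
Alarm details:
Period: 1 minute
Datapoints to alarm: 3 out of 5
Missing data treatment: Treat missing data as good (not breaching threshold)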

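I noticed that the CloudWatch alarm was triggered for the following reason:

Threshold Crossed: 3 out of the last 5 datapoints [100.0 (27/05/21 21:56:00), 100.0 (27/05/21 21:54:00), 100.0 (27/05/21 21:49:00)] were greater than or equal to the threshold (25.0) and 2 missing datapoints were treated as [NonBreaching] (minimum 3 datapoints for OK -> ALARM transition).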

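I expected a data point to be evaluated for every minute, so the missing data points at 27/05/21 21:50:00, 27/05/21 21:51:00, 27/05/21 21:52:00, 27/05/21 21:53:00 and 27/05/21 21:55:00 should have been marked as good. The most recent 5 data points should therefore be:
27/05/21 21:56:00 : ALARM
27/05/21 21:55:00 : OK (missing data treated as notBreaching)
27/05/21 21:54:00 : ALARM
27/05/21 21:53:00 : OK (missing data treated as notBreaching)
27/05/21 21:52:00 : OK (missing data treated as notBreaching)
Of the most recent 5 data points, only 2 should be in ALARM state, so the end result should not trigger the alarm.
What am I missing?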
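
To make my expectation concrete, here is a minimal Python sketch of the evaluation I had in mind. It only models my assumption (one slot per minute, missing slots counted as not breaching); it is not a claim about how CloudWatch actually evaluates alarms. The function name `expected_state` and the `reported` dictionary are hypothetical and exist only for this illustration. The values come from the alarm reason quoted above.

```python
from datetime import datetime, timedelta

THRESHOLD = 25.0          # percent, same as `threshold` in the alarm
EVALUATION_PERIODS = 5    # same as `evaluation_periods`
DATAPOINTS_TO_ALARM = 3   # same as `datapoints_to_alarm`

# Real data points listed in the alarm reason (error rate in percent).
reported = {
    datetime(2021, 5, 27, 21, 56): 100.0,
    datetime(2021, 5, 27, 21, 54): 100.0,
    datetime(2021, 5, 27, 21, 49): 100.0,
}


def expected_state(latest, datapoints):
    """Evaluate the last EVALUATION_PERIODS one-minute slots ending at `latest`.

    Assumption: a slot with no data point is counted as not breaching.
    """
    breaching = 0
    for i in range(EVALUATION_PERIODS):
        slot = latest - timedelta(minutes=i)
        value = datapoints.get(slot)  # None means the data point is missing
        if value is not None and value >= THRESHOLD:
            breaching += 1
    return "ALARM" if breaching >= DATAPOINTS_TO_ALARM else "OK"


print(expected_state(datetime(2021, 5, 27, 21, 56), reported))  # OK
```

Under this model the 21:49:00 data point falls outside the 5-minute window, so only 2 of the 5 slots breach and the state stays OK. The actual alarm instead counted 21:49:00 as one of the 5 data points.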

Terraform code snippet:

resource "aws_cloudwatch_metric_alarm" "api_error_spike" {
  alarm_name = "API error rate exceeding threshold"
  alarm_description = "API error rate has exceeded allowed 25% threshold over 5 minutes"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods = "5"
  datapoints_to_alarm = "3" // 3 out of 5 data points should be in ALARM state to trigger alarm
  treat_missing_data = "notBreaching"

  threshold = 25

  metric_query {
    id = "e1"
    expression = "(m1+m2)*100"
    label = "API Error Rate"
    return_data = "true"
  }

  metric_query {
    id = "m1"
    metric {
      metric_name = "5XXError"
      period = "60" // 60 seconds is the lowest precision for standard (in AWS/ namespace) metrics
      stat = "Average" // Average represents Error rate. Sum represents total errors
      unit = "Count"
      namespace = "AWS/ApiGateway"
      dimensions = {
        ApiName = "foo"
      }
    }
  }

  metric_query {
    id = "m2"
    metric {
      metric_name = "4XXError"
      period = "60"
      stat = "Average" // Average represents Error rate. Sum represents total errors
      unit = "Count"
      namespace = "AWS/ApiGateway"
      dimensions = {
        ApiName = "foo"
      }
    }
  }
}