amazon-web-services - AWS CloudWatch alarm that adds capacity to an EC2 Auto Scaling group stays in the ALARM state

Question

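I set up a CloudWatch alarm to add 1 capacity unit to an EC2 Auto Scaling group when memory reservation is greater than 70%. The alarm fired at the right moment, but it has now been in the ALARM state for more than 16 hours and nothing has changed in the EC2 Auto Scaling group. What could be wrong?

Here is my ECS CloudFormation template: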

ECSCluster:
  Type: AWS::ECS::Cluster
  Properties:
    ClusterName: !Ref EnvironmentName

ECSAutoScalingGroup:
  DependsOn: ECSCluster
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    VPCZoneIdentifier: !Ref Subnets
    LaunchConfigurationName: !Ref ECSLaunchConfiguration
    MinSize: !Ref ClusterMinSize
    MaxSize: !Ref ClusterMaxSize
    DesiredCapacity: !Ref ClusterDesiredCapacity
  CreationPolicy:
    ResourceSignal:
      Timeout: PT15M
  UpdatePolicy:
    AutoScalingRollingUpdate:
      MinInstancesInService: 1
      MaxBatchSize: 1
      PauseTime: PT15M
      SuspendProcesses:
        - HealthCheck
        - ReplaceUnhealthy
        - AZRebalance
        - AlarmNotification
        - ScheduledActions
      WaitOnResourceSignals: true

ScaleUpPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AdjustmentType: ChangeInCapacity
    AutoScalingGroupName: !Ref ECSAutoScalingGroup
    Cooldown: '1'
    ScalingAdjustment: '1'

MemoryReservationAlarmHigh:
  Type: AWS::CloudWatch::Alarm
  Properties:
    EvaluationPeriods: '2'
    Statistic: Average
    Threshold: '70'
    AlarmDescription: Alarm if Cluster Memory Reservation is too high
    Period: '60'
    AlarmActions:
    - Ref: ScaleUpPolicy
    Namespace: AWS/ECS
    Dimensions:
    - Name: ClusterName
      Value: !Ref ECSCluster
    ComparisonOperator: GreaterThanThreshold
    MetricName: MemoryReservation

ScaleDownPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AdjustmentType: ChangeInCapacity
    AutoScalingGroupName: !Ref ECSAutoScalingGroup
    Cooldown: '1'
    ScalingAdjustment: '-1'

MemoryReservationAlarmLow:
  Type: AWS::CloudWatch::Alarm
  Properties:
    EvaluationPeriods: '2'
    Statistic: Average
    Threshold: '30'
    AlarmDescription: Alarm if Cluster Memory Reservation is too Low
    Period: '60'
    AlarmActions:
    - Ref: ScaleDownPolicy
    Namespace: AWS/ECS
    Dimensions:
    - Name: ClusterName
      Value: !Ref ECSCluster
    ComparisonOperator: LessThanThreshold
    MetricName: MemoryReservation

ECSLaunchConfiguration:
  Type: AWS::AutoScaling::LaunchConfiguration
  Properties:
    KeyName: !If [IsProd, !Ref 'AWS::NoValue', !Ref KeyName]
    ImageId: !Ref ECSAMI
    InstanceType: !Ref InstanceType
    SecurityGroups:
      - !Ref SecurityGroup
    IamInstanceProfile: !Ref ECSInstanceProfile
    UserData:
      "Fn::Base64": !Sub |
        #!/bin/bash
        source /etc/profile.d/proxy.sh
        yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
        yum install -y https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
        yum install -y aws-cfn-bootstrap hibagent
        cat >> /opt/aws/amazon-cloudwatch-agent/etc/common-config.toml <<EOF
        [proxy]
            http_proxy="${!http_proxy}"
            https_proxy="${!https_proxy}"
            no_proxy="${!no_proxy}"
        EOF
        /opt/aws/bin/cfn-init -v --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSLaunchConfiguration
        /opt/aws/bin/cfn-signal -e $? --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSAutoScalingGroup
        /usr/bin/enable-ec2-spot-hibernation

  Metadata:
    AWS::CloudFormation::Init:
      config:
        packages:
          yum:
            collectd: []

        commands:
          01_add_instance_to_cluster:
            command: !Sub echo ECS_CLUSTER=${ECSCluster} >> /etc/ecs/ecs.config
          02_enable_cloudwatch_agent:
            command: !Sub /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c ssm:${ECSCloudWatchParameter} -s
        files:
          /etc/cfn/cfn-hup.conf:
            mode: 000400
            owner: root
            group: root
            content: !Sub |
              [main]
              stack=${AWS::StackId}
              region=${AWS::Region}

          /etc/cfn/hooks.d/cfn-auto-reloader.conf:
            content: !Sub |
              [cfn-auto-reloader-hook]
              triggers=post.update
              path=Resources.ECSLaunchConfiguration.Metadata.AWS::CloudFormation::Init
              action=/opt/aws/bin/cfn-init -v --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSLaunchConfiguration

        services:
          sysvinit:
            cfn-hup:
              enabled: true
              ensureRunning: true
              files:
                - /etc/cfn/cfn-hup.conf
                - /etc/cfn/hooks.d/cfn-auto-reloader.conf

# This IAM Role is attached to all of the ECS hosts. It is based on the default role
# published here:
# http://docs.aws.amazon.com/AmazonECS/latest/developerguide/instance_IAM_role.html
#
# You can add other IAM policy statements here to allow access from your ECS hosts
# to other AWS services. Please note that this role will be used by ALL containers
# running on the ECS host.

ECSRole:
  Type: AWS::IAM::Role
  Properties:
    Path: /
    RoleName: !Sub ${EnvironmentName}-ECSRole-${AWS::Region}
    AssumeRolePolicyDocument: |
      {
          "Statement": [{
              "Action": "sts:AssumeRole",
              "Effect": "Allow",
              "Principal": {
                  "Service": "ec2.amazonaws.com"
              }
          }]
      }
    ManagedPolicyArns:
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/CSOPSRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPIAMRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPBasePolicy"
      - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
      - arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
    Policies:
      - PolicyName: ecs-service
        PolicyDocument: |
          {
              "Statement": [{
                  "Effect": "Allow",
                  "Action": [
                      "ecs:CreateCluster",
                      "ecs:DeregisterContainerInstance",
                      "ecs:DiscoverPollEndpoint",
                      "ecs:Poll",
                      "ecs:RegisterContainerInstance",
                      "ecs:StartTelemetrySession",
                      "ecs:Submit*",
                      "ecr:BatchCheckLayerAvailability",
                      "ecr:BatchGetImage",
                      "ecr:GetDownloadUrlForLayer",
                      "ecr:GetAuthorizationToken"
                  ],
                  "Resource": "*"
              }]
          }

ECSInstanceProfile:
  Type: AWS::IAM::InstanceProfile
  Properties:
    Path: /
    Roles:
      - !Ref ECSRole

ECSServiceAutoScalingRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        Action:
          - "sts:AssumeRole"
        Effect: Allow
        Principal:
          Service:
            - application-autoscaling.amazonaws.com
    Path: /
    ManagedPolicyArns:
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/CSOPSRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPIAMRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPBasePolicy"
    Policies:
      - PolicyName: ecs-service-autoscaling
        PolicyDocument:
          Statement:
            Effect: Allow
            Action:
              - application-autoscaling:*
              - cloudwatch:DescribeAlarms
              - cloudwatch:PutMetricAlarm
              - ecs:DescribeServices
              - ecs:UpdateService
            Resource: "*"

ECSCloudWatchParameter:
  Type: AWS::SSM::Parameter
  Properties:
    Description: CloudWatch Log configs for ECS cluster
    Name: !Sub AmazonCloudWatch-${ECSCluster}-ECS
    Type: String
    Value: !Sub |
      {
        "logs": {
          "force_flush_interval": 5,
          "logs_collected": {
            "files": {
              "collect_list": [
                {
                  "file_path": "/var/log/messages",
                  "log_group_name": "${ECSCluster}/var/log/messages",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%b %d %H:%M:%S"
                },
                {
                  "file_path": "/var/log/dmesg",
                  "log_group_name": "${ECSCluster}/var/log/dmesg",
                  "log_stream_name": "{instance_id}"
                },
                {
                  "file_path": "/var/log/docker",
                  "log_group_name": "${ECSCluster}/var/log/docker",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%S.%f"
                },
                {
                  "file_path": "/var/log/ecs/ecs-init.log",
                  "log_group_name": "${ECSCluster}/var/log/ecs/ecs-init.log",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
                },
                {
                  "file_path": "/var/log/ecs/ecs-agent.log.*",
                  "log_group_name": "${ECSCluster}/var/log/ecs/ecs-agent.log",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
                },
                {
                  "file_path": "/var/log/ecs/audit.log",
                  "log_group_name": "${ECSCluster}/var/log/ecs/audit.log",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
                }
              ]
            }
          }
        },
        "metrics": {
          "append_dimensions": {
            "AutoScalingGroupName": "${!aws:AutoScalingGroupName}",
            "InstanceId": "${!aws:InstanceId}",
            "InstanceType": "${!aws:InstanceType}"
          },
          "metrics_collected": {
            "collectd": {
              "metrics_aggregation_interval": 60
            },
            "disk": {
              "measurement": [
                "used_percent"
              ],
              "metrics_collection_interval": 60,
              "resources": [
                "/"
              ]
            },
            "mem": {
              "measurement": [
                "mem_used_percent"
              ],
              "metrics_collection_interval": 60
            },
            "statsd": {
              "metrics_aggregation_interval": 60,
              "metrics_collection_interval": 10,
              "service_address": ":8125"
            }
          }
        }
      }

ECSClusterParameter:
  Type: AWS::SSM::Parameter
  Properties:
    Description: !Sub ${EnvironmentName} - ECS Cluster
    Name: !Sub /${EnvironmentName}/ecs-cluster
    Type: String
    Value: !Ref ECSCluster

ECSServiceAutoScalingRoleParameter:
  Type: AWS::SSM::Parameter
  Properties:
    Description: !Sub ${EnvironmentName} - ECS Service ASG Role
    Name: !Sub /${EnvironmentName}/ecs-service-asg-role
    Type: String
    Value: !GetAtt ECSServiceAutoScalingRole.Arn

Alarm activity history:

2019-12-26 11:40:54 Action  Successfully executed action arn:aws:autoscaling:ap-southeast-2:031539715286:scalingPolicy:95e836b6-2f56-498d-b931-7ec4184bedc4:autoScalingGroupName/ECS-UEBZA8GAP8S7-ECSAutoScalingGroup-1BIBTJH5I50W9:policyName/ECS-UEBZA8GAP8S7-ScaleUpPolicy-17LUWE42DC7EO
2019-12-26 11:40:54 State update  Alarm updated from OK to In alarm

score 1 · Accepted Answer

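Make sure no processes are suspended on the Auto Scaling group. A suspended AlarmNotification process means incoming alarms will not trigger scaling policies. A suspended Launch process means nothing will be launched even if the desired capacity goes up.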
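The template in the question lists AlarmNotification under SuspendProcesses for rolling updates, so it may be worth confirming that nothing is still suspended on the group. The following is a minimal sketch of that check, assuming the response shape of boto3's `describe_auto_scaling_groups`. The function name `blocking_suspensions` and the sample data are hypothetical, and the code works on a plain dictionary so it runs without AWS access.

```python
# Processes whose suspension stops an alarm-driven scale-out.
BLOCKING_PROCESSES = {"AlarmNotification", "Launch"}


def blocking_suspensions(response):
    """Map each group name to its suspended processes that block scale-out."""
    blocked = {}
    for group in response.get("AutoScalingGroups", []):
        suspended = {p["ProcessName"] for p in group.get("SuspendedProcesses", [])}
        hits = sorted(suspended & BLOCKING_PROCESSES)
        if hits:
            blocked[group["AutoScalingGroupName"]] = hits
    return blocked


# Hypothetical sample in the shape of a describe_auto_scaling_groups response.
# With boto3 it would come from something like:
#   boto3.client("autoscaling").describe_auto_scaling_groups(
#       AutoScalingGroupNames=["example-asg"])
sample = {
    "AutoScalingGroups": [
        {
            "AutoScalingGroupName": "example-asg",
            "SuspendedProcesses": [
                {"ProcessName": "AlarmNotification"},
                {"ProcessName": "AZRebalance"},
            ],
        }
    ]
}

print(blocking_suspensions(sample))
```

An empty result means neither of the two blocking processes is suspended. Other suspended processes, such as AZRebalance in the sample, are ignored on purpose because they do not stop a scaling policy from launching instances.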

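Other common problems that can cause this:

If you use instance weights and increase the desired capacity by 1, but your lowest weight is not 1, the group may never be able to scale out.
Make sure no other scaling policies are being triggered that might override this scaling policy.
Check the activity history to make sure health check replacements are not happening constantly, since each one starts a 5-minute cooldown (the default, because no cooldown is set on the ASG itself, only on the scaling policy) and blocks simple scaling policies.
Make sure the desired capacity is not already at the maximum.
Besides the alarm firing, make sure the alarm history shows that an Auto Scaling "action" took place (the action actually happens every minute the alarm stays in the ALARM state, regardless of your evaluation settings, but only the first one is posted to the alarm history).
Check the ASG activity history for launch failures. These are especially common with Spot instances, and the ASG eventually goes into a backoff state after too many failures. Any manual update to the group resets this backoff.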
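Two of the checks above, desired capacity already at the maximum and launch failures in the activity history, can be sketched as a small helper. This is an illustration only, assuming the field names returned by boto3's `describe_auto_scaling_groups` and `describe_scaling_activities`. The function name `scale_out_blockers` and all sample values are hypothetical.

```python
def scale_out_blockers(group, activities):
    """Return human-readable reasons why a +1 scale-out may have no effect."""
    findings = []

    # A ChangeInCapacity adjustment cannot push desired capacity past MaxSize.
    if group["DesiredCapacity"] >= group["MaxSize"]:
        findings.append("desired capacity is already at MaxSize")

    # Launch failures show up as activities with StatusCode "Failed".
    failed = [a for a in activities if a.get("StatusCode") == "Failed"]
    if failed:
        findings.append(f"{len(failed)} failed scaling activities")

    return findings


# Hypothetical sample data in the shape of boto3 responses:
#   group      -> one entry of describe_auto_scaling_groups()["AutoScalingGroups"]
#   activities -> describe_scaling_activities()["Activities"]
group = {
    "AutoScalingGroupName": "example-asg",
    "MinSize": 1,
    "MaxSize": 2,
    "DesiredCapacity": 2,
}
activities = [
    {"StatusCode": "Failed", "StatusMessage": "example launch failure"},
    {"StatusCode": "Successful"},
]

for finding in scale_out_blockers(group, activities):
    print(finding)
```

The helper only reports what the two responses already contain. It does not detect the weight, cooldown, or competing-policy cases from the list, which need a look at the group configuration and its other policies.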

score 0 · Accepted Answer

Did you specify "ActionsEnabled=True"?
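For context, the ActionsEnabled property of AWS::CloudWatch::Alarm defaults to true when it is omitted, and the alarm history in the question shows "Successfully executed action", which suggests actions were enabled at that moment. Someone could still disable them later, so a quick check of the live alarm can rule this out. The sketch below assumes the shape of one entry in the "MetricAlarms" list returned by boto3's `describe_alarms`. The function name `alarm_action_problems` and the sample alarm are hypothetical.

```python
def alarm_action_problems(alarm):
    """List reasons why an alarm in the ALARM state would not invoke a policy."""
    problems = []

    # Treating a missing key as enabled is an assumption of this sketch.
    if not alarm.get("ActionsEnabled", True):
        problems.append("ActionsEnabled is false")

    # An alarm with no AlarmActions has nothing to invoke on ALARM.
    if not alarm.get("AlarmActions"):
        problems.append("no AlarmActions configured")

    return problems


# Hypothetical sample in the shape of one describe_alarms "MetricAlarms" entry.
sample_alarm = {
    "AlarmName": "example-memory-high",
    "ActionsEnabled": False,
    "AlarmActions": ["arn:aws:autoscaling:example-policy-arn"],
    "StateValue": "ALARM",
}

print(alarm_action_problems(sample_alarm))
```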

Answered 2019-12-30T07:29:34.113
