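I have a state machine consisting of a Map state that launches many Fargate tasks (30+) from a very similar task definition. The only difference between the tasks is the environment variables in the ContainerOverrides block.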

Task state definition:

"CalculateTask": {
    "Type": "Task",
    "Resource": "arn:aws:states:::ecs:runTask.sync",
    "Retry": [
        {
            "ErrorEquals": [
                "States.ALL"
            ],
            "IntervalSeconds": 10,
            "MaxAttempts": 2,
            "BackoffRate": 1.5
        }
    ],
    "Parameters": {
        "LaunchType": "FARGATE",
        "Cluster": "arn:aws:ecs:region:111111111:cluster/cluster-name",
        "TaskDefinition": "arn:aws:ecs:region:111111111:task-definition/task-definition:44",
        "NetworkConfiguration": {
            "AwsvpcConfiguration": {
                "Subnets": [
                    "subnet-1111111111111111","subnet-2222222222222222","subnet-3333333333333333"
                ],
                ...
            }
        },
        "Overrides": {
            "ContainerOverrides": [
                {
                    "Name": "Phase-1-start",
                    "Environment": [
                        {
                            "Name": "COMMAND",
                            "Value": "calculateGas/Oil/PeakGas..."
                        }
                    ]
                }
            ]
        }
    }
}

When I run my state machine, the tasks always fail with this StoppedReason:

"StopCode": "TaskFailedToStart",
    "StoppedAt": 1618584363236,
    "StoppedReason": "Unexpected EC2 error while attempting to Create Network Interface with public IP assignment 
    enabled in subnet 'subnet-2222222222222222': InsufficientFreeAddressesInSubnet",

I don't understand why this happens, since I provided 3 subnet IDs for ECS to choose from.

1 Answer

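I had the same problem. The root cause turned out to be that the Fargate tasks I had started with run_task were, for some reason, not terminating properly. They ended up in an "INACTIVE" state and lingered for months. Because they never terminated properly, they never released their IP addresses in the subnet, so new tasks could not obtain an IP and failed to start.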
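To confirm that address exhaustion really is the problem, it can help to look at how many free IPs each subnet still has. Below is a minimal sketch, assuming a boto3 EC2 client and the `AvailableIpAddressCount` field returned by `describe_subnets`. The helper names `free_ips_by_subnet` and `subnets_below` are made up for illustration and are not part of my original fix.

```python
def free_ips_by_subnet(ec2_client, subnet_ids):
    """Return {subnet_id: free address count} for the given subnets.

    `ec2_client` is expected to behave like boto3.client("ec2").
    """
    response = ec2_client.describe_subnets(SubnetIds=list(subnet_ids))
    return {
        subnet["SubnetId"]: subnet["AvailableIpAddressCount"]
        for subnet in response["Subnets"]
    }


def subnets_below(counts, minimum):
    """Return the subnet IDs whose free address count is below `minimum`."""
    return sorted(
        subnet_id for subnet_id, free in counts.items() if free < minimum
    )


# Hypothetical usage (requires AWS credentials and the boto3 package):
#
#   import boto3
#   ec2 = boto3.client("ec2")
#   counts = free_ips_by_subnet(ec2, ["subnet-1111111111111111"])
#   print(subnets_below(counts, minimum=30))
```

Each Fargate task in awsvpc mode takes its own network interface, so a Map state that fans out to 30+ tasks needs at least that many free addresses across the subnets it is allowed to use.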

To fix it, I had to:

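  1. Log in to the AWS console
  2. Go to the ECS service
  3. Click the Clusters page
  4. Click the problematic cluster (probably the one with a bunch of Running Tasks)
  5. Click the Tasks tab
  6. Select all [INACTIVE] instances
  7. Click Stop to stop the tasks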
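The console steps above can also be scripted. The sketch below is not what I actually ran; it assumes a boto3 ECS client and uses `list_tasks`, `describe_tasks`, and `stop_task`. Since I'm not sure how the console's [INACTIVE] label maps to the API, it picks tasks by age instead, which is my own assumption. The names `find_stale_tasks` and `stop_tasks` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_stale_tasks(ecs_client, cluster, max_age, now=None):
    """Return ARNs of running tasks in `cluster` created more than `max_age` ago.

    `ecs_client` is expected to behave like boto3.client("ecs").
    """
    now = now or datetime.now(timezone.utc)

    # Collect every task ARN, following nextToken across pages.
    task_arns = []
    kwargs = {"cluster": cluster, "desiredStatus": "RUNNING"}
    while True:
        page = ecs_client.list_tasks(**kwargs)
        task_arns.extend(page.get("taskArns", []))
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token

    # describe_tasks accepts at most 100 ARNs per call, so go in chunks.
    stale = []
    for start in range(0, len(task_arns), 100):
        chunk = task_arns[start:start + 100]
        described = ecs_client.describe_tasks(cluster=cluster, tasks=chunk)
        for task in described["tasks"]:
            created = task.get("createdAt")
            if created is not None and now - created > max_age:
                stale.append(task["taskArn"])
    return stale


def stop_tasks(ecs_client, cluster, task_arns, reason="Stale task cleanup"):
    """Stop each task in the list and return how many stop calls were made."""
    for arn in task_arns:
        ecs_client.stop_task(cluster=cluster, task=arn, reason=reason)
    return len(task_arns)
```

It is probably worth printing the result of `find_stale_tasks` and checking it by hand before passing it to `stop_tasks`, since an age cutoff such as `timedelta(days=1)` would also catch legitimately long-running tasks.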

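In addition to cleaning up these inactive instances, I added some extra code/alerting to make sure this problem does not go undetected again: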

import logging
import time

import boto3

_LOGGER = logging.getLogger(__name__)


# cluster_name is passed in explicitly because it is needed by describe_tasks.
def invoke_fargate(cw_metrics, cluster_name, YOUR_ARGS_HERE):
    client = boto3.client("ecs", region_name=get_aws_region())
    response = client.run_task(YOUR_CODE_HERE)

    # Honestly not sure if this is required...better safe than sorry?
    _LOGGER.info("Starting to sleep to allow `run_task` chance to kick off container")
    time.sleep(30)

    # Look up the task we just launched to see whether it has already stopped.
    task_arn = response["tasks"][0]["taskArn"]
    description = client.describe_tasks(cluster=cluster_name, tasks=[task_arn])
    _LOGGER.info("%s", description)

    # Raise an alarm if the task could not start (e.g. no free IP in the subnet).
    for status_dict in description["tasks"]:
        if status_dict.get("stopCode") == "TaskFailedToStart":
            cw_metrics.trigger_alarm("FARGATE_INVOCATION_FAILED")
    _LOGGER.info("Done with Fargate invocation")
answered 2021-11-12T16:31:56.553