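We are trying to start a dask cluster on AWS using ECS. Our current setup:
- Two services: a dask-scheduler service and a dask-worker service, each with its own task definition. Each service runs a single task (the dask-worker tasks can be scaled out later).
- dask-scheduler maps ports 8786, 8787 and 9786 from the container to the host. The dask-worker task maps no ports.
- A classic load balancer sits in front of dask-scheduler and listens on those three ports over TCP. Even though we have only one dask-scheduler task, the load balancer gives us a static address that survives scheduler restarts.
- dask-worker is started with the load balancer's address as its argument. dask-scheduler is started with no arguments.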
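To rule out basic reachability problems, a check along these lines can be run from the worker host against the load balancer address. This is only a minimal sketch: the function name `check_ports` is made up for illustration, and it tests nothing beyond whether a TCP handshake succeeds on each of the three ports listed above.

```python
import socket

# Ports the scheduler task maps to the host, per the setup described above.
SCHEDULER_PORTS = (8786, 8787, 9786)


def check_ports(host, ports=SCHEDULER_PORTS, timeout=3.0):
    """Return {port: bool}, True when a TCP connection to host:port succeeds.

    Hypothetical helper for illustration; it is not part of dask or distributed.
    """
    results = {}
    for port in ports:
        try:
            # create_connection completes the TCP handshake; the "with" block
            # closes the socket again straight away.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts and DNS failures.
            results[port] = False
    return results
```

Calling `check_ports("<load balancer DNS name>")` and then `check_ports("<scheduler host IP>")` would show whether the two paths differ at the TCP level. Each probe opens and immediately closes a connection, so it will probably appear in the scheduler log as a connection followed by a lost connection.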
Unfortunately, I am not having much luck. I get these log messages:
06:10:24 distributed.core - INFO - Connection from 172.31.35.94:49003 to Scheduler
06:10:24 distributed.core - INFO - Lost connection: ('172.31.35.94', 49003)
06:10:24 distributed.core - INFO - Close connection from 172.31.35.94:49003 to Scheduler
06:10:54 distributed.core - INFO - Connection from 172.31.35.94:49009 to Scheduler
06:10:54 distributed.core - INFO - Lost connection: ('172.31.35.94', 49009)
06:10:54 distributed.core - INFO - Close connection from 172.31.35.94:49009 to Scheduler
06:11:07 distributed.core - INFO - Connection from 172.31.35.94:49018 to Scheduler
06:11:07 distributed.core - INFO - Connection from 172.31.35.94:49019 to Scheduler
06:11:07 distributed.scheduler - INFO - Receive client connection: 941a5c1a-8ac2-11e6-a74c-0242ac110001
06:11:24 distributed.core - INFO - Connection from 172.31.35.94:49023 to Scheduler
06:11:24 distributed.core - INFO - Lost connection: ('172.31.35.94', 49023)
06:11:24 distributed.core - INFO - Close connection from 172.31.35.94:49023 to Scheduler
06:11:54 distributed.core - INFO - Connection from 172.31.35.94:49033 to Scheduler
06:11:54 distributed.core - INFO - Lost connection: ('172.31.35.94', 49033)
06:11:54 distributed.core - INFO - Close connection from 172.31.35.94:49033 to Scheduler
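The dropped connections in this log look regularly spaced, which a short script can confirm. The sketch below copies the timestamps and messages from the log and computes the gap between consecutive "Lost connection" events; the function name `lost_connection_gaps` is made up for illustration. A fixed cadence might point to a periodic probe such as a load balancer health check rather than to worker traffic, but that is an assumption I have not confirmed.

```python
from datetime import datetime

# (timestamp, message) pairs copied from the scheduler log above.
LOG = [
    ("06:10:24", "distributed.core - INFO - Connection from 172.31.35.94:49003 to Scheduler"),
    ("06:10:24", "distributed.core - INFO - Lost connection: ('172.31.35.94', 49003)"),
    ("06:10:24", "distributed.core - INFO - Close connection from 172.31.35.94:49003 to Scheduler"),
    ("06:10:54", "distributed.core - INFO - Connection from 172.31.35.94:49009 to Scheduler"),
    ("06:10:54", "distributed.core - INFO - Lost connection: ('172.31.35.94', 49009)"),
    ("06:10:54", "distributed.core - INFO - Close connection from 172.31.35.94:49009 to Scheduler"),
    ("06:11:07", "distributed.core - INFO - Connection from 172.31.35.94:49018 to Scheduler"),
    ("06:11:07", "distributed.core - INFO - Connection from 172.31.35.94:49019 to Scheduler"),
    ("06:11:07", "distributed.scheduler - INFO - Receive client connection: 941a5c1a-8ac2-11e6-a74c-0242ac110001"),
    ("06:11:24", "distributed.core - INFO - Connection from 172.31.35.94:49023 to Scheduler"),
    ("06:11:24", "distributed.core - INFO - Lost connection: ('172.31.35.94', 49023)"),
    ("06:11:24", "distributed.core - INFO - Close connection from 172.31.35.94:49023 to Scheduler"),
    ("06:11:54", "distributed.core - INFO - Connection from 172.31.35.94:49033 to Scheduler"),
    ("06:11:54", "distributed.core - INFO - Lost connection: ('172.31.35.94', 49033)"),
    ("06:11:54", "distributed.core - INFO - Close connection from 172.31.35.94:49033 to Scheduler"),
]


def lost_connection_gaps(entries):
    """Return the gaps in seconds between consecutive 'Lost connection' events."""
    times = [
        datetime.strptime(ts, "%H:%M:%S")
        for ts, msg in entries
        if "Lost connection" in msg
    ]
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


print(lost_connection_gaps(LOG))  # [30.0, 30.0, 30.0]
```

The two connections at 06:11:07 are not followed by a "Lost connection" line in this excerpt, so only the 30-second pattern shows the connect-then-drop behaviour.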
I think this is a problem with the load balancer. When I run the same setup with a static IP instead, it works fine.
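Any ideas why this would be a problem? I have tried running in --no-nanny mode, and I have tried passing the load balancer address to the scheduler via --host, but to no avail.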