python - Dask with an HTCondor scheduler

Question

Background

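I have an image analysis pipeline with parallelized steps. The pipeline is written in Python, and the parallelization is controlled by dask.distributed. The minimum processing setup has 1 scheduler + 3 workers with 15 processes each. In the first short step of the analysis I use 1 process per worker but all the RAM of the node; in all the other analysis steps, all nodes and processes are used.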

Problem

The admin will install HTCondor as the scheduler for the cluster.

Thoughts

In order to run my code on the new setup, I plan to use the approach shown in the Dask manual for SGE, because the cluster has a shared network file system.

# Job 1
# Start a dask-scheduler somewhere and write connection information to file
qsub -b y /path/to/dask-scheduler --scheduler-file /path/to/scheduler.json

# Job 2
# Start 100 dask-worker processes in an array job pointing to the same file
qsub -b y -t 1-100 /path/to/dask-worker --scheduler-file /path/to/scheduler.json

# Job 3
# Start a process with the Python code, where the client is created this way
from dask.distributed import Client
client = Client(scheduler_file='/path/to/scheduler.json')

Questions and advice

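If I understand this approach correctly, I will start the scheduler, the workers, and the analysis as independent jobs (separate HTCondor submit files). How can I make sure the order of execution is correct? Is there a way to keep the processing approach I have been using, or would it be more efficient to translate the code to work better with HTCondor? Thanks for the help!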
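
To make the ordering question concrete, here is a minimal sketch of how the analysis job could guard itself, assuming the three jobs may start in any order and all of them see the same shared file system. The helper names wait_for_file and connect, the timeouts, and the polling interval are hypothetical and only for illustration; Client(scheduler_file=...) and Client.wait_for_workers are existing dask.distributed calls.

```python
import os
import time


def wait_for_file(path, timeout=60.0, poll=0.5):
    """Block until `path` exists; return True if it appeared, False on timeout."""
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True


def connect(scheduler_file, n_workers, timeout=600.0):
    """Connect once the scheduler file exists and `n_workers` workers have joined."""
    # Imported lazily so the helper above can be used without dask installed.
    from dask.distributed import Client

    if not wait_for_file(scheduler_file, timeout=timeout):
        raise TimeoutError(f"{scheduler_file} did not appear within {timeout} s")
    client = Client(scheduler_file=scheduler_file)
    # Block until the requested number of workers has registered with the scheduler.
    client.wait_for_workers(n_workers)
    return client
```

With this in place the analysis script could call connect('/path/to/scheduler.json', n_workers=3) before submitting any work, so it does not matter which of the three jobs HTCondor happens to start first.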

score 1 · Accepted Answer

HTCondor JobQueue support has been merged (https://github.com/dask/dask-jobqueue/pull/245) and should now be available in Dask JobQueue (HTCondorCluster(cores=1, memory='100MB', disk='100MB')).
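
A minimal sketch of how that class might be used, assuming dask-jobqueue is installed on the submit node. The resource values are the ones quoted above; the function names cluster_kwargs and start_cluster and the default worker count of 3 are hypothetical and would need adjusting to the real pipeline.

```python
def cluster_kwargs(cores=1, memory="100MB", disk="100MB"):
    """Per-job resources for HTCondorCluster; defaults are the values quoted in the answer."""
    return {"cores": cores, "memory": memory, "disk": disk}


def start_cluster(n_workers=3, **overrides):
    """Submit worker jobs through HTCondor and return (cluster, client)."""
    # Imported lazily so this module loads even where dask-jobqueue is missing.
    from dask.distributed import Client
    from dask_jobqueue import HTCondorCluster

    cluster = HTCondorCluster(**cluster_kwargs(**overrides))
    # Ask for workers; dask-jobqueue submits the corresponding HTCondor jobs.
    cluster.scale(n_workers)
    client = Client(cluster)
    # Block until the workers have actually joined the scheduler.
    client.wait_for_workers(n_workers)
    return cluster, client
```

In this arrangement the scheduler runs inside the Python process that creates the cluster, and the workers are submitted as HTCondor jobs from there, so separate submit files for the scheduler and the workers would not be needed.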
