python - Running a job on multiple nodes of a GridEngine cluster

Question

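I have access to a 128-core cluster on which I would like to run a parallelised job. The cluster uses Sun GridEngine, and my program is written with Parallel Python, numpy and scipy on Python 2.5.8. Running the job on a single node (4 cores) gives roughly a 3.5x improvement over a single core. I now want to take this to the next level and split the job across about 4 nodes. My qsub script looks like this: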

#!/bin/bash
# The name of the job, can be whatever makes sense to you
#$ -N jobname

# The job should be placed into the queue 'all.q'.
#$ -q all.q

# Redirect output stream to this file.
#$ -o jobname_output.dat

# Redirect error stream to this file.
#$ -e jobname_error.dat

# The batch system should use the current directory as the working directory.
# Both output files will be placed there, and the batch system expects to find
# the executable there as well.
#$ -cwd

# request Bourne shell as shell for job.
#$ -S /bin/sh

# print date and time
date

# spython is the server's version of Python 2.5. Using python instead of spython causes the program to run in python 2.3
spython programname.py

# print date and time again
date

Does anyone know how to do this?

score 2 · Accepted Answer

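Yes, you need to include the Grid Engine option -np 16, either in your script like this:

# Use 16 processors
#$ -np 16

or on the command line when you submit the script. Or, for a more permanent arrangement, use an .sge_request file.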

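On all the GE installations I've ever used, this will give you 16 processors (or processor cores these days) on as few nodes as necessary, so if your nodes have 4 cores you'll get 4 nodes, if they have 8 you'll get 2, and so on. Placing the job on, say, 2 cores on each of 8 nodes (which you might want to do if each process needs a lot of memory) is a little more complicated, and you should consult your support team.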
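To make the node arithmetic concrete, and to show one way the Python program might discover which nodes it was given, here is a minimal sketch. It assumes the job runs under a Grid Engine parallel environment, in which case Grid Engine is expected to set the PE_HOSTFILE environment variable to a file listing one host per line with its slot count in the second column. The helper names, the sample host names and the port number are illustrative assumptions, not part of the original question or of any library, and starting a Parallel Python server on each node is left to your site's setup.

```python
import math
import os


def min_nodes(slots, cores_per_node):
    """Fewest nodes needed to hold `slots` processes."""
    return int(math.ceil(slots / float(cores_per_node)))


def parse_hostfile(text):
    """Parse PE_HOSTFILE-style text into (hostname, slots) pairs.

    Assumed line format: "<host> <slots> <queue> <processor range>".
    """
    hosts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        hosts.append((fields[0], int(fields[1])))
    return hosts


def ppserver_addresses(hosts, port=60000):
    """Build "host:port" strings; the port is an assumed value."""
    return tuple("%s:%d" % (name, port) for name, _slots in hosts)


def read_allocation():
    """Return the hosts granted to this job, or [] outside a parallel job."""
    path = os.environ.get("PE_HOSTFILE")
    if not path:
        return []
    handle = open(path)
    try:
        return parse_hostfile(handle.read())
    finally:
        handle.close()


# Hypothetical allocation for 16 slots on 4-core nodes.
sample = """node01 4 all.q@node01 UNDEFINED
node02 4 all.q@node02 UNDEFINED
node03 4 all.q@node03 UNDEFINED
node04 4 all.q@node04 UNDEFINED
"""

hosts = parse_hostfile(sample)
addresses = ppserver_addresses(hosts)

print(min_nodes(16, 4))  # 4 nodes when each node has 4 cores
print(min_nodes(16, 8))  # 2 nodes when each node has 8 cores
print(addresses)
```

Inside the job, the tuple from ppserver_addresses(read_allocation()) could then be handed to Parallel Python as its list of remote servers, provided a server process is already listening on each of those hosts.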
