bash - How do I use GNU parallel (bash script) with the aprun command on Cray XE6 compute nodes (Unix-like environment)?

Question

I am trying to run 16 instances of an mpi4py Python script, hello.py. I stored 16 commands of this kind in s.txt:

python /lustre/4_mpi4py/hello.py > 01.out

I submit this on a Cray cluster through an aprun command like this:

aprun -n 32 sh -c 'parallel -j 8 :::: s.txt'

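My intention was to run 8 of these Python jobs per node at a time. The script ran for more than 3 hours and none of the *.out files were created. This is what I get in the PBS scheduler output file: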

Python version 2.7.3 loaded
aprun: Apid 11432669: Caught signal Terminated, sending to application
aprun: Apid 11432669: Caught signal Terminated, sending to application
parallel: SIGTERM received. No new jobs will be started.
parallel: SIGTERM received. No new jobs will be started.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: SIGTERM received. No new jobs will be started.
parallel: SIGTERM received. No new jobs will be started.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 07.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 07.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 02.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 04.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 06.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 01.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 01.out
parallel: SIGTERM received. No new jobs will be started.
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 10.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 04.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 08.out
parallel: SIGTERM received. No new jobs will be started.
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out

I am running this on a single node, which has 32 cores. I suspect my use of the GNU parallel command is wrong. Can someone help?

score 1 · Accepted Answer

As listed in https://portal.tacc.utexas.edu/documents/13601/1102030/4_mpi4py.pdf#page=8:

from mpi4py import MPI

comm = MPI.COMM_WORLD

# Three equivalent ways to read this process's rank and the communicator size
print("Hello! I'm rank %02d from %02d" % (comm.rank, comm.size))

print("Hello! I'm rank %02d from %02d" % (comm.Get_rank(),
                                          comm.Get_size()))

print("Hello! I'm rank %02d from %02d" %
      (MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size()))

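Your 4_mpi4py/hello.py program is not a typical single-process program (or a single Python script); it is a multi-process MPI application.

GNU parallel expects simpler programs and does not support interaction with MPI processes.

Your cluster has many nodes, and each node may start a different number of MPI processes (with two 8-core CPUs per node, consider the variants: 2 MPI processes with 8 OpenMP threads each; 1 MPI process with 16 threads; 16 MPI processes without threads). To describe the slice of the cluster assigned to your task, there is an interface between the cluster management software and the MPI library used by the Python MPI wrapper that your script uses. The management layer is aprun (and qsub?):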

http://www.nersc.gov/users/computational-systems/retired-systems/hopper/running-jobs/aprun/aprun-man-page/

https://www.nersc.gov/users/computational-systems/retired-systems/hopper/running-jobs/aprun/

You must use the aprun command to launch jobs on the Hopper compute nodes. Use it for serial, MPI, OpenMP, UPC, and hybrid MPI/OpenMP or hybrid MPI/CAF jobs.

https://wickie.hlrs.de/platforms/index.php/CRAY_XE6_Using_the_Batch_System

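The job launcher for XE6 parallel jobs (both MPI and OpenMP) is aprun. ... The aprun example above will start the parallel executable "my_mpi_executable" with the arguments "arg1" and "arg2". The job will be started using 64 MPI processes, with 32 processes placed on each of your allocated nodes (remember that a node consists of 32 cores in the XE6 system). You need to have nodes allocated by the batch system beforehand (qsub).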

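There is an interface between aprun, qsub, and MPI. In a normal start (aprun -n 32 python /lustre/4_mpi4py/hello.py), aprun simply starts several (32) processes of your MPI program, sets the id of each process through that interface, and gives them all a group id (for example, with environment variables such as PMI_ID; the actual variables are specific to the launcher/MPI library combination).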

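GNU parallel has no interface to MPI programs and knows nothing about these variables. It will simply start 8 times more processes than expected. All 32 * 8 processes in your incorrect command will have the same group id, and for every MPI process id there will be 8 processes sharing it. This will make your MPI library misbehave.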
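To make the numbers concrete, here is a small counting sketch. It starts nothing; it only assumes, as described above, that each process started by aprun carries one rank id in its environment and that every child forked by GNU parallel inherits that id unchanged. The function name simulate_launch is made up for this illustration.

```python
from collections import Counter


def simulate_launch(launcher_slots, parallel_jobs):
    """Return the rank id inherited by every process that ends up running.

    launcher_slots: number of shells started by the launcher (aprun -n N)
    parallel_jobs:  number of children each shell runs at once (parallel -j J)
    """
    inherited = []
    for rank in range(launcher_slots):   # one shell per launcher slot
        for _ in range(parallel_jobs):   # children forked inside that shell
            inherited.append(rank)       # a child copies its parent's environment
    return inherited


ranks = simulate_launch(32, 8)
per_rank = Counter(ranks)

print("processes running at once: %d" % len(ranks))
print("processes claiming rank 0: %d" % per_rank[0])
print("distinct rank ids:         %d" % len(per_rank))
```

Under these assumptions the MPI library sees 32 distinct rank ids but many more processes than that, which is the inconsistency the answer warns about.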

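Never mix MPI resource managers/launchers with ancient, pre-MPI Unix process forkers such as xargs, parallel, or "very advanced bash scripting for parallelism". MPI is there for doing the parallel work, and the MPI launchers/job managers (aprun, mpirun, mpiexec) are there for starting multiple processes, forking, and ssh-ing to machines.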

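Don't do aprun -n 32 sh -c 'parallel anything_with_MPI'. This is an unsupported combination. The only possible (allowed) argument to aprun is a program with some supported form of parallelism, such as OpenMP, MPI, or MPI+OpenMP, or a non-parallel program (or a single script that starts ONE parallel program).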

If you have several independent MPI tasks to start, pass several colon-separated segments to aprun: aprun -n 8 ./program_to_process_file1 : -n 8 ./program_to_process_file2 : -n 8 ./program_to_process_file3 : -n 8 ./program_to_process_file4
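When the list of programs is long, the command line can be assembled by a script. The sketch below only builds the string; it does not run aprun. It assumes the colon-separated multiple-program form of aprun, and the helper name build_aprun_command is hypothetical. Check the aprun man page on your system for how the segments are placed on nodes.

```python
def build_aprun_command(segments):
    """Build an aprun command line from (program, process_count) pairs.

    Each pair becomes one "-n COUNT PROGRAM" segment; segments are
    joined with " : ", which is how aprun separates multiple programs.
    """
    if not segments:
        raise ValueError("at least one program is required")
    parts = ["-n %d %s" % (count, program) for program, count in segments]
    return "aprun " + " : ".join(parts)


# Program names taken from the example in the answer.
programs = [("./program_to_process_file%d" % i, 8) for i in range(1, 5)]
print(build_aprun_command(programs))
```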

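If you have multiple files to process, try starting many parallel jobs: use several qsub submissions instead of a single one, and let PBS (or whichever job manager is in use) manage your jobs.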

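If you have a very large number of files, you can try not to use MPI in your program at all (never link MPI libraries or include MPI headers) and use parallel or another form of ancient parallelism, which stays hidden from aprun. Alternatively, use a single MPI program and distribute the files directly in your code (the master MPI process can open the file list and then distribute the files among the other MPI processes, with or without the dynamic process management of MPI / mpi4py: http://pythonhosted.org/mpi4py/usrman/tutorial.html#dynamic-process-management).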
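The second option, a single MPI program whose master process hands out the files, could look roughly like the sketch below. It is a minimal static distribution without dynamic process management. The file names are placeholders, split_round_robin is a helper introduced here, and the fallback to a single process when mpi4py is not installed exists only so the sketch can be tried outside the cluster.

```python
def split_round_robin(items, parts):
    """Split items into `parts` lists, dealing them out like cards."""
    return [items[i::parts] for i in range(parts)]


def main():
    try:
        from mpi4py import MPI
    except ImportError:
        # No MPI available: behave like a single-process run.
        comm, rank, size = None, 0, 1
    else:
        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

    chunks = None
    if rank == 0:
        # Only the master builds the file list (placeholder names).
        files = ["input_%02d.dat" % i for i in range(1, 17)]
        chunks = split_round_robin(files, size)

    # Every rank receives exactly one chunk from the master.
    if comm is not None:
        my_files = comm.scatter(chunks, root=0)
    else:
        my_files = chunks[0]

    for name in my_files:
        print("rank %02d of %02d would process %s" % (rank, size, name))


if __name__ == "__main__":
    main()
```

Such a script would be started once through the launcher, for example aprun -n 8 python distribute.py (the script name is a placeholder), so that the launcher and the MPI library agree on the number of processes.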

Some scientists try to combine MPI and parallel in the other order: parallel ... aprun ... or parallel ... mpirun ...:

https://rcc.uchicago.edu/docs/tutorials/kicp-tutorials/running-jobs.html#gnu-parallel
http://www.hpc.lsu.edu/training/weekly-materials/2017-Spring/gnuparallel-Feb2017.pdf#page=41
There is also a version of parallel for your Cray: https://github.com/levinas/cray-parallel
