
I cannot run Open MPI under Slurm through a Slurm script

In general, I can obtain the hostname and run Open MPI on my machine.

$ mpirun hostname
myHost
$ cd NPB3.3-SER/ && make ua CLASS=B && mpirun -n 1 bin/ua.B.x inputua.data # Works

However, if I do the same thing through a Slurm script, mpirun hostname returns an empty string, and consequently I am unable to run mpirun -n 1 bin/ua.B.x inputua.data.

slurm-script.sh:

#!/bin/bash
#SBATCH -o slurm.out        # STDOUT
#SBATCH -e slurm.err        # STDERR
#SBATCH --mail-type=ALL

export LD_LIBRARY_PATH="/usr/lib/openmpi/lib"
mpirun hostname > output.txt # Returns empty
cd NPB3.3-SER/ 
make ua CLASS=B 
mpirun --host myHost -n 1 bin/ua.B.x inputua.data

$ sbatch -N1 slurm-script.sh
Submitted batch job 1

The error I get:

There are no allocated resources for the application
  bin/ua.B.x
that match the requested mapping:    
------------------------------------------------------------------
Verify that you have mapped the allocated resources properly using the
--host or --hostfile specification.

A daemon (pid unknown) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
------------------------------------------------------------------

2 Answers


If Slurm and OpenMPI are recent versions, make sure that OpenMPI is compiled with Slurm support (run ompi_info | grep slurm to find out) and run srun bin/ua.B.x inputua.data in your submission script.

Alternatively, mpirun bin/ua.B.x inputua.data should also work.
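A minimal sketch of that check, assuming ompi_info is on the PATH inside the job. The function name has_slurm_support and the LAUNCHER variable are illustrative and not part of the original answer. Matching the word "slurm" in the ompi_info output is only a heuristic, and whether srun can launch MPI processes directly may also depend on how PMI support was configured. The sketch only prints the launch line it would pick rather than running it.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the text on stdin mentions a Slurm component.
has_slurm_support() {
    grep -qi 'slurm'
}

# Pick a launcher from what ompi_info reports (if it is installed at all).
# The fallback mirrors the hostfile-based mpirun call for builds without Slurm support.
if command -v ompi_info >/dev/null 2>&1 && ompi_info | has_slurm_support; then
    LAUNCHER="srun"
else
    LAUNCHER="mpirun --hostfile output.txt -n 1"
fi

# Print the command instead of running it, so the choice can be inspected first.
echo "${LAUNCHER} bin/ua.B.x inputua.data"
```

Once the printed line looks right, it can be pasted into the submission script in place of the existing mpirun call.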

If OpenMPI was compiled without Slurm support, the following should work:

srun hostname > output.txt
cd NPB3.3-SER/ 
make ua CLASS=B 
mpirun --hostfile output.txt -n 1 bin/ua.B.x inputua.data

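Also make sure that by running export LD_LIBRARY_PATH="/usr/lib/openmpi/lib" you do not overwrite other library paths that are necessary. It would probably be better to use export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/lib/openmpi/lib" (or a more complex version if you want to avoid a leading : when the variable was initially empty).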
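One way to write the "more complex version" is the ${VAR:+...} parameter expansion, which inserts the separator only when the variable already has a value. This is a sketch; the helper name append_path is made up for illustration. Avoiding the leading colon matters because the dynamic linker treats an empty entry in LD_LIBRARY_PATH as the current directory.

```shell
#!/bin/bash
# Hypothetical helper: print the path list $1 with directory $2 appended,
# adding the ':' separator only when $1 is non-empty.
append_path() {
    local current="$1" dir="$2"
    printf '%s\n' "${current:+${current}:}${dir}"
}

# Extend the existing value instead of overwriting it.
LD_LIBRARY_PATH="$(append_path "${LD_LIBRARY_PATH:-}" "/usr/lib/openmpi/lib")"
export LD_LIBRARY_PATH
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"
```

The same effect fits on one line as export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}/usr/lib/openmpi/lib".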

answered 2019-03-29T15:02:59.883

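What you need is: 1) to run mpirun, 2) from Slurm, 3) with --host. To determine what is responsible for this not working (Problem 1), you could test a few things. Whatever you test, you should test exactly the same thing via the command line (CLI) and via Slurm (S). It is understood that some of these tests will produce different results in the CLI and S cases.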

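A few notes: 1) You are not testing exactly the same things in CLI and S. 2) You say that you are "unable to run mpirun -n 1 bin/ua.B.x inputua.data", while the problem is actually with mpirun --host myHost -n 1 bin/ua.B.x inputua.data. 3) The fact that mpirun hostname > output.txt returns an empty file (Problem 2) does not necessarily have the same origin as your main problem, see the paragraph above. You can work around this by using scontrol show hostnames, or the environment variable SLURM_NODELIST (on which scontrol show hostnames is based), but this will not solve Problem 1.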


To work around Problem 2, which is not the most important one, try a few things via both CLI and S. The Slurm script below may be helpful.

#!/bin/bash
#SBATCH -o slurm_hostname.out        # STDOUT
#SBATCH -e slurm_hostname.err        # STDERR
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}/usr/lib64/openmpi/lib"

mpirun hostname > hostname_mpirun.txt               # 1. Returns values ok for me
hostname > hostname.txt                             # 2. Returns values ok for me
hostname -s > hostname_s.txt                        # 3. Returns values ok for me
scontrol show hostnames > hostname_scontrol.txt     # 4. Returns values ok for me
echo "${SLURM_NODELIST}" > hostname_slurmcontrol.txt  # 5. Returns values ok for me

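(The export command uses the ${VAR:+...} expansion so that no leading colon is added when LD_LIBRARY_PATH is initially empty.) From what you say, I understand that 2, 3, 4 and 5 work for you, and 1 does not. So you could now use mpirun with a suitable --host or --hostfile option.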

Note the different output formats of scontrol show hostnames (e.g., for me, cnode17<newline>cnode18) and echo ${SLURM_NODELIST} (cnode[17-18]).
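Since mpirun --host takes a comma-separated list, the one-name-per-line output of scontrol show hostnames is the easier of the two formats to convert. A minimal sketch, in which the helper name join_hosts and the HOSTS variable are illustrative; outside a Slurm job it falls back to the local host name so that it still runs, and it only prints the resulting command:

```shell
#!/bin/bash
# Hypothetical helper: join newline-separated host names from stdin with commas.
join_hosts() {
    paste -sd, -
}

# Inside a Slurm job, expand the compact node list (e.g. cnode[17-18]);
# otherwise use the local host name so the sketch still runs outside Slurm.
if command -v scontrol >/dev/null 2>&1 && [ -n "${SLURM_NODELIST:-}" ]; then
    HOSTS="$(scontrol show hostnames "${SLURM_NODELIST}" | join_hosts)"
else
    HOSTS="$(uname -n | join_hosts)"
fi

# Print the resulting command line rather than launching anything.
echo "mpirun --host ${HOSTS} -n 1 bin/ua.B.x inputua.data"
```

For a --hostfile instead, the output of scontrol show hostnames can be redirected to a file unchanged, because that format is already one host per line.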

The host names could perhaps also be obtained from file names that are set dynamically with %h and %n in slurm.conf; look for, e.g., SlurmdLogFile and SlurmdPidFile.


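To diagnose/work around/solve Problem 1, try mpirun with and without --host, in both CLI and S. From what you say, and assuming you used the correct syntax in each case, this is the outcome: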

  1. mpirun, CLI (original post). "Works".

  2. mpirun, S (comment?). Same error as item 4 below? Note that mpirun hostname in S should have produced similar output in your slurm.err.

  3. mpirun --host, CLI (comment). Error

    There are no allocated resources for the application bin/ua.B.x that match the requested mapping:
    ...
    This may be because the daemon was unable to find all the needed shared
    libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
    location of the shared libraries on the remote nodes and this will
    automatically be forwarded to the remote nodes.
    
  4. mpirun --host, S (original post). Error (same as item 3 above?)

    There are no allocated resources for the application
      bin/ua.B.x
    that match the requested mapping:    
    ------------------------------------------------------------------
    Verify that you have mapped the allocated resources properly using the
    --host or --hostfile specification.
    ...
    This may be because the daemon was unable to find all the needed shared
    libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
    location of the shared libraries on the remote nodes and this will
    automatically be forwarded to the remote nodes.
    

According to the comments, you may have set a wrong LD_LIBRARY_PATH. You may also need to use mpirun --prefix ...
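To compare the four cases above on equal terms, the same small script can be run once from the command line and once through sbatch. The sketch below is an assumption about how one might organize that: run_case and ENV_TAG are made-up names, the presence of SLURM_JOB_ID is used to tell S from CLI, and mpirun hostname stands in for the real benchmark.

```shell
#!/bin/bash
# Hypothetical helper: run a command, capture its output in <label>.log,
# and print one summary line with the exit status.
run_case() {
    local label="$1"
    shift
    local rc=0
    "$@" > "${label}.log" 2>&1 || rc=$?
    echo "${label}: exit ${rc}"
    return "${rc}"
}

# Tag results by environment: "S" inside a Slurm job, "CLI" otherwise.
if [ -n "${SLURM_JOB_ID:-}" ]; then ENV_TAG="S"; else ENV_TAG="CLI"; fi

if command -v mpirun >/dev/null 2>&1; then
    # Same two commands in both environments: without and with --host.
    run_case "${ENV_TAG}_mpirun" mpirun hostname || true
    run_case "${ENV_TAG}_mpirun_host" mpirun --host "$(uname -n)" hostname || true
else
    echo "mpirun not found on PATH"
fi
```

Comparing the CLI_*.log and S_*.log files afterwards shows whether the failure follows --host, Slurm, or both.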

Related? https://github.com/easybuilders/easybuild-easyconfigs/issues/204

answered 2019-04-01T07:48:10.200