
Question

For deploying Singularity software containers on an HPC system, is it better to

  1. copy the relevant HPC libraries into the container from the host,
  2. bind them into the container from the host,
  3. or install them into the container during bootstrap

(each option is sketched below)? If strategy 1. or 2. is generally recommendable, how do I find out which libraries need to be copied/bound, and from where to where?

Better may mean better ease of use, better stability and efficiency of the solution, or better independence and reproducibility of the solution.
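In Singularity definition-file terms, the three strategies map roughly onto the following mechanisms. This is only a sketch; the driver package/library names and paths are illustrative, not a recommendation:

# Strategy 1: copy at build time via a %files section
# (note: this copies from the machine where the image is *built*,
#  which is typically not the HPC compute host)
%files
  /usr/lib/libibverbs/libmthca-rdmav2.so /usr/lib/libibverbs/

# Strategy 2: bind from the host at run time, leaving the image unchanged
#   singularity run -B /usr/lib/libibverbs/:/usr/lib/libibverbs/ test.img

# Strategy 3: install during bootstrap via %post
%post
  apt-get install -y libmthca1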

So far I have mainly tried strategy 3. and relied on error or warning messages to tell me which libraries to install. However, this has not been successful.


Background

The ultimate goal of the container is to run R in parallel via Open MPI on an HPC system. A minimal bootstrap definition file that, for me, runs in parallel looks like this:

Bootstrap: debootstrap
OSVersion: xenial
MirrorURL: http://archive.ubuntu.com/ubuntu/

%post
  # add universe repository
  sed -i 's/main/main universe/g' /etc/apt/sources.list

  apt-get update    
  apt-get install -y --no-install-recommends r-base-dev libopenmpi-dev openmpi-bin
  apt-get clean

  # directory will be bound to host
  mkdir /etc/libibverbs.d

  # Interface R and MPI
  R --slave -e 'install.packages("doMPI", repos="http://cloud.r-project.org/")'


%runscript
  R -e "library(doMPI); cl <- startMPIcluster(count = 5); registerDoMPI(cl); foreach(i=1:5) %dopar% Sys.sleep(10); closeCluster(cl); mpi.quit()"

With this I can execute

singularity run -B /etc/libibverbs.d/:/etc/libibverbs.d/ test.img

and receive some warning messages, but (so far) it works. The warnings:

libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
--------------------------------------------------------------------------
[[12293,2],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: ****

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
.
.
.
[****:01978] 4 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[****:01978] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
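A note that may help with deciding which driver actually matters: the warnings above are emitted for every provider configured in /etc/libibverbs.d, not necessarily the one the hardware needs. On the host (outside the container), the libibverbs-utils tools, assuming they are installed there, show which device is actually present:

ibv_devices   # lists the InfiniBand devices the host exposes
ibv_devinfo   # shows device details; the device name hints at the provider library needed

For example, a device named mlx4_0 (hypothetical) would point at the libmlx4 provider rather than ipathverbs or mthca.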

I have already tried installing the packages libipathverbs1 and libmthca1, which makes the warning messages disappear, but then the parallel run fails:

An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          ****
  MPI_COMM_WORLD rank: 1

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
-------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.

It is suggested here to bind the relevant libraries, but I am not sure which ones (or which additional ones) I need, or even how to find that out (other than by very tedious trial and error).
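One way to narrow this down, short of trial and error, might be to ask the host which verbs providers it configures and where their driver libraries live, and then bind those locations in addition to the config directory. A sketch, where /usr/lib/libibverbs/ is an assumed location to be replaced by whatever ldconfig actually reports:

# on the host: list configured providers and locate their driver libraries
cat /etc/libibverbs.d/*.driver
ldconfig -p | grep rdmav2

# bind both the config directory and the directory holding the driver libraries
singularity run \
  -B /etc/libibverbs.d/:/etc/libibverbs.d/ \
  -B /usr/lib/libibverbs/:/usr/lib/libibverbs/ \
  test.img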


1 Answer


According to the OMPI FAQ, fork may not be called when using IB, unless exec is called directly after the fork. I would bet that some other program or library in your code is forking, and that is what crashes OpenMPI.
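If the fork cannot be eliminated, two workarounds often mentioned alongside that FAQ entry are sketched below. The first only silences the warning and is only safe if the forked child calls exec immediately; the second avoids the InfiniBand transport entirely, at a performance cost:

# silence the fork warning (only safe if the child exec()s right after fork())
export OMPI_MCA_mpi_warn_on_fork=0

# or exclude the openib BTL so Open MPI falls back to TCP (fork-safe, slower)
export OMPI_MCA_btl='^openib'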

Answered 2018-01-09T18:20:56.080