r - Getting mpirun to recognize all cores on each node

Question

I have set up two nodes for MPI, aml1 (master) and aml2 (worker). I am trying to use mpirun with R scripts, using the Rmpi and doMPI libraries. The specs for both machines are the same:

On RHEL 7.3
# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 45
Model name:            Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
Stepping:              7
CPU MHz:               2900.000
BogoMIPS:              5790.14
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

The hwloc lstopo output is also available, if you care to see it.

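I am using OpenMPI 1.10.5, and I can see processes running on both aml1 and aml2. However, my test script does not run any faster when I increase the number of workers spawned from mpirun, so I see no decrease in computation time. This leads me to assume that mpirun is not properly detecting how many cores are available, or that I am assigning them incorrectly in the hostfile or rankfile.

If I change my hostfile or rankfile to different slot values: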

$ cat hosts
aml1 slots=4 max_slots=8  #I can change this to 10 slots
aml2 slots=4

$ cat rankfile
rank 0=aml1 slot=0:0   
rank 1=aml1 slot=0:1
rank 2=aml1 slot=0:2
rank 3=aml1 slot=0:3
rank 4=aml2 slot=0:6
rank 5=aml2 slot=0:7    #I can add more ranks

and then run:

$ mpirun -np 1 --hostfile hosts --rankfile rankfile R --slave -f example7.R

$ cat example7.R
library(doMPI)
cl <- startMPIcluster(verbose=TRUE)
registerDoMPI(cl)

system.time(x <- foreach(seed=c(7, 11, 13), .combine="cbind") %dopar% {
 set.seed(seed)
 rnorm(90000000)
 })

closeCluster(cl)
mpi.quit(save="no")

I still get similar elapsed times:

Spawning 5 workers using the command:
 5 slaves are spawned successfully. 0 failed.
   user  system elapsed
  9.023   7.396  16.420

Spawning 25 workers using the command:
 25 slaves are spawned successfully. 0 failed.
   user  system elapsed
  4.752   8.755  13.508

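I also tried setting up Torque and building Open MPI with the tm configure option, but I am having separate issues with that. I believe I do not need Torque to accomplish what I want to do, but please correct me if I am wrong.

What I want to do is run an R script with Rmpi and doMPI. The R script itself should run only once, with a section of its code spawned out to the cluster. I want to make full use of the cores available on both nodes (aml1, aml2).

Any help from the community is appreciated!

Update 1

Here is a bit more detail: I ran the following, changing the hostfile for each run: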

$ mpirun -np 1 --hostfile hosts [using --map-by slot or node] R --slave -f example7.R
+----------------+---------------+---------------+
| slots per host | --map-by node | --map-by slot |
|                | time (s)      | time (s)      |
+----------------+---------------+---------------+
| 2              | 24.1          | 24.109        |
| 4              | 18            | 12.605        |
| 4              | 18.131        | 12.051        |
| 6              | 18.809        | 12.682        |
| 6              | 19.027        | 12.69         |
| 8              | 18.982        | 12.82         |
| 8              | 18.627        | 12.76         |
+----------------+---------------+---------------+

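Should I be getting lower times? Or is this as good as it gets? I feel like I should be able to increase the slots per host to 30 for peak performance, but it peaks at around 4 slots per host.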

score 0 · Accepted Answer

I think I have found the answer to my own question.

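Since I am new to this, I was under the assumption that Torque would automatically use all the "cores" available on a machine/node. Since lscpu reports 32 CPUs, I was expecting 32 workers to spawn per node. In fact there are 16 physical cores, each with hyperthreading, which gives 16x2 logical cores per machine. From my understanding, Torque launches one process per processor (or physical core, in this example). So I should not expect 32 workers to spawn per node.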
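
The core arithmetic above can be checked directly from the lscpu fields. The following is a minimal sketch that parses a sample copied from the lscpu output in the question; the `field` helper and the variable names are illustrative, not part of any tool.

```shell
#!/bin/sh
# Sample values copied from the lscpu output in the question.
lscpu_sample='CPU(s):                32
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2'

# Hypothetical helper: print the value of one lscpu field from the sample.
field() {
  printf '%s\n' "$lscpu_sample" |
    awk -F: -v key="$1" '$1 == key { gsub(/ /, "", $2); print $2 }'
}

sockets=$(field 'Socket(s)')
cores_per_socket=$(field 'Core(s) per socket')
threads_per_core=$(field 'Thread(s) per core')

# Physical cores exclude hyperthreads; logical CPUs include them.
physical_cores=$((sockets * cores_per_socket))
logical_cpus=$((physical_cores * threads_per_core))

echo "physical cores per node: $physical_cores"
echo "logical CPUs per node:   $logical_cpus"
```

On a live node, the sample string could be replaced with the output of `lscpu` itself.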

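I looked further into NUMA support, and per the Open MPI FAQ, RHEL typically requires the numactl-devel package to be installed before building in order to support memory affinity. So I did this on each node, and I am now able to run an R script through Torque, defining 8 cores or 16 cores per node. The computation times are now quite similar. If I increase to 18/20 cores per node, performance drops as expected.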
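
One more factor may be worth checking when reading the timings: example7.R iterates over only three seeds. Assuming doMPI hands out one task per foreach iteration (an assumption about its default chunking, not something stated in this thread), no more than three workers can be busy at once, and each task sends 90,000,000 doubles back to the master. A rough sketch of that arithmetic, with illustrative names:

```shell
#!/bin/sh
tasks=3                    # seeds 7, 11, 13 in example7.R
values_per_task=90000000   # rnorm(90000000)
bytes_per_double=8         # R numeric values are 8-byte doubles

# Workers that can be busy at once: capped by the number of tasks.
busy_workers() {
  n=$1
  if [ "$n" -lt "$tasks" ]; then echo "$n"; else echo "$tasks"; fi
}

# Worker counts taken from the two runs shown in the question.
for w in 5 25; do
  echo "workers=$w busy_at_most=$(busy_workers "$w")"
done

# Size of the result each task returns to the master, in MB (10^6 bytes).
result_mb=$((values_per_task * bytes_per_double / 1000000))
echo "each task returns about ${result_mb} MB to the master"
```

If that assumption holds, a test loop with many more iterations than workers would give a clearer picture of how the run time scales with slots.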

Below are my ./configure options for Torque and Open MPI, respectively:

./configure --enable-cgroups --with-hwloc-path=/usr/local --enable-autorun --prefix=/var/spool/torque 


./configure --prefix=/var/nfsshare/openmpi1.10.5-tm-3 --with-tm=/var/spool/torque/
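
For the hostfile route from the question (without Torque), the same conclusion would translate into one slot per physical core. This is only a sketch under that assumption: the hostfile contents are hypothetical, and the mpirun line is shown as a comment rather than executed.

```shell
#!/bin/sh
# Hypothetical hostfile: one slot per physical core on each node.
hostfile=$(mktemp)
cat > "$hostfile" <<'EOF'
aml1 slots=16
aml2 slots=16
EOF

# Sum the slots= values to see how many MPI processes the hostfile allows.
total_slots=$(awk '{
  for (i = 2; i <= NF; i++)
    if ($i ~ /^slots=/) { sub(/^slots=/, "", $i); sum += $i }
} END { print sum + 0 }' "$hostfile")
echo "total slots: $total_slots"

# Not run here; the launch command from the question would then be:
# mpirun -np 1 --hostfile "$hostfile" R --slave -f example7.R
rm -f "$hostfile"
```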
