I am running LAMMPS simulations on an AMD 2990WX (Ubuntu 18.04).
When I run only one LAMMPS job with mpirun, as shown below:
#!/bin/sh
LAMMPS_HOME=/APP/LAMMPS/src
MPI_HOME=/APP/LIBS/OPENMPI2
Tf=0.30
$MPI_HOME/bin/mpirun -np 8 --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.020 -var Tf ${Tf}
I have no problem, and the simulation runs as intended.
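But when I run the script below, each LAMMPS job takes almost 3 times as long as the single job. As a result, running in parallel gives me no performance gain: each of the 3 jobs runs at about 1/3 the speed of a single job, so the total throughput is the same as running them one after another.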
#!/bin/sh
LAMMPS_HOME=/APP/LAMMPS/src
MPI_HOME=/APP/LIBS/OPENMPI2
Tf=0.30
$MPI_HOME/bin/mpirun -np 8 --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.020 -var Tf ${Tf} &
$MPI_HOME/bin/mpirun -np 8 --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.025 -var Tf ${Tf} &
$MPI_HOME/bin/mpirun -np 8 --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.030 -var Tf ${Tf}
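One way to put numbers on the roughly 3x slowdown is to compare the timing LAMMPS itself reports. After each run command, LAMMPS writes a summary line of the form "Loop time of T on P procs for N steps with M atoms" to its log. The sketch below extracts T from one or more log files; the function name loop_times and the log file names are hypothetical. Note that LAMMPS writes to log.lammps in the current directory by default, so unless the input script sets its own log file, the three concurrent jobs above would all write to the same log; each job needs its own log (for example via the LAMMPS -log command-line switch) for this comparison to work.

```shell
#!/bin/sh
# Sketch: print the "Loop time" (in seconds) recorded in each LAMMPS log
# given as an argument. Assumes the standard LAMMPS summary line
#   Loop time of T on P procs for N steps with M atoms
# A log that contains several run commands yields one line per run.
loop_times() {
    for log in "$@"; do
        # Fields: $4 = wall time in seconds, $6 = number of MPI processes.
        awk -v f="$log" '/^Loop time of/ { print f, $4, "s on", $6, "procs" }' "$log"
    done
}

# Usage (hypothetical file names): sh loop_times.sh log.single log.shear_0.020
if [ $# -ge 1 ]; then
    loop_times "$@"
fi
```

Running it on the log of the single job and on the logs of the concurrent jobs shows directly whether each concurrent run's loop time is about 3 times larger.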
It is the same without the hostfile my_host. The hostfile is as follows:
<hostname> slots=32 max-slots=32
I built Open MPI with --with-cuda, FFTW with --enable-shared, and LAMMPS with several packages.
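I have tried Open MPI v1.8, v3.0 and v4.0, and FFTW v3.3.8. There is enough RAM and enough storage. I also checked the load average and the core usage: while the second script is running, they show the machine using 24 cores (or the equivalent load).
The same problem occurs when I run copies of the first script at the same time in separate terminals (i.e. sh first.sh in each terminal).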
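The load-average and core-usage check mentioned above can be reproduced with standard Linux tools. This is a sketch, assuming procps ps (whose psr column is the logical CPU a process last ran on) and coreutils nproc; the helper name cpus_used_by is hypothetical, and lmp_lmp_mpi is the executable name from the question.

```shell
#!/bin/sh
# Sketch: summarize system load and count how many distinct logical CPUs
# the LAMMPS processes are currently running on.

# Read "PSR COMMAND" pairs on stdin and print how many distinct PSR values
# (logical CPUs) belong to processes whose command name equals $1.
cpus_used_by() {
    awk -v name="$1" '$2 == name && !seen[$1]++ { n++ } END { print n + 0 }'
}

# 1-, 5- and 15-minute load averages, then the number of logical CPUs.
cat /proc/loadavg
nproc
# psr = processor the process last ran on; comm = command name.
ps -eo psr=,comm= | cpus_used_by lmp_lmp_mpi
```

With three 8-rank jobs running, the last number should be 24 if every rank sits on its own logical CPU; a smaller number would mean ranks are sharing CPUs.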
Is there something wrong with how I call mpirun from my shell script?
Or is there any known issue with mpirun (or LAMMPS) on Ryzen?
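Since part of the question is whether the CPU itself plays a role, it may help to see how Linux groups the logical CPUs into NUMA nodes on this machine; the 2990WX is a multi-die part that Linux typically exposes as several NUMA nodes. A sketch, assuming the standard Linux sysfs layout under /sys/devices/system/node; the function name numa_cpulists and its optional root argument are hypothetical (the argument exists only so it can be pointed at a test directory). If installed, numactl --hardware or lscpu show similar information.

```shell
#!/bin/sh
# Sketch: print the logical CPUs that belong to each NUMA node, as
# reported by the Linux kernel in sysfs (one "nodeN: cpulist" line each).
numa_cpulists() {
    root=${1:-/sys/devices/system/node}
    for d in "$root"/node[0-9]*; do
        # Skip the literal pattern if nothing matched, or unreadable entries.
        [ -r "$d/cpulist" ] || continue
        printf '%s: %s\n' "${d##*/}" "$(cat "$d/cpulist")"
    done
}

numa_cpulists
```

Comparing this output with the CPU ranges each job is pinned to shows which NUMA node(s) every job runs on.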
UPDATE
I have tested the following script:
#!/bin/sh
LAMMPS_HOME=/APP/LAMMPS/src
MPI_HOME=/APP/LIBS/OPENMPI2
Tf=0.30
$MPI_HOME/bin/mpirun --cpu-set 0-7 --bind-to core -np 8 --report-bindings --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.020 -var Tf ${Tf} &
$MPI_HOME/bin/mpirun --cpu-set 8-15 --bind-to core -np 8 --report-bindings --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.025 -var Tf ${Tf} &
$MPI_HOME/bin/mpirun --cpu-set 16-23 --bind-to core -np 8 --report-bindings --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.030 -var Tf ${Tf}
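The three pinned launches above follow a simple pattern (job i gets logical CPUs 8i to 8i+7), so they can be written as a loop that also waits for all jobs to finish. This is a sketch: paths, flags and values are copied from the script above, while DRY_RUN and the per-job log names (passed with the LAMMPS -log switch so the jobs do not share one log file) are additions for illustration. By default it only prints the three command lines.

```shell
#!/bin/sh
# Sketch: the three pinned launches written as a loop.
# Job i (0-based) gets logical CPUs i*NP .. i*NP+NP-1, i.e. 0-7, 8-15, 16-23.
LAMMPS_HOME=/APP/LAMMPS/src
MPI_HOME=/APP/LIBS/OPENMPI2
Tf=0.30
NP=8
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = launch them

# Print the "first-last" logical CPU range for job number $1.
cpu_range() {
    first=$(( $1 * NP ))
    echo "${first}-$(( first + NP - 1 ))"
}

i=0
for shear in 0.020 0.025 0.030; do
    # Build the command in the positional parameters so that each argument
    # stays a separate word when executed, even if a path contains spaces.
    set -- "$MPI_HOME/bin/mpirun" --cpu-set "$(cpu_range "$i")" --bind-to core \
        -np "$NP" --report-bindings --hostfile my_host \
        "$LAMMPS_HOME/lmp_lmp_mpi" -in "$PWD/../01_Annealing/in.01_Annealing" \
        -var MaxShear "$shear" -var Tf "$Tf" -log "log.shear_$shear"
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"            # print the command line, space-separated
    else
        "$@" &               # launch this job in the background
    fi
    i=$(( i + 1 ))
done

# Return only after every background job has finished.
wait
```

Run it once as is to check the printed commands, then launch with DRY_RUN=0 sh run3.sh (the file name is hypothetical).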
The output is as follows:
[<hostname>:09617] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09619] MCW rank 4 bound to socket 0[core 20[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../BB/../../../../../../../../../../..]
[<hostname>:09619] MCW rank 5 bound to socket 0[core 21[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../BB/../../../../../../../../../..]
[<hostname>:09619] MCW rank 6 bound to socket 0[core 22[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../BB/../../../../../../../../..]
[<hostname>:09619] MCW rank 7 bound to socket 0[core 23[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../../BB/../../../../../../../..]
[<hostname>:09619] MCW rank 0 bound to socket 0[core 16[hwt 0-1]]: [../../../../../../../../../../../../../../../../BB/../../../../../../../../../../../../../../..]
[<hostname>:09619] MCW rank 1 bound to socket 0[core 17[hwt 0-1]]: [../../../../../../../../../../../../../../../../../BB/../../../../../../../../../../../../../..]
[<hostname>:09619] MCW rank 2 bound to socket 0[core 18[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../BB/../../../../../../../../../../../../..]
[<hostname>:09619] MCW rank 3 bound to socket 0[core 19[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../BB/../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 4 bound to socket 0[core 12[hwt 0-1]]: [../../../../../../../../../../../../BB/../../../../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 5 bound to socket 0[core 13[hwt 0-1]]: [../../../../../../../../../../../../../BB/../../../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 6 bound to socket 0[core 14[hwt 0-1]]: [../../../../../../../../../../../../../../BB/../../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 7 bound to socket 0[core 15[hwt 0-1]]: [../../../../../../../../../../../../../../../BB/../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 0 bound to socket 0[core 8[hwt 0-1]]: [../../../../../../../../BB/../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 1 bound to socket 0[core 9[hwt 0-1]]: [../../../../../../../../../BB/../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 2 bound to socket 0[core 10[hwt 0-1]]: [../../../../../../../../../../BB/../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 3 bound to socket 0[core 11[hwt 0-1]]: [../../../../../../../../../../../BB/../../../../../../../../../../../../../../../../../../../..]
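To double-check the impression that this report looks normal, the lines can be scanned for any core that two ranks (from the same job or from different jobs) were bound to. A sketch, assuming the line format shown above and that the output was saved to a file, e.g. with sh run3.sh > bindings.log 2>&1 (both streams are captured because the report may be written to stderr); the script, function and file names are hypothetical.

```shell
#!/bin/sh
# Sketch: list every core that more than one MPI rank was bound to,
# reading Open MPI --report-bindings lines on stdin. Assumes the format
#   ... bound to socket S[core N[hwt A-B]]: [...]
dup_cores() {
    # Pull out each "core N" token, sort them, keep only repeated ones.
    grep -o 'core [0-9][0-9]*' | sort | uniq -d
}

# Usage: sh check_bindings.sh bindings.log
if [ $# -ge 1 ]; then
    dups=$(dup_cores < "$1")
    if [ -n "$dups" ]; then
        echo "Cores shared by more than one rank:"
        echo "$dups"
    else
        echo "No core is shared between ranks."
    fi
fi
```

For the 24 lines above, every rank reports a different core (0 to 23), so this check would print that no core is shared.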
I don't know much about MPI, but to me this output doesn't show anything unusual. Is something wrong?