I get the following error when I try to submit a job with sbatch:
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
When I run sbatch without any options, the job runs fine, but as soon as I pass any option to sbatch (for example --job-name or --export), I get the error above.
I am using Open MPI 3 and launching a Python script with mpirun. Both mpirun and orted appear to come from the same Open MPI installation, as shown by calling which in my Slurm script just before mpirun:
which mpirun: /opt/openmpi30/bin/mpirun
which orted: /opt/openmpi30/bin/orted
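
For reference, here is a minimal sketch of that kind of check, not my exact script. The job name, the task count, the helper name report_tool, and the file name my_script.py are all placeholders for illustration. It prints where each tool resolves on PATH in the same form as the output above, and reports "not found" if the tool is missing from the job's environment:

```shell
#!/bin/bash
#SBATCH --job-name=mpi-check   # hypothetical job name, for illustration only
#SBATCH --ntasks=2             # hypothetical task count

# Print where a tool resolves on PATH in the form "which NAME: PATH",
# or "which NAME: not found" if it is not on PATH inside the job.
report_tool() {
    local path
    if path=$(command -v "$1" 2>/dev/null); then
        echo "which $1: $path"
    else
        echo "which $1: not found"
    fi
}

report_tool mpirun
report_tool orted

# The actual launch would follow here, for example:
# mpirun python my_script.py   # my_script.py is a placeholder name
```

Running the same check both with and without the extra sbatch options should show whether the environment seen by the job differs between the two cases.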
Any help would be greatly appreciated.