
I am using ScaLAPACK from Intel MKL, with MPICH2 installed on a network of Windows 7 machines. When I run a job on a single machine, it works fine, for example:

mpiexec -n 16 Z:/myprogram.exe       # Z:/ is a mapped network drive

It also works fine across the network when I run cpi.exe (the test program that ships with MPICH2):

mpiexec -hosts 4 host1 4 host2 4 ... Z:/cpi.exe

My own code (a solver for large systems of linear equations) also works fine on a single machine. However, it fails when I run:

mpiexec -hosts 2 host1 8 host2 8 Z:/myprogram.exe

The message is:

Fatal error in PMPI_Comm_create: Other MPI error, error stack:
PMPI_Comm_create(609)................: MPI_Comm_create(MPI_COMM_WORLD, group=0x8
8000001, new_comm=001DF644) failed
PMPI_Comm_create(590)................:
MPIR_Comm_create_intra(250)..........:
MPIR_Get_contextid(521)..............:
MPIR_Get_contextid_sparse(683).......:
MPIR_Allreduce_impl(712).............:
MPIR_Allreduce_intra(197)............:
allreduce_intra_or_coll_fn(106)......:
MPIR_Allreduce_intra(357)............:
MPIC_Sendrecv(192)...................:
MPIC_Wait(540).......................:
MPIDI_CH3I_Progress(402).............:
MPID_nem_mpich2_blocking_recv(905)...:
MPID_nem_newtcp_module_poll(37)......:
MPID_nem_newtcp_module_connpoll(2656):
gen_cnting_fail_handler(1739)........: connect failed - Semaphore timeout period has expired. (errno 121)

job aborted:
rank: node: exit code[: error message]
0: 10.30.10.182: 1: process 0 exited without calling finalize
1: 10.30.10.184: 123
2: 10.30.10.184: 123
3: 10.30.10.184: 123
4: 10.30.10.184: 123
5: 10.30.10.184: 123
6: 10.30.10.184: 123
7: 10.30.10.184: 123
8: 10.30.10.184: 123
9: 10.30.10.184: 123
10: 10.30.10.184: 123
11: 10.30.10.184: 123
12: 10.30.10.184: 123
13: 10.30.10.184: 123
14: 10.30.10.184: 123
15: 10.30.10.184: 123

As far as I can tell, the problem occurs early in my C code:

Cblacs_pinfo(&mype, &npe); // OK, gets correct values
Cblacs_get(-1, 0, &icon); // OK, gets correct value
Cblacs_gridinit(&icon, "c", mp, np); // Never returns from this line.
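
Since the error stack points at PMPI_Comm_create (which, as far as I understand, Cblacs_gridinit triggers internally), one way to narrow this down might be a pure-MPI test that creates a communicator across all ranks. This is only a minimal diagnostic sketch of my own, not part of the solver:

/* Diagnostic sketch: exercise MPI_Comm_create across hosts.
   If this also hangs, the problem is in the MPI/TCP layer rather
   than in BLACS/ScaLAPACK (my assumption, based on the error stack). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
    int rank, numprocs;
    MPI_Group world_group;
    MPI_Comm new_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    /* Create a new communicator containing all ranks; this forces the
       same context-id agreement (allreduce) that fails in the stack above. */
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Comm_create(MPI_COMM_WORLD, world_group, &new_comm);

    printf("Rank %d of %d: MPI_Comm_create succeeded\n", rank, numprocs);

    MPI_Comm_free(&new_comm);
    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}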

I would appreciate any help or suggestions; solving this is very important to me.

Edit: The code below does work, so my MPI infrastructure seems to be OK?

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    printf("start\n");

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);

    MPI_Finalize();
    return 0;
}
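
As a next step, I am thinking of extending this working test with just the three BLACS calls, to check whether the hang reproduces outside my solver. A sketch, assuming 16 processes arranged as a 4x4 grid and linking against MKL's BLACS (the prototypes below are my own declarations, added only to keep the example self-contained):

/* Sketch: the working MPI test above plus only the BLACS grid setup,
   to see whether Cblacs_gridinit alone hangs across hosts.
   Assumes a 4x4 grid for 16 processes. */
#include <mpi.h>
#include <stdio.h>

/* C interface to BLACS as provided by Intel MKL (declared here by hand) */
extern void Cblacs_pinfo(int* mype, int* npe);
extern void Cblacs_get(int context, int what, int* val);
extern void Cblacs_gridinit(int* context, char* order, int nprow, int npcol);
extern void Cblacs_gridinfo(int context, int* nprow, int* npcol, int* myrow, int* mycol);
extern void Cblacs_gridexit(int context);

int main(int argc, char* argv[])
{
    int mype, npe, icon;
    int mp = 4, np = 4;                    /* assumed 4x4 process grid */
    int nprow, npcol, myrow, mycol;

    MPI_Init(&argc, &argv);

    Cblacs_pinfo(&mype, &npe);             /* my rank and total processes */
    Cblacs_get(-1, 0, &icon);              /* default system context */
    Cblacs_gridinit(&icon, "c", mp, np);   /* the call that hangs in my solver */
    Cblacs_gridinfo(icon, &nprow, &npcol, &myrow, &mycol);

    printf("Process %d of %d is at grid position (%d,%d)\n",
           mype, npe, myrow, mycol);

    Cblacs_gridexit(icon);
    MPI_Finalize();
    return 0;
}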
