
I am a new user of MVAPICH2 and ran into trouble when I first started using it.
First, I think I installed it successfully with:
./configure --disable-fortran --enable-cuda
make -j 4
make install
There were no errors.
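
To double-check that both nodes pick up this install, something like the following can be used (only a rough sketch: mpiname ships with MVAPICH2 and prints the version and configure options, and the install's bin directory is assumed to be on PATH on both nodes):

    which mpicc mpiexec mpirun_rsh
    mpiname -a
    ssh gpu-cluster-4 'which mpirun_rsh; mpiname -a'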

But when I tried to run the cpi example in the examples directory, I ran into the following:

  1. I can ssh into gpu-cluster-1 and gpu-cluster-4 without a password;

  2. I ran the cpi example with mpirun_rsh on gpu-cluster-1 and on gpu-cluster-4 separately, and it worked fine, like this:
    run@gpu-cluster-1:~/mvapich2-2.1rc1/examples$ mpirun_rsh -ssh -np 2 gpu-cluster-1 gpu-cluster-1 ./cpi
    Process 0 of 2 is on gpu-cluster-1
    Process 1 of 2 is on gpu-cluster-1
    pi is approximately 3.1415926544231318, Error is 0.0000000008333387
    wall clock time = 0.000089

    run@gpu-cluster-4:~/mvapich2-2.1rc1/examples$ mpirun_rsh -ssh -np 2 gpu-cluster-4 gpu-cluster-4 ./cpi
    Process 0 of 2 is on gpu-cluster-4
    Process 1 of 2 is on gpu-cluster-4
    pi is approximately 3.1415926544231318, Error is 0.0000000008333387
    wall clock time = 0.000134

  3. I ran the cpi example across gpu-cluster-1 and gpu-cluster-4 with mpiexec, and it worked fine, like this:
    run@gpu-cluster-1:~/mvapich2-2.1rc1/examples$ mpiexec -np 2 -f hostfile ./cpi
    Process 0 of 2 is on gpu-cluster-1
    Process 1 of 2 is on gpu-cluster-4
    pi is approximately 3.1415926544231318, Error is 0.0000000008333387
    wall clock time = 0.000352
    The content of hostfile is "gpu-cluster-1\ngpu-cluster-4", i.e. one hostname per line (see the listing after this list).

  4. However, the problem appears when I run the cpi example with mpirun_rsh on both gpu-cluster-1 and gpu-cluster-4 at the same time:

    run@gpu-cluster-1:~/mvapich2-2.1rc1/examples$ mpirun_rsh -ssh -np 2 -hostfile hostfile ./cpi
    Process 1 of 2 is on gpu-cluster-4
    ----------------- it hangs here and does not continue -----------------
    After a long time, I pressed Ctrl+C and it showed:

    ^C[gpu-cluster-1:mpirun_rsh][signal_processor] Caught signal 2, killing job
    run@gpu-cluster-1:~/mvapich2-2.1rc1/examples$ [gpu-cluster-4:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 6. MPI process died?
    [gpu-cluster-4:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 6. MPI process died?
    [gpu-cluster-4:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI process died?
    [gpu-cluster-4:mpispawn_1][report_error] connect() failed: Connection refused (111)

I have been puzzled by this for a long time. Could you help me figure out what is wrong?
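
For clarity, the hostfile referenced in steps 3 and 4 just lists one hostname per line:

    run@gpu-cluster-1:~/mvapich2-2.1rc1/examples$ cat hostfile
    gpu-cluster-1
    gpu-cluster-4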

Here is the code of the cpi example:

#include "mpi.h" #include <stdio.h> #include <math.h> double f(double); double f(double a) { return (4.0 / (1.0 + a*a)); } int main(int argc,char *argv[]) { int n, myid, numprocs, i; double PI25DT = 3.141592653589793238462643; double mypi, pi, h, sum, x; double startwtime = 0.0, endwtime; int namelen; char processor_name[MPI_MAX_PROCESSOR_NAME]; MPI_Init(&argc,&argv); MPI_Comm_size(MPI_COMM_WORLD,&numprocs); MPI_Comm_rank(MPI_COMM_WORLD,&myid); MPI_Get_processor_name(processor_name,&namelen); fprintf(stdout,"Process %d of %d is on %s\n", myid, numprocs, processor_name); fflush(stdout); n = 10000; /* default # of rectangles */ if (myid == 0) startwtime = MPI_Wtime(); MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); h = 1.0 / (double) n; sum = 0.0; /* A slightly better approach starts from large i and works back */ for (i = myid + 1; i <= n; i += numprocs) { x = h * ((double)i - 0.5); sum += f(x); } mypi = h * sum; MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD); if (myid == 0) { endwtime = MPI_Wtime(); printf("pi is approximately %.16f, Error is %.16f\n", pi, fabs(pi - PI25DT)); printf("wall clock time = %f\n", endwtime-startwtime); fflush(stdout); } MPI_Finalize(); return 0; }
