mpi - MPI Fox's algorithm with non-blocking send and receive

Question

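I am new to MPI and I am trying to write an implementation of Fox's algorithm (AxB=C, where A and B are n x n matrices). My program works fine, but I would like to see whether I can speed it up by overlapping the communication during the shift of the blocks of matrix B with the computation of the product matrix (the blocks of B are shifted between processes in the algorithm). According to the algorithm, each process in the 2D Cartesian grid holds one block of each of the matrices A, B and C. What I currently have is this, which sits inside Fox's algorithm: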

if (stage > 0) {
    // shifting b values in all processes
    MPI_Bcast(a_temp, n_local*n_local, MPI_DOUBLE, (rowID + stage) % q, row_comm);
    MPI_Isend(b, n_local*n_local, MPI_DOUBLE, nbrs[UP], 111, grid_comm, &my_request1);
    MPI_Irecv(b, n_local*n_local, MPI_DOUBLE, nbrs[DOWN], 111, grid_comm, &my_request2);
    MPI_Wait(&my_request1, &status);
    MPI_Wait(&my_request2, &status);
    multiplyMatrix(a_temp, b, c, n_local);
}

The sub-matrices a_temp, b and b_temp are pointers of type double to blocks of size n/numprocess*n/numprocesses (this is the size of a block matrix, e.g. b = (double *) calloc(n/numprocess*n/numprocesses, sizeof(double))).
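The question does not show multiplyMatrix. For the snippets in this thread to compute AxB=C, it presumably adds the local product to c instead of overwriting it, since each stage contributes one partial product. A minimal sketch under that assumption, with row-major n_local x n_local blocks (the body is hypothetical, only the call signature comes from the post):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the poster's multiplyMatrix.
   Assumption: blocks are row-major and the result is accumulated,
   i.e. c += a * b, so that the stages of Fox's algorithm add up. */
void multiplyMatrix(const double *a, const double *b, double *c, int n_local)
{
    assert(n_local > 0);
    for (int i = 0; i < n_local; i++) {
        for (int k = 0; k < n_local; k++) {
            double aik = a[(size_t)i * n_local + k];
            for (int j = 0; j < n_local; j++)
                c[(size_t)i * n_local + j] += aik * b[(size_t)k * n_local + j];
        }
    }
}
```

With this convention c has to start out as zeros, which the calloc allocation mentioned above would provide.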

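I would like to call multiplyMatrix before the MPI_Wait calls (that would give the overlap of communication and computation), but I don't know how to do that. Do I need two separate buffers and alternate between them in the different stages?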

(I know I can use MPI_Sendrecv_replace, but that does not help with overlapping, since it uses blocking send and receive. The same is true for MPI_Sendrecv.)
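The two-buffer idea can be checked without MPI. The sketch below is a serial emulation of one grid column, not MPI code: each emulated process holds a block id in place of a real block and has two slots, and every stage reads only the current slot (the "send" and the "compute") and writes only the other slot (the "receive"). The function name, MAX_Q and starting at slot 0 are choices made for this illustration.

```c
#include <assert.h>

#define MAX_Q 64  /* illustration-only upper bound on the grid dimension */

/* Returns 1 if, with alternating buffers, process p works on block
   (p + stage) % q at every stage, and 0 otherwise. */
int pingpong_schedule_ok(int q)
{
    assert(q > 0 && q <= MAX_Q);
    int buf[MAX_Q][2];
    for (int p = 0; p < q; p++) {
        buf[p][0] = p;   /* initial block of process p */
        buf[p][1] = -1;  /* second buffer starts empty */
    }
    for (int stage = 0; stage < q; stage++) {
        int cur = stage % 2;
        int next = (stage + 1) % 2;
        /* emulated receive from the neighbor below: writes next, reads cur */
        for (int p = 0; p < q; p++)
            buf[p][next] = buf[(p + 1) % q][cur];
        /* emulated compute: reads cur, which the shift never overwrote */
        for (int p = 0; p < q; p++)
            if (buf[p][cur] != (p + stage) % q)
                return 0;
    }
    return 1;
}
```

Because the slot being read is never the slot being written within a stage, the order of the "receive" and the "compute" inside a stage does not matter, which is the property the overlap relies on.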

score 0 · Accepted Answer

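I actually figured out how to do this. This question should probably be deleted, but since I am new to MPI I will post the solutions here, and if anyone has suggestions for improvement I would be glad if they shared them. Method 1: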

// Fox's algorithm
double *b_buffers[2];
b_buffers[0] = (double *) malloc(n_local*n_local*sizeof(double));
b_buffers[1] = b;
for (stage = 0; stage < q; stage++) {
    // copy a into a_temp; the broadcast below then overwrites a_temp with the root's block on every other process in the row
    for (i = 0; i < n_local*n_local; i++)
        a_temp[i] = a[i];
    if (stage == 0) {
        MPI_Bcast(a_temp, n_local*n_local, MPI_DOUBLE, (rowID + stage) % q, row_comm);
        multiplyMatrix(a_temp, b, c, n_local);
        // first shift of b is done in place; MPI_Isend and MPI_Irecv must not
        // share a buffer, so the blocking MPI_Sendrecv_replace is used here
        MPI_Sendrecv_replace(b, n_local*n_local, MPI_DOUBLE, nbrs[UP], 111, nbrs[DOWN], 111, grid_comm, &status);
    }

    if (stage > 0) {
        // shifting b values in all processes, overlapped with the multiplication
        MPI_Bcast(a_temp, n_local*n_local, MPI_DOUBLE, (rowID + stage) % q, row_comm);
        MPI_Isend(b_buffers[stage % 2], n_local*n_local, MPI_DOUBLE, nbrs[UP], 111, grid_comm, &my_request1);
        MPI_Irecv(b_buffers[(stage + 1) % 2], n_local*n_local, MPI_DOUBLE, nbrs[DOWN], 111, grid_comm, &my_request2);
        // only reads the buffer that is being sent; the receive goes to the other one
        multiplyMatrix(a_temp, b_buffers[stage % 2], c, n_local);
        MPI_Wait(&my_request2, &status);
        MPI_Wait(&my_request1, &status);
    }
}

Method 2:

// Fox's algorithm
for (stage = 0; stage < q; stage++) {
    // copy a into a_temp; the broadcast below then overwrites a_temp with the root's block on every other process in the row
    for (i = 0; i < n_local*n_local; i++)
        a_temp[i] = a[i];
    if (stage == 0) {
        MPI_Bcast(a_temp, n_local*n_local, MPI_DOUBLE, (rowID + stage) % q, row_comm);
        multiplyMatrix(a_temp, b, c, n_local);
        // first shift of b is done in place; MPI_Isend and MPI_Irecv must not
        // share a buffer, so the blocking MPI_Sendrecv_replace is used here
        MPI_Sendrecv_replace(b, n_local*n_local, MPI_DOUBLE, nbrs[UP], 111, nbrs[DOWN], 111, grid_comm, &status);
    }

    if (stage > 0) {
        // shifting b values in all processes, overlapped with the multiplication
        memcpy(b_temp, b, n_local*n_local*sizeof(double));
        MPI_Bcast(a_temp, n_local*n_local, MPI_DOUBLE, (rowID + stage) % q, row_comm);
        // send the copy, so that b is free to receive the next block
        MPI_Isend(b_temp, n_local*n_local, MPI_DOUBLE, nbrs[UP], 111, grid_comm, &my_request1);
        MPI_Irecv(b, n_local*n_local, MPI_DOUBLE, nbrs[DOWN], 111, grid_comm, &my_request2);
        multiplyMatrix(a_temp, b_temp, c, n_local);
        MPI_Wait(&my_request2, &status);
        MPI_Wait(&my_request1, &status);
    }
}

Both methods seem to work, but as I said, I am new to MPI, so if you have any comments or suggestions, please share them.
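One way to gain more confidence than "seems to work" is to compare against a serial reference. The sketch below emulates only the stage schedule implied by the code above, without MPI: at stage s, grid position (i, j) multiplies A block (i, k) by B block (k, j) with k = (i + s) % q, which is what the (rowID + stage) % q broadcast root and the upward shift of b amount to. All names here (fox_serial, matmul_ref, fox_max_error) and the test data are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Serial emulation of the Fox stage schedule on full row-major n x n arrays. */
void fox_serial(const double *A, const double *B, double *C, int n, int q)
{
    assert(q > 0 && n > 0 && n % q == 0);
    int nl = n / q;  /* block dimension, n_local in the thread */
    for (size_t t = 0; t < (size_t)n * n; t++)
        C[t] = 0.0;
    for (int stage = 0; stage < q; stage++)
        for (int i = 0; i < q; i++) {
            int k = (i + stage) % q;  /* A block column used by grid row i */
            for (int j = 0; j < q; j++)
                for (int r = 0; r < nl; r++)
                    for (int m = 0; m < nl; m++) {
                        double a = A[(size_t)(i * nl + r) * n + (k * nl + m)];
                        for (int c = 0; c < nl; c++)
                            C[(size_t)(i * nl + r) * n + (j * nl + c)] +=
                                a * B[(size_t)(k * nl + m) * n + (j * nl + c)];
                    }
        }
}

/* Plain triple loop used as the reference result. */
void matmul_ref(const double *A, const double *B, double *C, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[(size_t)i * n + k] * B[(size_t)k * n + j];
            C[(size_t)i * n + j] = sum;
        }
}

/* Largest absolute difference between both results, or -1.0 if malloc fails.
   Small integer-valued inputs keep every sum exact in double precision. */
double fox_max_error(int n, int q)
{
    size_t len = (size_t)n * n;
    double *A = malloc(len * sizeof *A), *B = malloc(len * sizeof *B);
    double *C1 = malloc(len * sizeof *C1), *C2 = malloc(len * sizeof *C2);
    double worst = -1.0;
    if (A && B && C1 && C2) {
        for (size_t t = 0; t < len; t++) {
            A[t] = (double)(t % 7) - 3.0;
            B[t] = (double)(t % 5) + 1.0;
        }
        fox_serial(A, B, C1, n, q);
        matmul_ref(A, B, C2, n);
        worst = 0.0;
        for (size_t t = 0; t < len; t++) {
            double d = C1[t] - C2[t];
            if (d < 0.0) d = -d;
            if (d > worst) worst = d;
        }
    }
    free(A); free(B); free(C1); free(C2);
    return worst;
}
```

In an actual MPI run, a similar check could gather the c blocks on one rank and compare them with matmul_ref for a small n; that step is not shown here.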
