c++ - Why is MPI_Bcast so much slower than MPI_Reduce?

Question

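With MPI, we can do a broadcast to send an array to many nodes, or a reduce to combine arrays from many nodes onto one node.

I would guess that the fastest way to implement these is with a binary tree, where each node either sends to two nodes (bcast) or reduces over two nodes (reduce), which gives a time logarithmic in the number of nodes.

There doesn't seem to be any reason why a broadcast should be particularly slower than a reduce?

I ran the following test program on a cluster of 4 computers, each with 12 cores. Strangely, the broadcast is much slower than the reduce. Why? Is there anything I can do about it?

The results are: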

inited mpi: 0.472943 seconds
N: 200000 1.52588MB
P = 48
did alloc: 0.000147641 seconds
bcast: 0.349956 seconds
reduce: 0.0478526 seconds
bcast: 0.369131 seconds
reduce: 0.0472673 seconds
bcast: 0.516606 seconds
reduce: 0.0448555 seconds

The code is:

#include <iostream>
#include <string>
#include <cstdlib>
#include <cstdio>
#include <ctime>
#include <sys/time.h>
using namespace std;

#include <mpi.h>

// Simple stopwatch based on the monotonic clock.
class NanoTimer {
public:
    struct timespec start;

    NanoTimer() {
        clock_gettime(CLOCK_MONOTONIC, &start);
    }
    // Returns the seconds elapsed since the last call (or construction) and restarts the timer.
    double elapsedSeconds() {
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        double time = (now.tv_sec - start.tv_sec) + (double) (now.tv_nsec - start.tv_nsec) * 1e-9;
        start = now;
        return time;
    }
    // Prints the elapsed time with a label and restarts the timer.
    void toc(const string &label) {
        double elapsed = elapsedSeconds();
        cout << label << ": " << elapsed << " seconds" << endl;
    }
};

int main( int argc, char *argv[] ) {
    if( argc < 2 ) {
        cout << "Usage: " << argv[0] << " [N]" << endl;
        return -1;
    }
    int N = atoi( argv[1] );

    NanoTimer timer;

    MPI_Init( &argc, &argv );
    int p, P;
    MPI_Comm_rank( MPI_COMM_WORLD, &p );
    MPI_Comm_size( MPI_COMM_WORLD, &P );
    MPI_Barrier(MPI_COMM_WORLD);
    if( p == 0 ) timer.toc("inited mpi");
    if( p == 0 ) {
        cout << "N: " << N << " " << (N*sizeof(double)/1024.0/1024) << "MB" << endl;
        cout << "P = " << P << endl;
    }
    double *src = new double[N];
    double *dst = new double[N];
    MPI_Barrier(MPI_COMM_WORLD);
    if( p == 0 ) timer.toc("did alloc");

    for( int it = 0; it < 3; it++ ) {    
        MPI_Bcast( src, N, MPI_DOUBLE, 0, MPI_COMM_WORLD );    
        MPI_Barrier(MPI_COMM_WORLD);
        if( p == 0 ) timer.toc("bcast");

        MPI_Reduce( src, dst, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
        MPI_Barrier(MPI_COMM_WORLD);
        if( p == 0 ) timer.toc("reduce");
    }

    delete[] src;
    delete[] dst;

    MPI_Finalize();
    return 0;
}

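The cluster nodes run 64-bit Ubuntu 12.04. I tried both OpenMPI and MPICH2 and got very similar results. The network is gigabit Ethernet, which is not the fastest, but what I'm most curious about is not the absolute speed so much as the disparity between broadcast and reduce.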

score 2 · Accepted Answer

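I don't think this quite answers your question, but I hope it provides some insight.

MPI is just a standard. It doesn't define how each function should be implemented. The performance of particular operations in MPI (in your case MPI_Bcast and MPI_Reduce) therefore depends strictly on the implementation you are using. It is possible to design a broadcast out of point-to-point communication calls that performs better than the provided MPI_Bcast.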

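In any case, you have to consider what each of these functions is doing. Broadcast takes information from one process and sends it to all other processes; reduce takes information from every process and reduces it onto one process. According to the (most recent) standard, MPI_Bcast is considered a One-to-All collective operation and MPI_Reduce is considered an All-to-One collective operation. So your intuition about a binary tree for MPI_Reduce probably holds in both implementations. However, it is most likely not what MPI_Bcast does. MPI_Bcast might be implemented using non-blocking point-to-point communication (sending from the process that holds the information to all other processes), with a wait-all after the communication. In any case, to figure out how both functions work, I would suggest digging into the source code of the OpenMPI and MPICH2 implementations.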

score 0 · Accepted Answer

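As Hristo mentioned, it depends on the size of your buffer. If you are sending a large buffer, the broadcast has to do a lot of large sends, whereas the reduce performs some local operation on the buffer to reduce it to a single value and then transmits only that value instead of the whole buffer.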
