c++ - C++ MPI with multiple nodes: reduce at the node level first, then reduce to the head node

Question

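I am using a Windows HPC cluster with 12 nodes (24 cores each) to run a C++ MPI program (using Boost MPI). I ran it once with the MPI reduce and once with the MPI reduce commented out (for speed testing only). The run times were 01:17:23 and 01:03:49. It seems to me that the MPI reduce takes a large share of the time. I think it might be worth trying to reduce at the node level first and then reduce to the head node to improve performance.

Below is a simple example for testing purposes. Suppose there are 4 compute nodes with 2 cores each. I want to first reduce within each node using MPI, and after that reduce to the head node. I am not very familiar with MPI, and the program below crashes.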

#include <iostream>
#include <boost/mpi.hpp>
namespace mpi = boost::mpi;
using namespace std;

int main()
{
  mpi::environment env;
  mpi::communicator world;

  int i = world.rank();


  boost::mpi::communicator local = world.split(world.rank()/2); // total 8 cores, divide in 4 groups
  boost::mpi::communicator heads = world.split(world.rank()%4);

  int res = 0;

  boost::mpi::reduce(local, i, res, std::plus<int>(), 0);
  if(world.rank()%2==0)
  cout<<res<<endl;
  boost::mpi::reduce(heads, res, res, std::plus<int>(), 0);

  if(world.rank()==0)
      cout<<res<<endl;

  return 0;
}

The output is garbled, like this:

Z
h
h
h
h
a
a
a
a
n
n
n
n
g
g
g
g
\
\
\
\
b
b
b
b
o
o
o
o
o
o
o
o
s
...
...
...

The error message is:

Test.exe ended prematurely and may have crashed. exit code 3

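I suspect I am doing something wrong in the group split and/or the reduce, but I could not figure it out after several attempts. What should I change to make this work? Thanks.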

score 1 · Accepted Answer

The reason for the crash is that you pass the same variable to MPI twice in the following line:

boost::mpi::reduce(heads, res, res, std::plus<int>(), 0);

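This is not well documented in Boost.MPI, but Boost takes these arguments by reference and passes the corresponding pointers to MPI. MPI in general forbids passing the same buffer twice to the same call. To be precise, an output buffer passed to an MPI function must not alias (overlap) any other buffer passed in that call.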

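You can fix this by creating a copy of res.

I also think you probably want to restrict the second reduce to the node heads, using local.rank() == 0.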
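The two suggestions above (a separate output variable instead of passing res twice, and letting only the node heads take part in the second reduce) could be combined roughly as follows. This is a minimal sketch, not a tested drop-in replacement. It assumes ranks are placed node by node with 2 cores per node, as in the question. The helper names `node_color` and `expected_rank_sum` and the `USE_BOOST_MPI` macro are hypothetical, introduced only so that the pure helpers can be compiled without an MPI installation.

```cpp
#include <cassert>
#include <functional>
#include <iostream>

#ifdef USE_BOOST_MPI  // hypothetical macro: define it when building against Boost.MPI
#include <boost/mpi.hpp>
#endif

// Assumption: ranks are numbered node by node, so ranks
// [k * cores_per_node, (k + 1) * cores_per_node) all run on node k.
int node_color(int world_rank, int cores_per_node) {
  return world_rank / cores_per_node;
}

// Sum of the ranks 0 .. world_size - 1, used only as a sanity check.
int expected_rank_sum(int world_size) {
  return world_size * (world_size - 1) / 2;
}

#ifdef USE_BOOST_MPI
namespace mpi = boost::mpi;

int main() {
  mpi::environment env;
  mpi::communicator world;
  const int cores_per_node = 2;  // assumption taken from the question's test setup

  // One communicator per node.
  mpi::communicator local =
      world.split(node_color(world.rank(), cores_per_node));
  const bool is_head = (local.rank() == 0);

  // Every rank has to call split; node heads end up together in color 0.
  mpi::communicator heads = world.split(is_head ? 0 : 1);

  // Step 1: reduce within each node. Input and output are distinct variables.
  const int my_value = world.rank();
  int node_sum = 0;
  mpi::reduce(local, my_value, node_sum, std::plus<int>(), 0);

  // Step 2: only node heads reduce across nodes, again into a separate variable.
  if (is_head) {
    int total = 0;
    mpi::reduce(heads, node_sum, total, std::plus<int>(), 0);
    if (heads.rank() == 0) {
      assert(total == expected_rank_sum(world.size()));
      std::cout << "total = " << total << std::endl;
    }
  }
  return 0;
}
#endif
```

Non-head ranks still receive a communicator from the second split (color 1), but they never use it. Whether the two-step version is actually faster than a single reduce over world is a separate question that would need measuring.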

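Also, to reiterate the comment: I doubt you will gain anything from reimplementing the reduction. Trying to optimize a performance problem whose bottleneck is not fully understood is generally a bad idea.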
