我对使用 Boost MPI 比较陌生。我已经安装了库,代码编译了,但是我遇到了一个非常奇怪的错误——从节点接收到的一些整数数据不是主节点发送的。到底是怎么回事?
我正在使用 boost 版本 1.42.0,使用 mpic++ 编译代码(它在一个集群上包装 g++,在另一个集群上包装 icpc)。下面是一个简化的示例,包括输出。
代码:
#include <iostream>
#include <boost/mpi.hpp>
using namespace std;
namespace mpi = boost::mpi;
class Solution
{
public:
Solution() :
solution_num(num_solutions++)
{
// Master node's constructor
}
Solution(int solutionNum) :
solution_num(solutionNum)
{
// Slave nodes' constructor.
}
int solutionNum() const
{
return solution_num;
}
private:
static int num_solutions;
int solution_num;
};
int Solution::num_solutions = 0;
int main(int argc, char* argv[])
{
// Initialization of MPI
mpi::environment env(argc, argv);
mpi::communicator world;
if (world.rank() == 0)
{
// Create solutions
int numSolutions = world.size() - 1; // One solution per slave
vector<Solution*> solutions(numSolutions);
for (int sol = 0; sol < numSolutions; ++sol)
{
solutions[sol] = new Solution;
}
// Send solutions
for (int sol = 0; sol < numSolutions; ++sol)
{
world.isend(sol + 1, 0, false); // Tells the slave to expect work
cout << "Sending solution no. " << solutions[sol]->solutionNum() << " to node " << sol + 1 << endl;
world.isend(sol + 1, 1, solutions[sol]->solutionNum());
}
// Retrieve values (solution numbers squared)
vector<double> values(numSolutions, 0);
for (int i = 0; i < numSolutions; ++i)
{
// Get values for each solution
double value = 0;
mpi::status status = world.recv(mpi::any_source, 2, value);
int source = status.source();
int sol = source - 1;
values[sol] = value;
}
for (int i = 1; i <= numSolutions; ++i)
{
world.isend(i, 0, true); // Tells the slave to finish
}
// Output the solutions numbers and their squares
for (int i = 0; i < numSolutions; ++i)
{
cout << solutions[i]->solutionNum() << ", " << values[i] << endl;
delete solutions[i];
}
}
else
{
// Slave nodes merely square the solution number
bool finished;
mpi::status status = world.recv(0, 0, finished);
while (!finished)
{
int solNum;
world.recv(0, 1, solNum);
cout << "Node " << world.rank() << " receiving solution no. " << solNum << endl;
Solution solution(solNum);
double value = static_cast<double>(solNum * solNum);
world.send(0, 2, value);
status = world.recv(0, 0, finished);
}
cout << "Node " << world.rank() << " finished." << endl;
}
return EXIT_SUCCESS;
}
在 21 个节点(1 个主节点,20 个从节点)上运行它会产生:
Sending solution no. 0 to node 1
Sending solution no. 1 to node 2
Sending solution no. 2 to node 3
Sending solution no. 3 to node 4
Sending solution no. 4 to node 5
Sending solution no. 5 to node 6
Sending solution no. 6 to node 7
Sending solution no. 7 to node 8
Sending solution no. 8 to node 9
Sending solution no. 9 to node 10
Sending solution no. 10 to node 11
Sending solution no. 11 to node 12
Sending solution no. 12 to node 13
Sending solution no. 13 to node 14
Sending solution no. 14 to node 15
Sending solution no. 15 to node 16
Sending solution no. 16 to node 17
Sending solution no. 17 to node 18
Sending solution no. 18 to node 19
Sending solution no. 19 to node 20
Node 1 receiving solution no. 0
Node 2 receiving solution no. 1
Node 12 receiving solution no. 19
Node 3 receiving solution no. 19
Node 15 receiving solution no. 19
Node 13 receiving solution no. 19
Node 4 receiving solution no. 19
Node 9 receiving solution no. 19
Node 10 receiving solution no. 19
Node 14 receiving solution no. 19
Node 6 receiving solution no. 19
Node 5 receiving solution no. 19
Node 11 receiving solution no. 19
Node 8 receiving solution no. 19
Node 16 receiving solution no. 19
Node 19 receiving solution no. 19
Node 20 receiving solution no. 19
Node 1 finished.
Node 2 finished.
Node 7 receiving solution no. 19
0, 0
1, 1
2, 361
3, 361
4, 361
5, 361
6, 361
7, 361
8, 361
9, 361
10, 361
11, 361
12, 361
13, 361
14, 361
15, 361
16, 361
17, 361
18, 361
19, 361
Node 6 finished.
Node 3 finished.
Node 17 receiving solution no. 19
Node 17 finished.
Node 10 finished.
Node 12 finished.
Node 8 finished.
Node 4 finished.
Node 15 finished.
Node 18 receiving solution no. 19
Node 18 finished.
Node 11 finished.
Node 13 finished.
Node 20 finished.
Node 16 finished.
Node 9 finished.
Node 19 finished.
Node 7 finished.
Node 5 finished.
Node 14 finished.
因此,当主节点发送 0 到节点 1、1 到节点 2、2 到节点 3 等时,大多数从节点(出于某种原因)接收到数字 19。因此,与其生成从 0 到 19 的数字的平方,我们得到 0 平方、1 平方和 19 平方 18 次!
提前感谢任何可以解释这一点的人。
艾伦