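I am trying to generate code that uses MPI and behaves roughly like a program with dependencies. If I run with a number of processors (e.g. mpirun -np X) where X is greater than the number of tasks I am trying to model (i.e. the number of cases in my switch statement), everything works fine. My program model has a list of tasks, an execution time for each task, and a set of dependencies between tasks. The MPI code I generate looks like this (a real instance would have 50 to 600 tasks, i.e. cases):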
#include <boost/mpi.hpp>
#include <ctime>
#include <iostream>

namespace mpi = boost::mpi;

int main(int argc, char* argv[]) {
mpi::environment env(argc, argv);
mpi::communicator world;
long execution_times [4] = {9, 4, 3, 6};
switch (world.rank()) {
case 1: {
std::cout << "1: Awake" << std::endl;
mpi::request req[1];
req[0] = world.irecv(0, 0);
mpi::wait_all(req, req + 1);
std::cout << "1: Recv notice from pred 0" << std::endl;
time_t start;
start = time(NULL);
std::cout << "1: Started compute" << std::endl;
while ((time(NULL)-start) < execution_times[1]);
std::cout << "1: Finished compute in " << (time(NULL)-start) << std::endl;
mpi::request sreq[3];
sreq[0] = world.isend(5, 0);
sreq[1] = world.isend(23, 0);
sreq[2] = world.isend(42, 0);
mpi::wait_all(sreq, sreq + 3);
std::cout << "1: Sent notice to succ 5" << std::endl;
std::cout << "1: Sent notice to succ 23" << std::endl;
std::cout << "1: Sent notice to succ 42" << std::endl;
break; }
// Other cases excluded for brevity...
}
return 0;
}
I can compile it fine with g++ -L/usr/local/lib -lmpi -lmpi_cxx -lboost_serialization -lboost_mpi test.cpp
and run it with mpirun -np 4 a.out
However, whenever the number of cases exceeds the number of processors, I always get an exception, e.g.
hamiltont$ mpirun -np 2 a.out
0: Awake
0: Started compute
0: Finished compute in 0
1: Awake
1: Recv notice from pred 0
1: Started compute
libc++abi.dylib: terminate called throwing an exception
hamiltont$ mpirun -np 3 a.out
0: Awake
0: Started compute
0: Finished compute in 0
1: Awake
1: Recv notice from pred 0
1: Started compute
2: Awake
2: Recv notice from pred 0
2: Started compute
libc++abi.dylib: terminate called throwing an exception
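Note that increasing the number of processors from 2 to 3 lets one more case execute successfully. I suspect there is something about MPI that I do not understand.
The full exception: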
libc++abi.dylib: terminate called throwing an exception
[MacBook-Pro:47495] *** Process received signal ***
[MacBook-Pro:47495] Signal: Abort trap: 6 (6)
[MacBook-Pro:47495] Signal code: (0)
[MacBook-Pro:47495] [ 0] 2 libsystem_c.dylib 0x00007fff91e9b8ea _sigtramp + 26
[MacBook-Pro:47495] [ 1] 3 ??? 0x0000000000000000 0x0 + 0
[MacBook-Pro:47495] [ 2] 4 libc++abi.dylib 0x00007fff8f29ca17 abort_message + 257
[MacBook-Pro:47495] [ 3] 5 libc++abi.dylib 0x00007fff8f29a3c6 _ZL17default_terminatev + 28
[MacBook-Pro:47495] [ 4] 6 libobjc.A.dylib 0x00007fff94857887 _ZL15_objc_terminatev + 111
[MacBook-Pro:47495] [ 5] 7 libc++abi.dylib 0x00007fff8f29a3f5 _ZL19safe_handler_callerPFvvE + 8
[MacBook-Pro:47495] [ 6] 8 libc++abi.dylib 0x00007fff8f29a450 __cxa_bad_typeid + 0
[MacBook-Pro:47495] [ 7] 9 libc++abi.dylib 0x00007fff8f29b5b7 _ZL23__gxx_exception_cleanup19_Unwind_Reason_CodeP17_Unwind_Exception + 0
[MacBook-Pro:47495] [ 8] 10 a.out 0x00000001086a818e _ZN5boost15throw_exceptionINS_3mpi9exceptionEEEvRKT_ + 158
[MacBook-Pro:47495] [ 9] 11 libboost_mpi.dylib 0x0000000108a061e7 _ZNK5boost3mpi12communicator5isendEii + 111
[MacBook-Pro:47495] [10] 12 a.out 0x0000000108676fc9 main + 1257
[MacBook-Pro:47495] [11] 13 libdyld.dylib 0x00007fff911837e1 start + 0
[MacBook-Pro:47495] [12] 14 ??? 0x0000000000000001 0x0 + 1
[MacBook-Pro:47495] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 47495 on node MacBook-Pro.local exited on signal 6 (Abort trap: 6).
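
The trace above ends in boost::mpi::communicator::isend, called from main on rank 0, so one thing worth checking is whether every destination rank actually exists for the given -np value. Ranks run from 0 to world.size() - 1, and a send to a rank outside that range appears to be what raises the exception here. Below is a minimal sketch of such a check. The helpers is_valid_rank and unreachable_successors are hypothetical names introduced for illustration, not part of Boost.MPI, and world_size stands in for the value of world.size().

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not a Boost.MPI function): a destination is usable
// only if it names an existing rank, i.e. 0 <= dest < world_size.
bool is_valid_rank(int dest, int world_size) {
    assert(world_size > 0);  // a communicator always has at least one rank
    return dest >= 0 && dest < world_size;
}

// Hypothetical helper: returns the successor ranks that do not exist when
// the program is launched with `world_size` processes.
std::vector<int> unreachable_successors(const std::vector<int>& successors,
                                        int world_size) {
    std::vector<int> missing;
    for (int dest : successors) {
        if (!is_valid_rank(dest, world_size)) {
            missing.push_back(dest);
        }
    }
    return missing;
}
```

If rank 0's successors are 1 and 2, as the -np 3 output suggests, then unreachable_successors({1, 2}, 2) reports rank 2 as missing, which would match the failure under mpirun -np 2. In the generated code, each case could run this check with world.size() before calling isend.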