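For the past few days I have been trying to write a fault-tolerant application in C using MPI. I am trying to learn how to attach an error handler to the MPI_COMM_WORLD communicator so that, if a node goes down (possibly because of a crash) and exits without calling MPI_Finalize(), the program can still recover from the situation and continue computing.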
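The problem I have run into so far is that, after I attach the error handler function to the communicator and then cause a node to crash, MPI does not call the error handler and instead forces all of the processes to exit.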
I thought the problem might be in my own application, so I looked for sample code online and tried running it, but the result is the same. The sample code I am currently trying to run is below. (I got it from https://www.google.co.uk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CC4QFjAA&url=http%3A%2F%2Fwww.shodor.org%2Fmedia%2Fcontent%2F%2Fpetascale%2Fmaterials%2FdistributedMemory%2Fpresentations%2FMPI_Error_Example.pdf&ei=jq6KUv-BBcO30QW1oYGABg&usg=AFQjCNFa5L_Q6Irg3VrJ3fsQBIyqjBlSgA&sig2=8An4SqBvhCACx5YLwBmROA; apologies for it being a PDF, but I didn't write it, so I have pasted the same code below):
/* Template for creating a custom error handler for MPI and a simple program
to demonstrate its use. How much additional information you can obtain
is determined by the MPI binding in use at build/run time.
To illustrate that the program works correctly use -np 2 through -np 4.
To illustrate an MPI error set victim_mpi = 5 and use -np 6.
To illustrate a system error set victim_os = 5 and use -np 6.
2004-10-10 charliep created
2006-07-15 joshh updated for the MPI2 standard
2007-02-20 mccoyjo adapted for folding@clusters
2010-05-26 charliep cleaned-up/annotated for the petascale workshop
*/
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include "mpi.h"
void ccg_mpi_error_handler(MPI_Comm *, int *, ...);
int main(int argc, char *argv[]) {
    MPI_Status status;
    MPI_Errhandler errhandler;
    int number, rank, size, next, from;
    const int tag = 201;
    const int server = 0;
    const int victim_mpi = 5;
    const int victim_os = 6;
    MPI_Comm bogus_communicator; /* never initialized; used below to provoke an MPI error */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Register the custom handler on MPI_COMM_WORLD. */
    MPI_Comm_create_errhandler(&ccg_mpi_error_handler, &errhandler);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, errhandler);

    /* Ring neighbours: send to next, receive from from. */
    next = (rank + 1) % size;
    from = (rank + size - 1) % size;

    /* The server starts the token going around the ring. */
    if (rank == server) {
        printf("Enter the number of times to go around the ring: ");
        fflush(stdout);
        scanf("%d", &number);
        --number;
        printf("Process %d sending %d to %d\n", rank, number, next);
        MPI_Send(&number, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
    }

    while (true) {
        MPI_Recv(&number, 1, MPI_INT, from, tag, MPI_COMM_WORLD, &status);
        printf("Process %d received %d\n", rank, number);

        if (rank == server) {
            number--;
            printf("Process 0 decremented number\n");
        }

        /* System error: write far outside the array on purpose. */
        if (rank == victim_os) {
            int a[10];
            printf("Process %d about to segfault\n", rank);
            a[15565656] = 56;
        }

        /* MPI error: send on the uninitialized communicator on purpose. */
        if (rank == victim_mpi) {
            printf("Process %d about to go south\n", rank);
            printf("Process %d sending %d to %d\n", rank, number, next);
            MPI_Send(&number, 1, MPI_INT, next, tag, bogus_communicator);
        } else {
            printf("Process %d sending %d to %d\n", rank, number, next);
            MPI_Send(&number, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        }

        if (number == 0) {
            printf("Process %d exiting\n", rank);
            break;
        }
    }

    /* The server absorbs the last message still travelling around the ring. */
    if (rank == server)
        MPI_Recv(&number, 1, MPI_INT, from, tag, MPI_COMM_WORLD, &status);

    MPI_Finalize();
    return 0;
}

/* Custom error handler: print the error code and its description, then exit. */
void ccg_mpi_error_handler(MPI_Comm *communicator, int *error_code, ...) {
    char error_string[MPI_MAX_ERROR_STRING];
    int error_string_length;

    printf("ccg_mpi_error_handler: entry\n");
    printf("ccg_mpi_error_handler: error_code = %d\n", *error_code);
    MPI_Error_string(*error_code, error_string, &error_string_length);
    error_string[error_string_length] = '\0';
    printf("ccg_mpi_error_handler: error_string = %s\n", error_string);
    printf("ccg_mpi_error_handler: exit\n");
    exit(1);
}
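To make the ring addressing in the code above easier to check by hand, here is a minimal sketch of the same neighbour arithmetic pulled out into standalone helpers. The names ring_next, ring_prev and print_ring are my own and are not part of the original example, and no MPI calls are involved; the asserts only encode the assumption that rank lies in [0, size).

```c
#include <assert.h>
#include <stdio.h>

/* Rank that receives the token from `rank` in a ring of `size` processes. */
int ring_next(int rank, int size) {
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank + 1) % size;
}

/* Rank that sends the token to `rank`. Adding `size` before subtracting
   keeps the left operand of % non-negative when rank is 0. */
int ring_prev(int rank, int size) {
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank + size - 1) % size;
}

/* Print both neighbours of every rank, e.g. print_ring(6) for an -np 6 run. */
void print_ring(int size) {
    for (int rank = 0; rank < size; ++rank)
        printf("rank %d: from %d, next %d\n",
               rank, ring_prev(rank, size), ring_next(rank, size));
}
```

With six processes, ring_next(5, 6) is 0 and ring_prev(0, 6) is 5, so the last rank hands the token back to the server, which is why rank 5 is the one that sends to rank 0.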
The program implements a simple token ring, and if I give it the parameters described in the comment, I get something like this:
>>>>>>mpirun -np 6 example.exe
Enter the number of times to go around the ring: 6
Process 1 received 5
Process 1 sending 5 to 2
Process 2 received 5
Process 2 sending 5 to 3
Process 3 received 5
Process 3 sending 5 to 4
Process 4 received 5
Process 4 sending 5 to 5
Process 5 received 5
Process 5 about to go south
Process 5 sending 5 to 0
Process 0 sending 5 to 1
[HP-ENVY-dv6-Notebook-PC:09480] *** Process received signal ***
[HP-ENVY-dv6-Notebook-PC:09480] Signal: Segmentation fault (11)
[HP-ENVY-dv6-Notebook-PC:09480] Signal code: Address not mapped (1)
[HP-ENVY-dv6-Notebook-PC:09480] Failing at address: 0xf0b397
[HP-ENVY-dv6-Notebook-PC:09480] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7fc0ec688cb0]
[HP-ENVY-dv6-Notebook-PC:09480] [ 1] /usr/lib/libmpi.so.0(PMPI_Send+0x74) [0x7fc0ec8f3704]
[HP-ENVY-dv6-Notebook-PC:09480] [ 2] example.exe(main+0x23f) [0x400e63]
[HP-ENVY-dv6-Notebook-PC:09480] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7fc0ec2da76d]
[HP-ENVY-dv6-Notebook-PC:09480] [ 4] example.exe() [0x400b69]
[HP-ENVY-dv6-Notebook-PC:09480] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 5 with PID 9480 on node andres-HP-ENVY-dv6-Notebook-PC exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
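Clearly, none of the printf() calls in ccg_mpi_error_handler() show up in the output, so I assume the handler was never called at all. I'm not sure whether it helps, but I am running Ubuntu Linux 12.04 and I installed MPI using apt-get. The command I used to compile the program is: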
mpicc err_example.c -o example.exe
Also, when I run mpicc -v I get the following:
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.6/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.6.3-1ubuntu5' --with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.6 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.6 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --disable-werror --with-arch-32=i686 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
Any help is much appreciated. Thanks!