
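My MPI code deadlocks when I run this simple code on 512 processes on a cluster. I am nowhere near the memory limit. If I increase the number of processes to 2048, which is far too many for this problem, the code runs again. The deadlock occurs on the line containing MPI_File_write_all.

Any suggestions?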

int count = imax*jmax*kmax;

// CREATE THE SUBARRAY
MPI_Datatype subarray;
int totsize [3] = {kmax, jtot, itot};
int subsize [3] = {kmax, jmax, imax};
int substart[3] = {0, mpicoordy*jmax, mpicoordx*imax};
MPI_Type_create_subarray(3, totsize, subsize, substart, MPI_ORDER_C, MPI_DOUBLE, &subarray);
MPI_Type_commit(&subarray);

// SET THE VALUE OF THE GRID EQUAL TO THE PROCESS ID FOR CHECKING
if(mpiid == 0) std::printf("Setting the value of the array\n");
for(int i=0; i<count; i++)
  u[i] = (double)mpiid;

// WRITE THE FULL GRID USING MPI-IO
if(mpiid == 0) std::printf("Write the full array to disk\n");
char filename[] = "u.dump";
MPI_File fh;
if(MPI_File_open(commxy, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY | MPI_MODE_EXCL, MPI_INFO_NULL, &fh))
  return 1;

// select noncontiguous part of 3d array to store the selected data
MPI_Offset fileoff = 0; // the offset within the file (header size)
char name[] = "native";

if(MPI_File_set_view(fh, fileoff, MPI_DOUBLE, subarray, name, MPI_INFO_NULL))
  return 1;

if(MPI_File_write_all(fh, u, count, MPI_DOUBLE, MPI_STATUS_IGNORE))
  return 1;

if(MPI_File_close(&fh))
  return 1;

1 Answer


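After a quick check, your code looks correct. I suggest letting your MPI-IO library help tell you what is going wrong: instead of just returning on an error, why not at least display it? Here is some code that might help: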


static void handle_error(int errcode, const char *str)
{
        char msg[MPI_MAX_ERROR_STRING];
        int resultlen;
        MPI_Error_string(errcode, msg, &resultlen);
        fprintf(stderr, "%s: %s\n", str, msg);
        MPI_Abort(MPI_COMM_WORLD, 1);
}

Is MPI_SUCCESS guaranteed to be 0? I would rather see


 errcode = MPI_File_routine();
 if (errcode != MPI_SUCCESS) handle_error(errcode, "MPI_File_open(1)");

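Put that in, and if you are doing something tricky, such as setting a file view whose offsets are not monotonically non-decreasing, the error string might indicate what went wrong.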
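To make "monotonically non-decreasing offsets" concrete, here is a minimal sketch that needs no MPI. It lists the file offsets, in units of the etype, that a C-ordered 3D subarray like the one in the question would select, and checks that they never decrease. The helper names `subarray_offsets` and `is_non_decreasing` are made up for illustration; the sketch only models the row-major layout rule of MPI_ORDER_C and is not part of any MPI library.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an MPI call): list the element offsets, in units
// of the etype, covered by a 3D C-ordered subarray, in storage order.
// Index 0 is the slowest-varying dimension, as with MPI_ORDER_C.
std::vector<long long> subarray_offsets(const std::array<int, 3>& totsize,
                                        const std::array<int, 3>& subsize,
                                        const std::array<int, 3>& substart)
{
    std::vector<long long> offsets;
    for (int k = 0; k < subsize[0]; ++k)
        for (int j = 0; j < subsize[1]; ++j)
            for (int i = 0; i < subsize[2]; ++i) {
                // Global indices of this element in the full array.
                const long long gk = substart[0] + k;
                const long long gj = substart[1] + j;
                const long long gi = substart[2] + i;
                // Row-major (C order) linear offset in the full array.
                offsets.push_back((gk * totsize[1] + gj) * totsize[2] + gi);
            }
    return offsets;
}

// Hypothetical helper: true if no offset is smaller than the one before it.
bool is_non_decreasing(const std::vector<long long>& v)
{
    for (std::size_t n = 1; n < v.size(); ++n)
        if (v[n] < v[n - 1])
            return false;
    return true;
}
```

Calling `subarray_offsets` with the `totsize`, `subsize` and `substart` values of a given rank and passing the result to `is_non_decreasing` gives a quick offline sanity check of the layout. A plain C-ordered subarray passes this check, so a failure would point at a hand-built filetype instead.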

Answered 2012-09-15T19:06:30.450