c++ - C++ and MPI: how do I write parts of my code in parallel?

Question

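I have been writing some code with the PETSc library, and now I am going to change parts of it to run in parallel. Most of what I want to parallelize is the matrix initialization and the parts where I generate and compute a large number of values. My problem is that whenever I run the code on more than one core, every part of the code runs as many times as the number of cores I use.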

This is just a simple example program I used to test PETSc and MPI:

#include <ctime>
#include <iostream>
#include <string>
#include <petscmat.h>

using namespace std;

int main(int argc, char** argv)
{
    time_t rawtime;
    time ( &rawtime );
    string sta = ctime (&rawtime);
    cout << "Solving began..." << endl;

PetscInitialize(&argc, &argv, 0, 0);

  Mat            A;        /* linear system matrix */
  PetscInt       i,j,Ii,J,Istart,Iend,m = 120000,n = 3,its;
  PetscErrorCode ierr;
  PetscBool      flg = PETSC_FALSE;
  PetscScalar    v;
#if defined(PETSC_USE_LOG)
  PetscLogStage  stage;
#endif

  /* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
         Compute the matrix and right-hand-side vector that define
         the linear system, Ax = b.
     - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
  /* 
     Create parallel matrix, specifying only its global dimensions.
     When using MatCreate(), the matrix format can be specified at
     runtime. Also, the parallel partitioning of the matrix is
     determined by PETSc at runtime.

     Performance tuning note:  For problems of substantial size,
     preallocation of matrix memory is crucial for attaining good 
     performance. See the matrix chapter of the users manual for details.
  */
  ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
  ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,m,n);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatMPIAIJSetPreallocation(A,5,PETSC_NULL,5,PETSC_NULL);CHKERRQ(ierr);
  ierr = MatSeqAIJSetPreallocation(A,5,PETSC_NULL);CHKERRQ(ierr);
  ierr = MatSetUp(A);CHKERRQ(ierr);

  /* 
     Currently, all PETSc parallel matrix formats are partitioned by
     contiguous chunks of rows across the processors.  Determine which
     rows of the matrix are locally owned. 
  */
  ierr = MatGetOwnershipRange(A,&Istart,&Iend);CHKERRQ(ierr);

  /* 
     Set matrix elements for the 2-D, five-point stencil in parallel.
      - Each processor needs to insert only elements that it owns
        locally (but any non-local elements will be sent to the
        appropriate processor during matrix assembly). 
      - Always specify global rows and columns of matrix entries.

     Note: this uses the less common natural ordering that orders first
     all the unknowns for x = h then for x = 2h etc; Hence you see J = Ii +- n
     instead of J = I +- m as you might expect. The more standard ordering
     would first do all variables for y = h, then y = 2h etc.

   */
PetscMPIInt    rank;        // processor rank
PetscMPIInt    size;        // size of communicator
MPI_Comm_rank(PETSC_COMM_WORLD,&rank);
MPI_Comm_size(PETSC_COMM_WORLD,&size);

cout << "Rank = " << rank << endl;
cout << "Size = " << size << endl;

cout << "Generating 2D-Array" << endl;

double temp2D[120000][3];
 for (Ii=Istart; Ii<Iend; Ii++) { 
    for(J=0; J<n;J++){
      temp2D[Ii][J] = 1;
    }
  }
  cout << "Processor " << rank << " set values : " << Istart << " - " << Iend << " into 2D-Array" << endl;

  v = -1.0;
  for (Ii=Istart; Ii<Iend; Ii++) { 
    for(J=0; J<n;J++){
       ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);
   }
  }
  cout << "Ii = " << Ii << " processor " << rank << " and it owns: " << Istart << " - " << Iend << endl;

  /* 
     Assemble matrix, using the 2-step process:
       MatAssemblyBegin(), MatAssemblyEnd()
     Computations can be done while messages are in transition
     by placing code between these two statements.
  */
  ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

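  /* PetscFinalize() pairs with PetscInitialize(); it also calls
     MPI_Finalize() when PETSc started MPI itself. */
  ierr = PetscFinalize();
  cout << "no more MPI" << endl;
  return ierr;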

}

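My real program consists of several different .cpp files. I initialize MPI in the main program and call a function in another .cpp file, where I implement the same kind of matrix filling, but every cout the program executes before filling the matrix is printed as many times as I have cores.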

I can run my test program as mpiexec -n 4 test and it runs successfully, but for some reason I have to run my real program as mpiexec -n 4 ./myprog.

The output of my test program is as follows:

Solving began...
Solving began...
Solving began...
Solving began...
Rank = 0
Size = 4
Generating 2D-Array
Processor 0 set values : 0 - 30000 into 2D-Array
Rank = 2
Size = 4
Generating 2D-Array
Processor 2 set values : 60000 - 90000 into 2D-Array
Rank = 3
Size = 4
Generating 2D-Array
Processor 3 set values : 90000 - 120000 into 2D-Array
Rank = 1
Size = 4
Generating 2D-Array
Processor 1 set values : 30000 - 60000 into 2D-Array
Ii = 30000 processor 0 and it owns: 0 - 30000
Ii = 90000 processor 2 and it owns: 60000 - 90000
Ii = 120000 processor 3 and it owns: 90000 - 120000
Ii = 60000 processor 1 and it owns: 30000 - 60000
no more MPI
no more MPI
no more MPI
no more MPI

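Edit after two comments: My goal is to run this on a small cluster with 20 nodes, each with 2 cores. Later it should run on a supercomputer, so MPI is definitely the way I need to go. I am currently testing this on two different machines, one with 1 processor/4 cores and the other with 4 processors/16 cores.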

score 5 · Accepted Answer

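MPI is an implementation of the SPMD/MPMD model (single program, multiple data / multiple programs, multiple data). An MPI job consists of concurrently running processes that exchange messages with one another in order to cooperate on solving a problem. You cannot run only part of the code in parallel. You can only have parts of the code that do not communicate with each other but still execute concurrently. You should use mpirun or mpiexec to launch your application in parallel mode.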

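If you only want to parallelize part of your code and can live with the restriction that it runs on a single machine only, then what you need is OpenMP and not MPI. Alternatively, you could use low-level POSIX threads programming, since according to the PETSc website it supports pthreads. OpenMP is typically built on top of pthreads, so using PETSc together with OpenMP might be possible.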
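As a rough illustration of the OpenMP route, here is a minimal sketch in which only one loop is parallelized while the rest of the program stays serial. The function name fill_and_sum is made up for illustration and does not come from the question's code, and nothing here involves PETSc.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the question): fills a vector with ones
// and returns the sum of its entries.
double fill_and_sum(std::size_t n) {
    std::vector<double> values(n);   // serial: runs once, on one thread
    double sum = 0.0;

    // Only this loop is shared among threads. Each thread handles a chunk
    // of the iterations and keeps a private partial sum; the reduction
    // clause adds the partial sums together when the loop ends.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < static_cast<long>(n); ++i) {
        values[i] = 1.0;
        sum += values[i];
    }

    return sum;                      // serial again
}
```

With GCC or Clang the pragma takes effect when compiling with -fopenmp. Without that flag it is ignored and the loop simply runs serially with the same result, which makes it easy to compare both builds.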

score 1 · Accepted Answer

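To add to Hristo's answer, MPI is built to run in a distributed fashion, i.e. as completely separate processes. They have to be separate because they are supposed to run on different physical machines. You can run multiple MPI processes on one machine, for example one per core. That is perfectly fine, but MPI does not have any tools to take advantage of that shared-memory context. In other words, you cannot have some MPI ranks (processes) work on a matrix that is owned by another MPI process, because you have no way of sharing that matrix.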

When you start x MPI processes, you get x copies of the exact same program running. You need code like this

if (rank == 0)
    do something
else
    do something else

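to make different processes do different things. Processes can communicate with each other by sending messages, but they all run the exact same binary. If you have no divergence in the code, you just get x copies of the same program, producing the same result x times.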
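To make the idea of divergence concrete, here is a small sketch of the per-process logic written as plain C++ functions, so it can be compiled without an MPI installation. In a real MPI program, rank and size would come from MPI_Comm_rank and MPI_Comm_size as in the question's code. The names owned_rows and report are hypothetical, and the block split shown is just one common choice that happens to reproduce the ranges in the question's output; it is not taken from PETSc's source.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper: half-open range [first, second) of the rows that
// process `rank` handles when `rows` rows are split into contiguous
// blocks across `size` processes.
std::pair<long, long> owned_rows(int rank, int size, long rows) {
    assert(size > 0 && rank >= 0 && rank < size);
    const long base = rows / size;    // rows every process gets
    const long extra = rows % size;   // the first `extra` ranks get one more
    const long first = rank * base + (rank < extra ? rank : extra);
    const long count = base + (rank < extra ? 1 : 0);
    return {first, first + count};
}

// Hypothetical helper: the text a single process would print.
std::string report(int rank, int size, long rows) {
    std::string out;
    if (rank == 0) {
        // Divergence: only rank 0 prints the banner, so it shows up once
        // no matter how many processes were started.
        out += "Solving began...\n";
    }
    // Every rank runs this line, but each one sees a different range.
    const std::pair<long, long> range = owned_rows(rank, size, rows);
    out += "Processor " + std::to_string(rank) + " owns rows " +
           std::to_string(range.first) + " - " +
           std::to_string(range.second) + "\n";
    return out;
}
```

Calling report(r, 4, 120000) for r = 0..3 yields the banner once, from rank 0, followed by one ownership line per rank. Unguarded cout statements, by contrast, run in every process, which is why the question's banner appears four times.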
