
I have this plain serial code (N, the loop indices, and the Max() helper are declared in the elided part):

double eps;
double A[N][N][N];

    ...


        for(i=1; i<=N-2; i++)
            for(j=1; j<=N-2; j++)
                for(k=1; k<=N-2; k++)
                {
                    A[i][j][k] = (A[i-1][j][k]+A[i+1][j][k])/2.;
                }
        for(i=1; i<=N-2; i++)
            for(j=1; j<=N-2; j++)
                for(k=1; k<=N-2; k++)
                {
                    A[i][j][k] = (A[i][j-1][k]+A[i][j+1][k])/2.;
                }
        for(i=1; i<=N-2; i++)
            for(j=1; j<=N-2; j++)
                for(k=1; k<=N-2; k++)
                {
                    double e;
                    e=A[i][j][k];
                    A[i][j][k] = (A[i][j][k-1]+A[i][j][k+1])/2.;
                    eps=Max(eps,fabs(e-A[i][j][k]));
                }

And I need to make it parallel using MPI.

OK, I understand what to do with eps: it is a global variable that I need to compute everywhere. So I create a local variable on each node, compute it there, and return the result from each node, or combine the results with a reduction.
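
Roughly what I have in mind for eps, as a minimal sketch (MPI_Allreduce with MPI_MAX, so every rank ends up with the global maximum; the helper name reduce_eps is mine, not from the code above):

    #include <mpi.h>

    /* Sketch: each rank passes its own local maximum and receives the
     * global one back; MPI_MAX does the comparison across ranks. */
    double reduce_eps(double local_eps)
    {
        double global_eps;
        MPI_Allreduce(&local_eps, &global_eps, 1, MPI_DOUBLE, MPI_MAX,
                      MPI_COMM_WORLD);
        return global_eps;
    }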

But what to do with the matrix A? It must be shared by every node. How do I synchronize each triple-nested for construction? (As you can see, the current element A[i][j][k] is computed from its neighbours: left and right, A[i-1][j][k] and A[i+1][j][k]; or top and bottom, A[i][j-1][k] and A[i][j+1][k]; or front and back, A[i][j][k-1] and A[i][j][k+1].)

Thank you!

First edit:

My first idea is to reorder the for constructions so that the dependent index is innermost, which minimizes the dependency, like this:

        for(j=1; j<=N-2; j++)
            for(k=1; k<=N-2; k++)
//MPI here: send each processor the (j,k) coordinates of the vector to compute in the next statement
                for(i=1; i<=N-2; i++)
                {
                    A[i][j][k] = (A[i-1][j][k]+A[i+1][j][k])/2.;
                }

and so on:

        for(i=1; i<=N-2; i++)
            for(k=1; k<=N-2; k++)
                for(j=1; j<=N-2; j++)
//here (i,k) are the free dimensions, the dependency is only on j; send the (i,k) vectors to the processors
                {
                    A[i][j][k] = (A[i][j-1][k]+A[i][j+1][k])/2.;
                }

        for(i=1; i<=N-2; i++)
            for(j=1; j<=N-2; j++)
                for(k=1; k<=N-2; k++)
//dependency only on k, (i,j) are free; send them to the processors
                {
                    double e;
                    e=A[i][j][k];
                    A[i][j][k] = (A[i][j][k-1]+A[i][j][k+1])/2.;
                    eps=Max(eps,fabs(e-A[i][j][k]));
                }
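
To make this concrete, here is a minimal sketch of how the third reordered sweep could be distributed, under the simplifying (and memory-hungry) assumption that every rank keeps a full replicated copy of A. Since the (i,j) vectors are independent, each rank sweeps the k-recurrence over its own contiguous block of i rows, and the updated slabs are then re-replicated. N, sweep_k, icount and ilo are names made up for the sketch:

    #include <math.h>
    #include <mpi.h>

    #define N 64                      /* assumed; the real N is declared elsewhere */

    static double A[N][N][N];         /* full copy replicated on every rank */

    double sweep_k(int rank, int size)
    {
        int icount = N / size;        /* assume size divides N evenly */
        int ilo = rank * icount;
        double local_eps = 0.0;

        for (int i = ilo; i < ilo + icount; i++) {
            if (i < 1 || i > N - 2) continue;   /* global boundary rows stay fixed */
            for (int j = 1; j <= N - 2; j++)
                for (int k = 1; k <= N - 2; k++) {
                    double e = A[i][j][k];
                    A[i][j][k] = (A[i][j][k-1] + A[i][j][k+1]) / 2.;
                    double d = fabs(e - A[i][j][k]);
                    if (d > local_eps) local_eps = d;
                }
        }

        /* i-slabs are contiguous in A[i][j][k], so an in-place Allgather
         * re-replicates every rank's updated rows on all ranks. */
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      A, icount * N * N, MPI_DOUBLE, MPI_COMM_WORLD);
        return local_eps;
    }

The returned local_eps then goes through the reduction sketched above. Replicating A obviously wastes memory, so this only shows where the communication would go, not a scalable data layout.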

1 Answer


Your algorithm exhibits data dependence very similar to that of 3D FFT algorithms. No matter how you choose to distribute the data, the distribution will not be optimal for one of the loops. E.g. if you distribute along the k axis, then the first and the second loop can run in parallel, but not the last loop.
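
To illustrate: assume a slab decomposition along k, where each rank owns NK_LOCAL consecutive k planes in a local buffer (N, NK_LOCAL and Alocal are names invented for this sketch). The first two sweeps only read elements with the same k, so they run entirely on local data; the third sweep is a recurrence that crosses the slab boundaries:

    #define N 64                          /* assumed global size */
    #define NK_LOCAL (N / 4)              /* e.g. 4 ranks, N/4 k planes each */

    static double Alocal[N][N][NK_LOCAL]; /* this rank's slab of k planes */

    void local_sweeps(void)
    {
        /* first sweep: recurrence along i, fully local in a k-slab layout */
        for (int i = 1; i <= N - 2; i++)
            for (int j = 1; j <= N - 2; j++)
                for (int k = 0; k < NK_LOCAL; k++)
                    Alocal[i][j][k] = (Alocal[i-1][j][k] + Alocal[i+1][j][k]) / 2.;

        /* second sweep: recurrence along j, also fully local (for brevity
         * both sweeps touch every local k plane; a faithful version skips
         * the global k = 0 and k = N-1 planes on the edge ranks) */
        for (int i = 1; i <= N - 2; i++)
            for (int j = 1; j <= N - 2; j++)
                for (int k = 0; k < NK_LOCAL; k++)
                    Alocal[i][j][k] = (Alocal[i][j-1][k] + Alocal[i][j+1][k]) / 2.;

        /* the third sweep would need Alocal[i][j][k-1] from the previous
         * rank at the slab boundary, plane by plane in order, so it cannot
         * run in this distribution */
    }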

The solution is to transpose the matrix before the last loop. Global transposition of a distributed matrix is usually done using an all-to-all operation, either MPI_ALLTOALL or MPI_ALLTOALLV, depending on whether the matrix was divided into equally sized chunks or not (which generally depends on whether the matrix size is divisible by the number of MPI processes). Take a look at this excellent answer by Jonathan Dursi. It is for the 2D case but can easily be extended to 3D too.
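
For the equally sized case, the exchange looks roughly like the sketch below, written for the 2D case of the linked answer (3D adds one more index but follows the same pack / MPI_Alltoall / unpack pattern). The function and buffer names are made up for illustration, and N is assumed to be divisible by the number of ranks:

    #include <mpi.h>
    #include <stdlib.h>

    /* Each of the "size" ranks owns nrows = N/size rows of an N x N matrix;
     * after the exchange each rank owns nrows rows of the transpose. */
    void transpose_rows_to_cols(const double *rows, double *cols,
                                int N, MPI_Comm comm)
    {
        int size;
        MPI_Comm_size(comm, &size);
        int nrows = N / size;

        /* pack: block b collects the nrows x nrows tile destined for rank b */
        double *sendbuf = malloc((size_t)nrows * N * sizeof *sendbuf);
        double *recvbuf = malloc((size_t)nrows * N * sizeof *recvbuf);
        for (int b = 0; b < size; b++)
            for (int r = 0; r < nrows; r++)
                for (int c = 0; c < nrows; c++)
                    sendbuf[(b * nrows + r) * nrows + c] = rows[r * N + b * nrows + c];

        MPI_Alltoall(sendbuf, nrows * nrows, MPI_DOUBLE,
                     recvbuf, nrows * nrows, MPI_DOUBLE, comm);

        /* unpack: transpose each received tile into the output rows */
        for (int b = 0; b < size; b++)
            for (int r = 0; r < nrows; r++)
                for (int c = 0; c < nrows; c++)
                    cols[r * N + b * nrows + c] = recvbuf[(b * nrows + c) * nrows + r];

        free(sendbuf);
        free(recvbuf);
    }

After the exchange the formerly distributed axis is local on every rank, so the last loop can run without further communication; a second transpose brings the data back if the next iteration needs the original layout.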

answered 2012-11-23T12:34:23.703