mpi - 尽管 array_size 不能完全被进程数整除，但如何在 MPI 中的进程之间大致均匀地共享工作？

Question

大家好，我有一个长度为 N 的数组，我想在“大小”处理器之间尽可能地划分它。N/size 有余数，例如 1000 个数组元素除以 7 个进程，或 14 个进程除以 3 个进程。

我知道 MPI 中至少有两种工作共享方式，例如：

for (i=rank; i<N;i+=size){ a[i] = DO_SOME_WORK }

但是，这不会将数组分成连续的块，我想这样做，因为我认为出于 IO 原因更快。

我知道的另一个是：

int count = N / size;
int start = rank * count;
int stop = start + count;

// now perform the loop
int nloops = 0;

for (int i=start; i<stop; ++i)
{
    a[i] = DO_SOME_WORK;
}

但是，使用这种方法，对于我的第一个示例，我们得到 1000/7 = 142 = count。所以最后一个等级从 852 开始，到 994 结束。最后 6 行被忽略。

将这样的内容附加到以前的代码中是最好的解决方案吗？

int remainder = N%size;
int start = N-remainder; 
if (rank == 0){
     for (i=start;i<N;i++){
         a[i] = DO_SOME_WORK;
     }

这看起来很混乱，如果它是最好的解决方案，我很惊讶我在其他地方没有看到它。

谢谢你的帮助！

score 11 · Accepted Answer

如果我有N任务（例如，数组元素）和size工作人员（例如，MPI 等级），我会这样做：

int count = N / size;
int remainder = N % size;
int start, stop;

if (rank < remainder) {
    // The first 'remainder' ranks get 'count + 1' tasks each
    start = rank * (count + 1);
    stop = start + count;
} else {
    // The remaining 'size - remainder' ranks get 'count' task each
    start = rank * count + remainder;
    stop = start + (count - 1);
}

for (int i = start; i <= stop; ++i) { a[i] = DO_SOME_WORK(); }

这就是它的工作原理：

/*
  # ranks:                    remainder                     size - remainder
            /------------------------------------\ /-----------------------------\
     rank:      0         1             remainder-1                         size-1
           +---------+---------+-......-+---------+-------+-------+-.....-+-------+
    tasks: | count+1 | count+1 | ...... | count+1 | count | count | ..... | count |
           +---------+---------+-......-+---------+-------+-------+-.....-+-------+
                      ^       ^                            ^     ^
                      |       |                            |     |
   task #:  rank * (count+1)  |        rank * count + remainder  |
                              |                                  |
   task #:  rank * (count+1) + count   rank * count + remainder + count - 1

            \------------------------------------/ 
  # tasks:       remainder * count + remainder
*/

score 4 · Accepted Answer

这是一个封闭形式的解决方案。

令 N = 数组长度和 P = 处理器数量。

从j = 0 到P -1，

处理器上数组的起点j = floor( N * j / P )

处理器上的数组长度j = floor( N * ( j + 1) / P ) – floor( N * j / P )

score 3 · Accepted Answer

考虑您的“1000 个步骤和 7 个过程”示例。

简单的除法不起作用，因为整数除法（在 C 中）给了你发言权，你还剩下一些余数：即 1000 / 7 是 142，会有 6 个小玩意儿闲逛
天花板除法有相反的问题： ceil(1000/7) 是 143，但是最后一个处理器超出了阵列，或者最终比其他处理器少做。

您要求一种将剩余部分均匀分配给处理器的方案。一些流程应该有 142，其他流程应该有 143。必须有一个更正式的方法，但考虑到这个问题在过去六个月中得到的关注可能不是。

这是我的方法。每个进程都需要做这个算法，只为自己挑选出它需要的答案。

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char ** argv)
{
#define NR_ITEMS 1000
    int i, rank, nprocs;;
    int *bins;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    bins = calloc(nprocs, sizeof(int));

    int nr_alloced = 0;
    for (i=0; i<nprocs; i++) {
        remainder = NR_ITEMS - nr_alloced;
        buckets = (nprocs - i);
        /* if you want the "big" buckets up front, do ceiling division */
        bins[i] = remainder / buckets;
        nr_alloced += bins[i];
    }

    if (rank == 0)
        for (i=0; i<nprocs; i++) printf("%d ", bins[i]);

    MPI_Finalize();
    return 0;
}

score 1 · Accepted Answer

改进@Alexander 的回答：利用min来浓缩逻辑。

int count = N / size;
int remainder = N % size;
int start = rank * count + min(rank, remainder);
int stop = (rank + 1) * count + min(rank + 1, remainder);

for (int i = start; i < stop; ++i) { a[i] = DO_SOME_WORK(); }

score 1 · Accepted Answer

我知道这已经过去了，但是一个简单的方法是给每个进程（项目数）/（进程数）+（如果 process_num < num_items mod num_procs 则为 1）。在 python 中，具有工作计数的数组：

# Number of items
NI=128
# Number of processes
NP=20

# Items per process
[NI/NP + (1 if P < NI%NP else 0)for P in range(0,NP)]

score 0 · Accepted Answer

我认为最好的解决方案是为自己编写一个小函数，以便在进程之间足够均匀地分配工作。这是一些伪代码，我相信您可以比我更好地编写 C（您的问题中是 C 吗？）。

function split_evenly_enough(num_steps, num_processes)
    return = repmat(0, num_processes)  ! pseudo-Matlab for an array of num_processes 0s
    steps_per_process = ceiling(num_steps/num_processes)
    return = steps_per_process - 1 ! set all elements of the return vector to this number
    return(1:mod(num_steps, num_processes)) = steps_per_process  ! some processes have 1 more step
end

score 0 · Accepted Answer

这个怎么样？

int* distribute(int total, int processes) {
    int* distribution = new int[processes];
    int last = processes - 1;        

    int remaining = total;
    int process = 0;

    while (remaining != 0) {
        ++distribution[process];
        --remaining;

        if (process != last) {
            ++process;
        }
        else {
            process = 0;
        }
    }

    return distribution;
}

这个想法是您将一个元素分配给第一个进程，然后将一个元素分配给第二个进程，然后将一个元素分配给第三个进程，依此类推，只要到达最后一个进程，就会跳回第一个进程。

即使进程数大于元素数，此方法也有效。它只使用非常简单的操作，因此应该非常快。

score 0 · Accepted Answer

我遇到了类似的问题，这是我使用 Python 和 mpi4py API 的非最佳解决方案。最佳解决方案将考虑处理器的布局方式，此处将额外的工作分配给较低的等级。不均匀的工作量仅因一项任务而有所不同，因此一般情况下应该没什么大不了的。

from mpi4py import MPI
import sys
def get_start_end(comm,N):
    """
    Distribute N consecutive things (rows of a matrix , blocks of a 1D array)
    as evenly as possible over a given communicator.
    Uneven workload (differs by 1 at most) is on the initial ranks.

    Parameters
    ----------
    comm: MPI communicator
    N:  int
    Total number of things to be distributed.

    Returns
    ----------
    rstart: index of first local row
    rend: 1 + index of last row

    Notes
    ----------
    Index is zero based.
    """

    P      = comm.size
    rank   = comm.rank
    rstart = 0
    rend   = N
    if P >= N:
        if rank < N:
            rstart = rank
            rend   = rank + 1
        else:
            rstart = 0
            rend   = 0
    else:
        n = N//P # Integer division PEP-238
        remainder = N%P
        rstart    = n * rank
        rend      = n * (rank+1)
        if remainder:
            if rank >= remainder:
                rstart += remainder
                rend   += remainder
            else:
                rstart += rank
                rend   += rank + 1
    return rstart, rend

if __name__ == '__main__':
    comm = MPI.COMM_WORLD
    n = int(sys.argv[1])
    print(comm.rank,get_start_end(comm,n))

mpi - 尽管 array_size 不能完全被进程数整除，但如何在 MPI 中的进程之间大致均匀地共享工作？

8 回答 8

Related

Reference