java - Parallel sum of the elements of a large array

Question

I have a program that adds up the elements of a very large array. I want to parallelize this sum.

#define N some_very_large_no /* say 1e12 */
float x[N]; /* read from a file */
float sum = 0.0f;

int main(void)
{
    for (long long i = 0; i < N; i++)
        sum = sum + x[i];

    return 0;
}

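How can I parallelize this sum using threads (a code example in C, C++, or Java would all be fine)? If the machine has 8 cores, how many threads should I use for the best performance?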

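Edit: N can be really large (in fact larger than 1e6), and it varies with the size of the file I read the data from. The file size is on the order of GBs.

Edit: N changed to a larger value (1e12 to 1e16).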

score 3 · Accepted Answer

In Java you can write

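// requires: import java.util.*; import java.util.concurrent.*;
int cpus = Runtime.getRuntime().availableProcessors();
// I would keep this pool around for other tasks as well.
ExecutorService service = Executors.newFixedThreadPool(cpus);

final float[] floats = new float[N]; // filled from the file

List<Future<Double>> tasks = new ArrayList<>();
int blockSize = (floats.length + cpus - 1) / cpus;
// Submit one task per CPU; each task sums its own block of the array.
for (int i = 0; i < cpus; i++) {
    final int start = Math.min(blockSize * i, floats.length);
    final int end = Math.min(blockSize * (i + 1), floats.length);
    tasks.add(service.submit(new Callable<Double>() {
        public Double call() {
            double d = 0;
            for (int j = start; j < end; j++)
                d += floats[j];
            return d;
        }
    }));
}
// Combine the partial sums; get() blocks until each task has finished.
double sum = 0;
for (Future<Double> task : tasks)
    sum += task.get(); // may throw InterruptedException or ExecutionException
service.shutdown();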

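As WhozCraig mentioned, one million floats may not be enough to need multiple threads, or you may find that your bottleneck is how fast you can load the array from main memory (a single-threaded resource). In any case, you cannot assume it will be faster once you include the cost of getting the data.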

score 3 · Accepted Answer

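You say the array comes from a file. If you time the different parts of the program, you will find that the time needed to sum the elements is negligible compared with the time needed to read the data from disk. From Amdahl's law it follows that there is nothing to gain by parallelizing the summation.

If you need to improve performance, you should focus on improving I/O throughput.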
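To put a rough number on the Amdahl's law argument, here is a minimal sketch in Java. The 95%/5% split between reading and summing is a made-up figure for illustration, not a measurement, and the class and method names are hypothetical.

```java
// Amdahl's law: overall speedup when only part of the runtime is parallelized.
final class AmdahlEstimate {
    /**
     * @param parallelFraction share of the total runtime that can run in parallel (0..1)
     * @param workers          number of threads or cores applied to that share
     */
    static double speedup(double parallelFraction, int workers) {
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }

    public static void main(String[] args) {
        // Assumed split (not measured): 95% reading the file, 5% summing, 8 cores.
        System.out.println("speedup = " + speedup(0.05, 8));
    }
}
```

Under that assumed split, 8 cores give an overall speedup of only about 1.05, which is why timing the read and the sum separately is worth doing first.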

score 2 · Accepted Answer

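You can use many threads (more than the number of cores). But the number of threads and their performance depend on your algorithm and on how the threads work. Since the array length is 100000, create x threads, each of which computes arr[x] to arr[x+limit]. You must set the limit so that no range overlaps another thread's range and no element is left unused. Thread creation: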

#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define COUNT 8      /* number of threads (example value) */
#define LIMIT 12500  /* elements per thread: 100000 / COUNT */

void *doSomeThing(void *arg);

int main(void)
{
    pthread_t tid[COUNT];
    int start[COUNT]; /* start index of each thread; must outlive the threads */
    int i = 0;
    int err;

    while (i < COUNT)
    {
        start[i] = i * LIMIT; /* tells this thread where in arr[] it begins */
        err = pthread_create(&tid[i], NULL, &doSomeThing, &start[i]);
        if (err != 0)
            printf("\ncan't create thread :[%s]", strerror(err));
        i++;
    }
    /* NOW CALCULATE.... */
    for (i = 0; i < COUNT; i++)
    {
        pthread_join(tid[i], NULL);
    }
    return 0;
}

void *doSomeThing(void *arg)
{
    int *x = (int *) arg;
    /* now use *x to sum the array from arr[*x] up to your limit,
       which should not overlap with any other thread */
    (void) x;
    return NULL;
}

score 0 · Accepted Answer

Use a divide-and-conquer algorithm:

Divide the array into 2 or more parts (keep dividing recursively until you get an array of manageable size).
Start computing the sums of the sub-arrays using separate threads.
Finally, add the sums generated by all the threads together to produce the final result.
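The steps above map fairly directly onto Java's fork/join framework. The following is only a sketch of one way to do it: the class name `ForkJoinSum` is hypothetical and the `THRESHOLD` value is an untuned guess, not a recommendation.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Recursively splits the index range in half until it is small enough to sum directly.
final class ForkJoinSum extends RecursiveTask<Double> {
    // Ranges at or below this size are summed sequentially (assumed value, not tuned).
    private static final int THRESHOLD = 100_000;

    private final float[] data;
    private final int start; // inclusive
    private final int end;   // exclusive

    ForkJoinSum(float[] data, int start, int end) {
        this.data = data;
        this.start = start;
        this.end = end;
    }

    @Override
    protected Double compute() {
        if (end - start <= THRESHOLD) {
            double d = 0; // accumulate in double to limit rounding error
            for (int i = start; i < end; i++)
                d += data[i];
            return d;
        }
        int mid = start + (end - start) / 2;
        ForkJoinSum left = new ForkJoinSum(data, start, mid);
        ForkJoinSum right = new ForkJoinSum(data, mid, end);
        left.fork();                       // schedule the left half on the pool
        double rightSum = right.compute(); // sum the right half in this thread
        return left.join() + rightSum;     // combine the two partial sums
    }

    static double sum(float[] data) {
        return ForkJoinPool.commonPool().invoke(new ForkJoinSum(data, 0, data.length));
    }
}
```

Calling `ForkJoinSum.sum(floats)` runs the tasks on the common pool, which by default is sized according to the number of available processors, so no thread count has to be chosen by hand.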

score 0 · Accepted Answer

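As others have said, the time cost of reading the file will almost certainly be much larger than the time cost of computing the sum. Is it a text file or a binary file? If the numbers are stored as text, then reading them can be very expensive, depending on your implementation.

You should also be careful when adding up a large number of floats. Because of their limited precision, small values later in the array may not contribute to the sum. Consider using at least double precision to accumulate the values.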
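The precision problem can be reproduced with a small Java sketch. The input is contrived for illustration: one large value (2^24) followed by 1000 ones; the class and method names are hypothetical.

```java
// Compares accumulating float data in a float versus in a double.
final class FloatAccumulation {
    static float sumAsFloat(float[] data) {
        float s = 0f;
        for (float v : data)
            s += v; // each addition is rounded to float precision
        return s;
    }

    static double sumAsDouble(float[] data) {
        double s = 0;
        for (float v : data)
            s += v; // each element is widened to double before adding
        return s;
    }

    public static void main(String[] args) {
        float[] data = new float[1001];
        data[0] = 16_777_216f; // 2^24: above this, float cannot represent every integer
        java.util.Arrays.fill(data, 1, data.length, 1f);
        System.out.println("float accumulator:  " + sumAsFloat(data));
        System.out.println("double accumulator: " + sumAsDouble(data));
    }
}
```

With the float accumulator the 1000 ones are lost entirely and the result stays at 16777216, while the double accumulator returns 16778216.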

score 0 · Accepted Answer

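You can use pthreads in C to solve your problem. Here is my code for N=4 (you can change it to suit your needs). To build and run this code, use the following commands: gcc -pthread test.c -o test, then ./test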

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define NUM_THREADS 5
#define CHUNK 500 /* number of elements summed by each thread */

pthread_t threads[NUM_THREADS];
pthread_mutex_t mutexsum;
int a[NUM_THREADS * CHUNK];
int sum = 0;

void *do_work(void *parms) {
    long tid = (long)parms;
    printf("I am thread # %ld\n", tid);

    int start = (int)tid * CHUNK;
    int end = start + CHUNK;
    int mysum = 0; /* must be initialized before accumulating */

    printf("Thread # %ld with start = %d and end = %d\n", tid, start, end);
    for (int i = start; i < end; i++) {
        mysum += a[i];
    }

    /* Only the final addition to the shared total needs the lock. */
    pthread_mutex_lock(&mutexsum);
    printf("Thread # %ld lock and sum = %d\n", tid, sum);
    sum += mysum;
    pthread_mutex_unlock(&mutexsum);

    return NULL;
}

int main(void) {
    long i; /* long so it can be passed through void * without a size mismatch */
    int rc;
    pthread_attr_t attr;

    pthread_mutex_init(&mutexsum, NULL);
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    printf("Initializing array :\n");
    for (i = 0; i < NUM_THREADS * CHUNK; i++) {
        a[i] = 1;
    }

    for (i = 0; i < NUM_THREADS; i++) {
        printf("Creating thread # %ld.\n", i);
        rc = pthread_create(&threads[i], &attr, do_work, (void *)i);
        if (rc) {
            printf("Error in thread %ld with rc = %d.\n", i, rc);
            exit(-1);
        }
    }
    pthread_attr_destroy(&attr);

    printf("Creating threads complete. Start run\n");
    for (i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }

    printf("\n\tSum : %d\n", sum);
    pthread_mutex_destroy(&mutexsum);
    return 0;
}

score 0 · Accepted Answer

OpenMP has built-in support for reductions. Add the flag -fopenmp when compiling.

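#include <omp.h>
#define N some_very_large_no /* say 1e12 */
float x[N]; /* read from a file */

int main(void)
{
    float sum = 0.0f;
    /* Each thread gets a private copy of sum; the copies are added at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (long long i = 0; i < N; i++)
        sum = sum + x[i];

    return 0;
}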
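For comparison, Java offers a similar built-in reduction through parallel streams. This is a different mechanism from OpenMP, shown here only as a sketch; the class name `StreamSum` is hypothetical. A single Java array cannot hold more than about 2^31 elements, so for the sizes mentioned in the question the data would have to be processed in chunks.

```java
import java.util.stream.IntStream;

// Sums a float array with a parallel stream reduction, accumulating in double.
final class StreamSum {
    static double sum(float[] data) {
        return IntStream.range(0, data.length)
                .parallel()                  // split the index range across worker threads
                .mapToDouble(i -> data[i])   // widen each float to double
                .sum();                      // built-in reduction over the partial results
    }
}
```

As with the OpenMP pragma, the runtime decides how to split the work, so the caller does not pick a thread count.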
