
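I am trying to write a simple application using OpenMP. Unfortunately, I have a problem with speedup. The application has a while loop whose body consists of some instructions that should be executed sequentially and one for loop. I use #pragma omp parallel for to make this for loop parallel. The loop does not do much work, but it is called very often.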

I prepared two versions of the for loop and ran the application on 1, 2 and 4 cores.
Version 1 (4 iterations in the for loop): 22 s, 23 s, 26 s.
Version 2 (100000 iterations in the for loop): 20 s, 10 s, 6 s.

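As you can see, when the for loop does little work, the run time on 2 and 4 cores is higher than on 1 core. I guess the reason is that #pragma omp parallel for creates new threads in every iteration of the while loop. So I would like to ask: is it possible to create the threads once (before the while loop) and still ensure that some of the work inside the while loop is done sequentially?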

#include <omp.h>
#include <iostream>
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
int main(int argc, char* argv[])
{
    double sum = 0;
    while (true)
    {
        // ...
        // some work which should be done sequentially
        // ...

        #pragma omp parallel for num_threads(atoi(argv[1])) reduction(+:sum)
        for(int j=0; j<4; ++j)  // version 2: for(int j=0; j<100000; ++j)
        {
            double x = pow(j, 3.0);
            x = sqrt(x);
            x = sin(x);
            x = cos(x);
            x = tan(x);
            sum += x;

            double y = pow(j, 3.0);
            y = sqrt(y);
            y = sin(y);
            y = cos(y);
            y = tan(y);
            sum += y;

            double z = pow(j, 3.0);
            z = sqrt(z);
            z = sin(z);
            z = cos(z);
            z = tan(z);
            sum += z;
        }

        if (sum > 100000000)
        {
            break;
        }
    }
    return 0;
}

2 Answers


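Most OpenMP implementations create a number of threads when the program starts and keep them for the duration of the run. That is, most implementations do not dynamically create and destroy threads during execution; doing so would hurt performance through severe thread-management costs. This approach to thread management is consistent with, and appropriate for, the usual use cases for OpenMP.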

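The slowdown you see as you increase the number of OpenMP threads most likely comes down to imposing parallel overhead on a loop with only a few iterations. Hristo's answer covers this.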
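One way to see this overhead directly is to time many repetitions of a tiny parallel loop and compare builds with and without OpenMP enabled (for GCC, with and without -fopenmp). The sketch below is only an illustration: time_parallel_loop and its parameters are hypothetical names, not part of the question's code, and the loop body is a stand-in for the real work.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

// Hypothetical helper: runs a parallel loop of `iters` iterations `reps` times
// and returns the elapsed wall-clock time in seconds. The accumulated sum is
// written to *result so the compiler cannot discard the work.
double time_parallel_loop(int iters, int reps, double* result)
{
    assert(result != nullptr);
    double sum = 0.0;
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
    {
        // A parallel region is entered and left on every repetition,
        // just as in the while loop from the question.
        #pragma omp parallel for reduction(+:sum)
        for (int j = 0; j < iters; ++j)
        {
            sum += std::sin(static_cast<double>(j));
        }
    }
    auto stop = std::chrono::steady_clock::now();
    *result = sum;
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling it with a small iters (for example 4) and a large reps, then again with a large iters and a small reps, should show where the per-region cost stops being negligible on a given machine.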

answered 2012-05-15T08:09:40.633

You could move the parallel region outside of the while (true) loop and use the single directive to make the serial part of the code execute in one thread only. This removes the overhead of repeatedly forking and joining the thread team. Also, OpenMP is not really useful on tight loops with a very small number of iterations (like your version 1). You are basically measuring the OpenMP overhead, since the work inside the loop is done really fast: even 100000 iterations with transcendental functions take less than a second on a current-generation CPU. At 2 GHz and roughly 100 cycles per FP instruction other than addition, the 15 function calls per iteration come to about 1.5 × 10^8 cycles, so the loop takes on the order of 100 ms.
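A minimal sketch of that restructuring follows. It is not the asker's exact program: accumulate_until, limit, iters and max_rounds are hypothetical names, the loop body is a simplified stand-in, and max_rounds is added only so the example always terminates. The implicit barrier at the end of each single and for construct is what keeps the shared variables consistent across threads.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: repeats a small work-shared loop until the running sum
// exceeds `limit` or `max_rounds` passes have been made, and returns the sum.
double accumulate_until(double limit, int iters, int max_rounds)
{
    assert(iters >= 0 && max_rounds > 0);
    double sum = 0.0;   // shared by all threads
    bool done = false;  // shared loop-exit flag
    int rounds = 0;     // shared, only modified inside single blocks

    #pragma omp parallel  // the thread team is created once, before the loop
    {
        while (!done)
        {
            #pragma omp single
            {
                // Sequential part: executed by exactly one thread.
                ++rounds;
            }   // implicit barrier: the other threads wait here

            // Work-sharing loop reusing the existing team (no new fork/join).
            #pragma omp for reduction(+:sum) schedule(static)
            for (int j = 0; j < iters; ++j)
            {
                sum += std::fabs(std::sin(static_cast<double>(j))) + 1.0;
            }   // implicit barrier: the reduction into sum is complete

            #pragma omp single
            {
                // One thread decides whether to stop.
                if (sum > limit || rounds >= max_rounds)
                    done = true;
            }   // implicit barrier: every thread then sees the same `done`
        }
    }
    return sum;
}
```

The barriers still cost something on every pass, so for a four-iteration loop this reduces the overhead rather than eliminating it.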

That's why OpenMP provides the if(condition) clause that can be used to selectively turn off the parallelisation for small loops:

#pragma omp parallel for ... if(loopcnt > 10000)
for (i = 0; i < loopcnt; i++)
   ...

It is also advisable to use schedule(static) for regular loops (that is, for loops in which every iteration takes about the same time to compute).
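Put together, the if clause and schedule(static) might be combined as in the sketch below. The function name sum_of_roots is hypothetical and the 10000 threshold is simply the value from the snippet above; a sensible cutoff depends on the machine and on how expensive each iteration is.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: sums sqrt(i) for i in [0, loopcnt).
// The loop is parallelised only when loopcnt exceeds the threshold;
// otherwise the if clause makes the region run on a single thread.
double sum_of_roots(int loopcnt)
{
    assert(loopcnt >= 0);
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum) schedule(static) if(loopcnt > 10000)
    for (int i = 0; i < loopcnt; ++i)
    {
        // Every iteration costs about the same, so a static schedule fits.
        sum += std::sqrt(static_cast<double>(i));
    }
    return sum;
}
```

With loopcnt = 4 the loop runs on one thread, while loopcnt = 100000 is split evenly across the team.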

answered 2012-05-14T20:49:23.560