multithreading - OpenMP performance gets worse when using more threads (following an OpenMP tutorial)

Question

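I am getting started with OpenMP and am following this tutorial:

I am coding exactly what is shown in the video, but instead of getting better performance with more threads, it gets worse. I don't understand why.

Here is my code: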

#include <iostream>
#include <time.h>
#include <omp.h>

using namespace std;

static long num_steps = 100000000;
double step;

#define NUM_THREADS 2

int main()
{
    clock_t t;
    t = clock();
    int i, nthreads; double pi, sum[NUM_THREADS];
    step = 1.0/(double)num_steps;

    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if(id == 0) nthreads = nthrds;
        for(i=id, sum[id]=0.0; i < num_steps; i = i + nthrds)
        {
            x = (i+0.5)*step;
            sum[id] += 4.0/(1.0+x*x);
        }
    }
    for(i = 0, pi=0.0; i<nthreads; i++) pi += sum[i] * step;

    t = clock() - t;
    cout << "time: " << t << " miliseconds" << endl;

}

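As you can see, it is exactly the same as in the video; I only added code to measure the elapsed time.

In the tutorial, the more threads are used, the better the performance gets.

In my case, that does not happen. These are the timings I get: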

1 thread:   433590 miliseconds
2 threads: 1705704 miliseconds
3 threads: 2689001 miliseconds
4 threads: 4221881 miliseconds

Why am I seeing this behaviour?

- Edit -

gcc version: gcc 5.5.0

Output of lscpu:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Model name: Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz
Stepping: 3
CPU MHz: 2594.436
CPU max MHz: 3600,0000
CPU min MHz: 800,0000
BogoMIPS: 5188.41
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7

- Edit -

I tried using omp_get_wtime(), as follows:

#include <iostream>
#include <time.h>
#include <omp.h>

using namespace std;

static long num_steps = 100000000;
double step;

#define NUM_THREADS 8

int main()
{
    int i, nthreads; double pi, sum[NUM_THREADS];
    step = 1.0/(double)num_steps;
    double start_time = omp_get_wtime();

    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if(id == 0) nthreads = nthrds;
        for(i=id, sum[id]=0.0; i < num_steps; i = i + nthrds)
        {
            x = (i+0.5)*step;
            sum[id] += 4.0/(1.0+x*x);
        }
    }
    for(i = 0, pi=0.0; i<nthreads; i++) pi += sum[i] * step;
    double time = omp_get_wtime() - start_time;

    cout << "time: " << time << " seconds" << endl;

}

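The behaviour is different now, although I still have some questions.

If I increase the number of threads one at a time (1 thread, 2 threads, 3, 4, ...), the result is basically the same as before: performance gets worse. However, if I go up to 64 or 128 threads, I do get better performance; the time drops from 0.44 s (for 1 thread) to 0.13 s (for 128 threads).

My question is: why does my program behave differently from the one in the tutorial? There,

2 threads perform better than 1,
3 threads perform better than 2, and so on.

Why do I only get better performance with many more threads?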

score 0 · Accepted Answer

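Instead of getting better performance with more threads, it gets worse ... I don't understand why.

Well, let's make the test a bit more systematic and repeatable, and see if: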

// time: 1535120 milliseconds    1 thread
// time:  200679 milliseconds    1 thread  -O2  
// time:  191205 milliseconds    1 thread  -O3
// time:  184502 milliseconds    2 threads -O3
// time:  189947 milliseconds    3 threads -O3 
// time:  202277 milliseconds    4 threads -O3 
// time:  182628 milliseconds    5 threads -O3
// time:  192032 milliseconds    6 threads -O3
// time:  185771 milliseconds    7 threads -O3
// time:  187606 milliseconds   16 threads -O3
// time:  187231 milliseconds   32 threads -O3
// time:  186131 milliseconds   64 threads -O3

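For reference: these samples were run on the TiO.RUN platform, which is meant for quick mock-ups and whose limited resources impose a certain glass ceiling on what can be measured.

The numbers show the effect of the -O2 / -O3 compiler optimisation modes far more than the major degradation with a growing number of threads that was reported above.

What remains is "background" noise from the unmanaged code-execution ecosystem, where the O/S can easily skew simplistic performance benchmarks.

If you are interested in more detail, feel free to read about the law of diminishing returns (which deals with the real-world split of a process schedule into its [SERIAL] and [PARALLEL] parts). Dr. Gene AMDAHL formulated the principal rule for why more threads do not automatically yield better performance, and a more contemporary re-formulation of that law explains why more threads may even yield a negative improvement (the add-on overheads cost more than the parallel part saves) compared with a properly tuned peak performance.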
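To make the rule above concrete, here is a minimal sketch of the classic Amdahl formula, speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction and n the number of workers. The function name `amdahl_speedup` and the `overhead_per_worker` parameter are assumptions made for illustration; the linear overhead term is only a toy model of the "add-on overheads" idea, not the exact re-formulation the answer refers to.

```cpp
#include <cassert>

// Classic Amdahl's-law speedup for a workload whose fraction `p` can run
// in parallel on `n` workers.
// `overhead_per_worker` is a HYPOTHETICAL extra cost (as a fraction of the
// single-thread run time) charged for every additional worker. It is an
// illustrative assumption, not part of Amdahl's original law.
double amdahl_speedup(double p, int n, double overhead_per_worker = 0.0) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    const double serial   = 1.0 - p;                       // cannot be parallelised
    const double parallel = p / n;                         // shrinks with more workers
    const double overhead = overhead_per_worker * (n - 1); // grows with more workers
    return 1.0 / (serial + parallel + overhead);
}
```

With the overhead left at zero the speedup can never exceed 1 / (1 - p). With a non-zero overhead the curve peaks and then falls, and for a large enough n it can drop below 1, which is the "negative improvement" case.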

#include <time.h>
#include <omp.h>

#include <stdio.h>
#include <stdlib.h>

using namespace std;

static long   num_steps = 100000000;
       double step;

#define NUM_THREADS 7

int main()
{
    clock_t t;
    t = clock();

    int i, nthreads; double pi, sum[NUM_THREADS];
    step = 1.0 / ( double )num_steps;

    omp_set_num_threads( NUM_THREADS );

 // struct timespec                  start;
 // t = clock(); // _________________________________________ BEST START HERE
 // clock_gettime( CLOCK_MONOTONIC, &start ); // ____________ USING MONOTONIC CLOCK
    #pragma omp parallel
    {
        int    i,
               nthrds = omp_get_num_threads(),
               id     = omp_get_thread_num();
        double x;

        if ( id == 0 ) nthreads = nthrds;

        for ( i =  id, sum[id] = 0.0;
              i <  num_steps;
              i += nthrds
              )
        {
            x = ( i + 0.5 ) * step;
            sum[id] += 4.0 / ( 1.0 + x * x );
        }
    }

 // t = clock() - t; // _____________________________________ BEST STOP HERE
 // clock_gettime( CLOCK_MONOTONIC, &end ); // ______________ USING MONOTONIC CLOCK
    for ( i =  0, pi = 0.0;
          i <  nthreads;
          i++
          ) pi += sum[i] * step;

    t = clock() - t;
 //                                                  // time: 1535120 milliseconds    1 thread
 //                                                  // time:  200679 milliseconds    1 thread  -O2  
 //                                                  // time:  191205 milliseconds    1 thread  -O3
    printf( "time: %ld milliseconds %d threads\n",   // time:  184502 milliseconds    2 threads -O3
             (long)t,                                // time:  189947 milliseconds    3 threads -O3 
             NUM_THREADS                             // time:  202277 milliseconds    4 threads -O3 
             );                                      // time:  182628 milliseconds    5 threads -O3
}                                                    // time:  192032 milliseconds    6 threads -O3
                                                     // time:  185771 milliseconds    7 threads -O3
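The commented-out CLOCK_MONOTONIC lines in the listing hint at timing only the region of interest with a monotonic wall clock. Note that clock() returns processor time in CLOCKS_PER_SEC ticks, and on typical Linux/glibc systems that is CPU time accumulated by all threads of the process, so it can grow with the thread count even when the elapsed time shrinks. The sketch below uses std::chrono::steady_clock instead of clock_gettime; the helper names `wall_seconds` and `integrate_pi` are assumptions made for this example.

```cpp
#include <cassert>
#include <chrono>

// Runs `work` once and returns the elapsed wall-clock time in seconds,
// measured with a monotonic clock (unaffected by system clock adjustments).
template <typename Work>
double wall_seconds(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Serial midpoint-rule integration of 4/(1+x^2) over [0,1], i.e. the same
// computation as in the listing. It serves only as a workload to be timed.
double integrate_pi(long num_steps) {
    assert(num_steps > 0);
    const double step = 1.0 / static_cast<double>(num_steps);
    double sum = 0.0;
    for (long i = 0; i < num_steps; ++i) {
        const double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * step;
}
```

A call such as `double secs = wall_seconds([&] { pi = integrate_pi(100000000); });` times just the integration, leaving setup and printing outside the measurement.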

score 0 · Accepted Answer

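The main problem with this version is false sharing. It is explained later in the video series you started watching. You get it when many threads access data that sits next to each other in memory (here, the sum array). The video also explains how to avoid the problem manually by using padding.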
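As a rough illustration of the padding idea, the sketch below gives every thread's partial sum its own cache line. It is not the code from the video: the 64-byte line size, the `kMaxThreads` bound, and the names `PaddedSum` and `pi_padded` are all assumptions chosen for this example. The `#ifdef _OPENMP` guards let it build, and run serially, when OpenMP is not enabled.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

constexpr int kMaxThreads = 8;   // assumed upper bound on the team size
constexpr int kCacheLine  = 64;  // assumed cache-line size in bytes

// alignas pads each element to a full cache line, so neighbouring
// threads no longer write into the same line.
struct alignas(kCacheLine) PaddedSum {
    double value = 0.0;
};

double pi_padded(long num_steps, int requested_threads) {
    assert(num_steps > 0);
    if (requested_threads < 1) requested_threads = 1;
    if (requested_threads > kMaxThreads) requested_threads = kMaxThreads;

    const double step = 1.0 / static_cast<double>(num_steps);
    PaddedSum sum[kMaxThreads];
    int nthreads = 1;

    #pragma omp parallel num_threads(requested_threads)
    {
        int id = 0, nthrds = 1;
#ifdef _OPENMP
        id     = omp_get_thread_num();
        nthrds = omp_get_num_threads();  // may be fewer than requested
#endif
        if (id == 0) nthreads = nthrds;
        // Same cyclic distribution of iterations as in the question.
        for (long i = id; i < num_steps; i += nthrds) {
            const double x = (i + 0.5) * step;
            sum[id].value += 4.0 / (1.0 + x * x);
        }
    }

    double pi = 0.0;
    for (int i = 0; i < nthreads; ++i) pi += sum[i].value * step;
    return pi;
}
```

The only change relative to the question's loop is the element type of the sum array; the work distribution is untouched, which is what makes this a useful before/after comparison when timing.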

That said, the idiomatic solution is to use a reduction and not even bother with manual work sharing:

double sum = 0;
#pragma omp parallel for reduction(+:sum)
for(int i=0; i < num_steps; i++)
{
    double x = (i+0.5)*step;
    sum += 4.0/(1.0+x*x);
}

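This is also explained in a later video of the series. It is much simpler than the approach you started with, and most likely the most efficient one.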
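For completeness, here is the reduction loop wrapped into a self-contained function, as a sketch. The function name `pi_reduction` and passing `num_steps` as a parameter are choices made for this example; the loop index is widened to long to match the type of num_steps in the question.

```cpp
#include <cassert>

// Midpoint-rule approximation of pi. Each thread accumulates into its own
// private copy of `sum`; OpenMP combines the copies when the loop ends.
// Without -fopenmp the pragma is ignored and the loop simply runs serially.
double pi_reduction(long num_steps) {
    assert(num_steps > 0);
    const double step = 1.0 / static_cast<double>(num_steps);
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_steps; ++i) {
        const double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * step;
}
```

Because the per-thread accumulators are private, there is no shared array for the threads to contend on, so false sharing does not arise in the first place.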

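While the presenter is certainly competent, the style of these OpenMP tutorial videos is very bottom-up. I am not sure that is a good educational approach. In any case, you should watch all of the videos to learn how best to use OpenMP in practice.

Why do I only get better performance with many more threads?

This is a bit counter-intuitive: you rarely get better performance from using more OpenMP threads than there are hardware threads, unless doing so indirectly fixes another problem. In your case, a large number of threads means the sum array is spread over a larger region of memory, so false sharing becomes less likely.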
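A little arithmetic shows why the array size matters. The sketch below assumes 64-byte cache lines, 8-byte doubles, and a sum array that starts on a line boundary; none of this is guaranteed by the question's code, and the helper names `cache_line_of` and `lines_spanned` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLineBytes = 64;  // assumed cache-line size

// Index of the cache line that holds sum[id], assuming the array of
// doubles begins exactly at a cache-line boundary.
constexpr std::size_t cache_line_of(std::size_t id) {
    return id * sizeof(double) / kCacheLineBytes;
}

// Number of distinct cache lines covered by sum[0 .. nthreads-1].
std::size_t lines_spanned(std::size_t nthreads) {
    assert(nthreads >= 1);
    return cache_line_of(nthreads - 1) + 1;
}
```

Under these assumptions, 8 threads all write into a single line, while 128 threads spread their sums over 16 lines. Since only 8 hardware threads run at any moment, the ones running together are then less likely to be hitting the same line.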
