c++ - OpenMP performance

Question

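First, I know this [type of] question is asked frequently, so let me preface it by saying that I have read as much as I can and I still don't know what the deal is.

I've parallelized a massive outer for loop. The number of loop iterations varies, typically between 20 and 150, but the loop body does a huge amount of work, calling many local, intensive linear algebra routines (local in the sense that the code is part of the source, not an external dependency). Within the loop body there are 1000+ calls to these routines, but they are all completely independent of one another, so I figured this would be a prime candidate for parallelism. The loop code is C++, but it calls a lot of subroutines written in C.

The code looks like this: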

<declare and initialize shared variables here>
#ifdef _OPENMP
#pragma omp parallel for                            \
  private(....)\
  shared(....)              \
  firstprivate(....) schedule(runtime)
#endif
  for(tst = 0; tst < ntest; tst++) {

     // Lots of functionality (science!)
     // Calls to other deep functions which manipulate private variables only
     // Call to function which has 1000 loop iterations doing matrix manipulation
     // With no exaggeration, there are probably millions 
     // of for-loop iterations in this body, in the various functions called. 
     // They also do lots of mallocing and freeing
     // Finally generated some calculated_values

     shared_array1[tst] = calculated_value1;
     shared_array2[tst] = calculated_value2;
     shared_array3[tst] = calculated_value3;

 } // end of parallel and for

// final tidy up

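I believe there shouldn't be any synchronization at all. The only shared variables the threads touch are the shared_arrays, and each thread writes only to unique elements of those arrays, indexed by tst.

The thing is, when I increase the number of threads (on a multicore cluster!), the timings we see (for a run that invokes this loop 5 times) are as follows: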

              Elapsed time   System time
 Serial:        188.149          1.031
 2 thrds:       148.542          6.788
 4 thrds:       309.586        424.037       # SAY WHAT?
 8 thrds:       230.290        568.166  
16 thrds:       219.133        799.780

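Note how the system time jumps massively between 2 and 4 threads, and how the elapsed time doubles as we go from 2 to 4 threads and then slowly decreases.

I've tried a wide range of OMP_SCHEDULE settings, with no luck. Is this related to the fact that each thread uses malloc/new and free/delete a lot? This has consistently been run with 8 GB of memory, but I'm guessing that's not the problem. Frankly, the huge rise in system time makes it look like the threads are blocking, but I have no idea why that would happen.

Update 1: I really thought false sharing was going to be the problem, so I rewrote the code so that the loops store their calculated values in thread-local arrays and then copy those arrays into the shared arrays at the end. Sadly, this made no difference, though I almost don't believe it myself.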
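
For readers who want to see what the Update 1 rewrite amounts to, here is a minimal sketch. It is not the code from the question: compute_value is a hypothetical stand-in for the real loop body, and run_tests is an illustrative name. Each thread collects its results in a private buffer and writes them to the shared array in one pass after its share of the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the expensive per-iteration work.
double compute_value(int tst) {
    double acc = 0.0;
    for (int k = 1; k <= 1000; ++k) acc += static_cast<double>(tst) / k;
    return acc;
}

// Each thread gathers (index, value) pairs privately, then publishes them.
std::vector<double> run_tests(int ntest) {
    assert(ntest >= 0);
    std::vector<double> shared_array(static_cast<std::size_t>(ntest), 0.0);

    #pragma omp parallel
    {
        // Thread-local buffer: no other thread ever touches it.
        std::vector<std::pair<int, double>> local;

        #pragma omp for schedule(static) nowait
        for (int tst = 0; tst < ntest; ++tst) {
            local.emplace_back(tst, compute_value(tst));
        }

        // Every index belongs to exactly one thread, so these writes do not
        // race; the implicit barrier closing the parallel region is the only
        // synchronization point.
        for (const auto& entry : local) {
            shared_array[static_cast<std::size_t>(entry.first)] = entry.second;
        }
    }
    return shared_array;
}
```

Compiled without OpenMP support the pragmas are ignored and the function simply runs serially, which makes it easy to compare the two builds.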

Following @cmeerw's advice, I ran strace -f, and after all the initialization there are just millions of lines like these:

[pid 58067] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58066] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 58065] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 57684] <... futex resumed> )       = 0
[pid 58067] <... futex resumed> )       = 0
[pid 58066] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58065] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58067] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58066] <... futex resumed> )       = 0
[pid 57684] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 58065] <... futex resumed> )       = 0
[pid 58067] <... futex resumed> )       = 0
[pid 57684] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 58066] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 58065] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58066] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 57684] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58065] <... futex resumed> )       = 0
[pid 58066] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 57684] <... futex resumed> )       = 0
[pid 58067] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 58066] <... futex resumed> )       = 0
[pid 58065] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58067] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 58066] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 57684] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58065] <... futex resumed> )       = 0
[pid 58067] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58066] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 57684] <... futex resumed> )       = 0
[pid 58067] <... futex resumed> )       = 0
[pid 58066] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58065] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 58066] <... futex resumed> )       = 0
[pid 58065] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 58066] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 57684] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 58067] futex(0x35ca58bb40, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 58066] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 58065] futex(0x35ca58bb40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 57684] <... futex resumed> )       = 0

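Does anyone have any idea what this means? It looks like the threads are context switching far too often, or just blocking and unblocking. When I strace the same implementation with OMP_NUM_THREADS set to 0, I get none of this at all. For comparison, the log file generated with 1 thread is 486 KB, and the log file generated with 4 threads is 266 MB.

In other words, the parallel version produces an extra 4170104 lines of log file...

Update 2

As suggested by Tom, I tried binding threads to specific processors, to no avail. We're on OpenMP 3.1, so I used export OMP_PROC_BIND=true. Same log file size and same timings.

Update 3

The plot thickens. Having profiled only on the cluster so far, I installed GNU GCC 4.7 via Macports and compiled (with OpenMP) on my Macbook for the first time. (Apple's GCC-4.2.1 throws a compiler error when OpenMP is enabled, which is why I hadn't compiled and run it in parallel locally until now.) On the Macbook, you see basically the trend you would expect: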

                C-code time
 Serial:         ~34 seconds
 2 thrds:        ~21 seconds
 4 thrds:        ~14 seconds
 8 thrds:        ~12 seconds
16 thrds:         ~9 seconds

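We see diminishing returns towards the end, though this is hardly surprising, as a couple of the data sets we iterate over in this test data have fewer than 16 members (so we are spawning 16 threads for, say, a for-loop with 7 iterations).

So the question remains: why does the cluster's performance degrade so badly? I'm going to try a different quad-core Linux box tonight. The cluster compiles with GNU GCC 4.6.3, but I can't believe that alone would make such a difference.

Neither ltrace nor GDB is installed on the cluster (and for various reasons I can't install them). If my Linux box shows cluster-like performance, I'll run the corresponding ltrace analysis there.

Update 4

Oh my. I dual-booted my Macbook Pro into Ubuntu (12.04) and re-ran the code. It all runs (which is somewhat reassuring), but I see the same weird bad-performance behavior as on the cluster, and the same run of millions of futex calls. Given that the only difference between my local machine under Ubuntu and under OSX is software (and I'm using the same compiler and libraries; presumably there aren't different glibc implementations for OSX and Ubuntu!), I'm now wondering whether this has something to do with how Linux schedules/distributes threads. Anyway, being on my local machine makes everything a million times easier, so I'm going to go ahead and run ltrace -f on it and see what I can find. I wrote a workaround for the cluster that fork()s off a separate process and cuts the runtime exactly in half, so it is definitely possible to get the parallelism going...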

score 8 · Accepted Answer

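So, after some fairly extensive profiling (thanks to this great post for the info on gprof and time sampling with gdb), which involved writing a big wrapper function to generate production-level code for profiling, it became obvious that the vast majority of the times I interrupted the running code with gdb and ran backtrace, the stack was inside an STL <vector> call, manipulating a vector in some way.

The code passes a few vectors into the parallel section as private variables, which seemed to work fine. However, after pulling out all the vectors and replacing them with arrays (plus some other jiggery-pokery to make that work), I saw a significant speedup. With small, artificial data sets the speedup is near perfect (i.e. doubling the number of threads halves the time), while with real data sets the speedup isn't quite as good, but that makes complete sense in the context of how the code works.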
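
A rough sketch of the kind of change described above, under the assumption (not confirmed in this answer) that the cost came from repeated heap allocation inside the vector operations. The names process_test and run_with_scratch are hypothetical. Each thread allocates one plain array before the work-sharing loop and reuses it for every iteration, so the loop body itself never allocates.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical kernel: fills a caller-provided scratch array and reduces it.
double process_test(int tst, double* scratch, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        scratch[i] = static_cast<double>(tst) + static_cast<double>(i);
    }
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += scratch[i];
    return sum;
}

std::vector<double> run_with_scratch(int ntest, std::size_t scratch_len) {
    assert(ntest >= 0);
    std::vector<double> results(static_cast<std::size_t>(ntest), 0.0);

    #pragma omp parallel
    {
        // One allocation per thread, made once before the loop and reused,
        // instead of a vector that grows or is copied on every iteration.
        std::unique_ptr<double[]> scratch(new double[scratch_len]);

        #pragma omp for schedule(static)
        for (int tst = 0; tst < ntest; ++tst) {
            results[static_cast<std::size_t>(tst)] =
                process_test(tst, scratch.get(), scratch_len);
        }
    }
    return results;
}
```

For example, run_with_scratch(4, 3) fills each scratch array with tst, tst+1, tst+2 and stores their sum, so element 2 of the result is 9.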

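It seems that, for whatever reason (maybe some static or global variables deep in the STL <vector> implementation?), when loops run through hundreds of thousands of iterations in parallel there is some deep-level locking, which occurs on Linux (Ubuntu 12.04 and CentOS 6.2) but not on OSX.

I'm really intrigued as to why I see this difference. Could it be a difference in how the STL is implemented (the OSX version was compiled with GNU GCC 4.7, as were the Linux ones), or does it have to do with context switching (as suggested by Arne Babenhauserheide)?

In summary, my debugging process was as follows:

Did initial profiling from within R to identify the issue
Ensured there were no static variables acting as shared variables
Profiled with strace -f and ltrace -f, which was really helpful in identifying locking as the culprit
Profiled with valgrind to look for any errors
Tried a variety of combinations of schedule type (auto, guided, static, dynamic) and chunk size
Tried binding threads to specific processors
Avoided false sharing by creating thread-local buffers for values, then implementing a single synchronization event at the end of the for-loop
Removed all the mallocing and freeing from within the parallel region; this didn't help with the issue but did provide a small general speedup
Tried various architectures and OSes; this didn't really help in the end, but did show that this was a Linux vs. OSX issue and not a supercomputer vs. desktop one
Built a version that implements concurrency using a fork() call, splitting the workload between two processes. This halved the time on both OSX and Linux, which was good
Built a data simulator to replicate production data loads
gprof profiling
gdb time-sampling profiling (interrupt and backtrace)
Commented out vector operations
Had this not worked, Arne Babenhauserheide's link looks like it may well contain some crucial material on memory fragmentation issues with OpenMP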
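
The fork() step in the list above could look roughly like the following. This is a sketch under assumptions, not the workaround actually used: heavy_work is a hypothetical placeholder for the loop body, and the pipe-based result transfer is just one simple way to get the child's values back to the parent on a POSIX system.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Hypothetical placeholder for the real per-iteration computation.
double heavy_work(int tst) { return static_cast<double>(tst) * 2.0; }

// Parent computes the first half, a forked child computes the second half
// and streams its values back through a pipe. Returns true on success.
bool run_in_two_processes(int ntest, std::vector<double>& results) {
    assert(ntest >= 0);
    results.assign(static_cast<std::size_t>(ntest), 0.0);
    const int half = ntest / 2;
    const ssize_t item = static_cast<ssize_t>(sizeof(double));

    int fds[2];
    if (pipe(fds) != 0) return false;

    const pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }

    if (pid == 0) {
        // Child: compute the second half and write each value to the pipe.
        close(fds[0]);
        for (int tst = half; tst < ntest; ++tst) {
            const double v = heavy_work(tst);
            if (write(fds[1], &v, sizeof v) != item) _exit(1);
        }
        close(fds[1]);
        _exit(0);
    }

    // Parent: compute the first half, then collect the child's values in order.
    close(fds[1]);
    for (int tst = 0; tst < half; ++tst) {
        results[static_cast<std::size_t>(tst)] = heavy_work(tst);
    }
    bool ok = true;
    for (int tst = half; tst < ntest; ++tst) {
        double v = 0.0;
        if (read(fds[0], &v, sizeof v) != item) { ok = false; break; }
        results[static_cast<std::size_t>(tst)] = v;
    }
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return false;
    return ok && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Because the two processes have separate address spaces, they cannot contend on any lock inside the allocator or the runtime, which is consistent with the clean halving reported above.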

score 4 · Accepted Answer

It's hard to know for sure what is happening without significant profiling, but the performance curve seems indicative of False Sharing...

threads use different objects but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time

Great article on the topic at Dr Dobbs

http://www.drdobbs.com/go-parallel/article/217500206?pgno=1

In particular the fact that the routines are doing a lot of malloc/free could lead to this.

One solution is to use a pool based memory allocator rather than the default allocator so that each thread tends to allocate memory from a different physical address range.
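
To make the pool idea concrete, here is a minimal sketch of a per-thread bump arena. It is an illustration only, not a production allocator: the Arena class and sum_with_arena are hypothetical names, and it assumes each iteration's temporary memory can be thrown away wholesale at the start of the next iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// A tiny bump allocator: one contiguous block, handed out front to back.
class Arena {
public:
    explicit Arena(std::size_t bytes)
        : buf_(new unsigned char[bytes]), cap_(bytes), used_(0) {}

    // Returns nullptr when the block is exhausted. The alignment must be a
    // power of two no larger than alignof(std::max_align_t).
    void* allocate(std::size_t bytes, std::size_t align) {
        assert(align != 0 && (align & (align - 1)) == 0);
        assert(align <= alignof(std::max_align_t));
        const std::size_t start = (used_ + align - 1) / align * align;
        if (start > cap_ || bytes > cap_ - start) return nullptr;
        used_ = start + bytes;
        return buf_.get() + start;
    }

    void reset() { used_ = 0; }               // discard everything at once
    std::size_t used() const { return used_; }

private:
    std::unique_ptr<unsigned char[]> buf_;
    std::size_t cap_;
    std::size_t used_;
};

// Each thread owns its own arena, so temporaries allocated by different
// threads come from different address ranges and never touch a shared lock.
double sum_with_arena(int ntest, std::size_t n) {
    assert(ntest >= 0);
    double total = 0.0;

    #pragma omp parallel reduction(+ : total)
    {
        Arena arena(n * sizeof(double) + alignof(std::max_align_t));

        #pragma omp for schedule(static)
        for (int tst = 0; tst < ntest; ++tst) {
            arena.reset();
            double* tmp = static_cast<double*>(
                arena.allocate(n * sizeof(double), alignof(double)));
            assert(tmp != nullptr);
            for (std::size_t i = 0; i < n; ++i) {
                tmp[i] = static_cast<double>(tst) + static_cast<double>(i);
                total += tmp[i];
            }
        }
    }
    return total;
}
```

A real code base would more likely link against an existing thread-caching allocator than hand-roll one; the sketch only shows why per-thread pools keep threads out of each other's way.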

score 2 · Accepted Answer

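Since the threads don't actually interact, you could simply change the code to use multiprocessing. You would only have message passing at the end, and it would be guaranteed that the workers don't need to synchronize anything.

Here is Python 3.2 code that basically does this (you probably won't want to do it in Python for performance reasons, or you could put the for-loop into a C function and bind it via Cython; you'll see from the code why I show it in Python anyway):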

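from concurrent import futures
# my_cython_module / huge_function: placeholders for the wrapped loop body
from my_cython_module import huge_function

parameters = range(ntest)  # ntest: number of loop iterations, as in the question
with futures.ProcessPoolExecutor(4) as e:  # 4 worker processes
    results = e.map(huge_function, parameters)
    shared_array = list(results)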

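That's it. Increase the number of processes to the number of jobs you can put into the cluster, and let each process just submit and monitor one job, to scale to any number of calls.

Huge functions with no interaction and small input values almost call out for multiprocessing. And once you have that, switching to MPI (with almost unlimited scaling) is not too hard.

On the technical side, AFAIK context switches in Linux are quite expensive (monolithic kernel with a lot of kernel-space memory), while they are much cheaper on OSX or the Hurd (Mach microkernel). That might explain the huge amount of system time you see on Linux but not on OSX.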
