c++ - OpenMP: don't use hyperthreading cores (half `num_threads()` w/ hyperthreading)

Question

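In "Is OpenMP (parallel for) in g++ 4.7 not very efficient? 2.5x at 5x CPU", I determined that the running time of my program varies between 11s and 13s (usually above 12s, sometimes as slow as 13.4s) at around 500% CPU when using the default `#pragma omp parallel for`, and that the OpenMP speedup is only 2.5x at 5x CPU with `g++-4.7 -O3 -fopenmp` on a 4-core, 8-thread Xeon.

I tried using `schedule(static) num_threads(4)` and noticed that my program always completes in 11.5s to 11.7s (always below 12s) at about 320% CPU. In other words, it runs more consistently and uses fewer resources, even if the best run is half a second slower than the rare fast outlier with hyperthreading.

Is there any simple OpenMP way to detect hyperthreading and reduce `num_threads()` to the actual number of CPU cores?

(There is a similar question, "Poor performance due to hyper-threading with OpenMP: how to bind threads to cores", but in my testing I found that merely reducing from 8 to 4 threads already does the job with g++-4.7 on Debian 7 wheezy and a Xeon E3-1240v3, so this question is only about reducing `num_threads()` to the number of cores.)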

score 2 · Accepted Answer

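If you are running under Linux [also assuming an x86 architecture], you can look at /proc/cpuinfo. There are two fields, `cpu cores` and `siblings`. The first is the number of [real] cores and the second is the number of hyperthreads. (For example, on my four-core hyperthreaded machine they are 4 and 8 respectively.)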
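As a rough illustration of the /proc/cpuinfo approach, here is a minimal C++ sketch. The names `CpuTopology`, `parseCpuInfo` and `readCpuTopology` are made up for this example. It assumes Linux on x86 and stops at the first `cpu cores` and `siblings` values it finds. Both fields are reported per physical package, so a multi-socket machine would need them multiplied by the number of packages.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Result of scanning /proc/cpuinfo-style text.
// A value of 0 means the field was not found.
struct CpuTopology {
    int cores = 0;     // "cpu cores": physical cores per package
    int siblings = 0;  // "siblings": logical processors per package
};

// Hypothetical helper: reads "key : value" lines until both fields are seen.
CpuTopology parseCpuInfo(std::istream& in) {
    CpuTopology t;
    std::string line;
    while ((t.cores == 0 || t.siblings == 0) && std::getline(in, line)) {
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;
        // Trim trailing tabs/spaces from the key.
        std::string key = line.substr(0, colon);
        key.erase(key.find_last_not_of(" \t") + 1);
        if (key != "cpu cores" && key != "siblings") continue;
        std::istringstream value(line.substr(colon + 1));
        int n = 0;
        if (value >> n) {
            if (key == "cpu cores") t.cores = n;
            else t.siblings = n;
        }
    }
    return t;
}

// Convenience wrapper for the real file; returns zeros if it cannot be opened.
CpuTopology readCpuTopology() {
    std::ifstream in("/proc/cpuinfo");
    return parseCpuInfo(in);
}
```

The resulting `cores` value could then be passed to the clause, e.g. `#pragma omp parallel for num_threads(t.cores)`, keeping the OpenMP default when the value is 0.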

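Because Linux can detect this [and per the link in Zulan's comment], the information is also available from the x86 `cpuid` instruction.

Either way, there is also an environment variable for this, `OMP_NUM_THREADS`, which may be easier to use together with a launcher/wrapper script.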

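One thing you may wish to consider is that beyond a certain number of threads you can saturate the memory bus. At that point, adding threads [or cores] will not improve performance and may in fact reduce it.

From this question, "Atomically increment two integers with CAS", there is a link to a video talk from CppCon 2015 that comes in two parts: https://www.youtube.com/watch?v=lVBvHbJsg5Y and https://www.youtube.com/watch?v=1obZeHnAwz4

They're about 1.5 hours each but, IMO, well worth it.

In the talk, the speaker [who has done a lot of multithread/multicore optimization] says that, in his experience, the memory bus/system tends to get saturated after about four threads.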
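One way to check whether this applies to a given machine is to time a memory-bound loop at several thread counts. The sketch below is only an illustration: `timeTriad` is a made-up name, the kernel is a simple "triad" update, and `b` and `c` are assumed to be at least as long as `a`. It needs `-fopenmp` to actually run in parallel.

```cpp
#include <chrono>
#include <vector>

// Runs a memory-bound "triad" (a[i] = b[i] + 2*c[i]) with the requested
// number of OpenMP threads and returns the elapsed wall-clock seconds.
// Without -fopenmp the pragma is ignored and the loop runs serially.
double timeTriad(std::vector<double>& a, const std::vector<double>& b,
                 const std::vector<double>& c, int threads) {
    const long n = static_cast<long>(a.size());
    const auto start = std::chrono::steady_clock::now();
    #pragma omp parallel for schedule(static) num_threads(threads)
    for (long i = 0; i < n; ++i) {
        a[i] = b[i] + 2.0 * c[i];
    }
    const auto stop = std::chrono::steady_clock::now();
    (void)threads;  // keeps the parameter "used" when OpenMP is disabled
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling it for thread counts 1 through 8, with arrays much larger than the last-level cache, and comparing the times should show where adding threads stops helping on that particular system.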

score 0 · Accepted Answer

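Hyper-Threading is Intel's implementation of simultaneous multithreading (SMT). Current AMD processors don't implement SMT (the Bulldozer microarchitecture family has something else, which AMD calls cluster-based multithreading, but the Zen microarchitecture is supposed to have SMT). OpenMP has no built-in support for detecting SMT.

If you want a general function to detect Hyper-Threading, you need to support different generations of processors and make sure that the processor is an Intel processor and not AMD. It's best to use a library for this.

But you can create a function using OpenMP that works for many modern Intel processors, as I described here.

The following code counts the number of physical cores on a modern Intel processor (it has worked on every Intel processor I have tried it on). You have to bind the threads for this to work. With GCC you can use `export OMP_PROC_BIND=true`; otherwise you can bind in code (which is what I do).

Note that I am not sure this method is reliable with VirtualBox. With VirtualBox on a 4-core/8-logical-processor CPU, with Windows as host and Linux as guest, and the number of cores for the VM set to 4, this code reports 2 cores, and /proc/cpuinfo shows that two of the cores are actually logical processors.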

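#include <stdio.h>
#ifdef _MSC_VER
#include <intrin.h>   // declares __cpuidex for the Microsoft-compatible branch below
#endif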

//cpuid function defined in instrset_detect.cpp by Agner Fog (2014 GNU General Public License)
//http://www.agner.org/optimize/vectorclass.zip

// Define interface to cpuid instruction.
// input:  eax = functionnumber, ecx = 0
// output: eax = output[0], ebx = output[1], ecx = output[2], edx = output[3]
static inline void cpuid (int output[4], int functionnumber) {
#if defined (_MSC_VER) || defined (__INTEL_COMPILER)       // Microsoft or Intel compiler, intrin.h included

  __cpuidex(output, functionnumber, 0);                  // intrinsic function for CPUID

#elif defined(__GNUC__) || defined(__clang__)              // use inline assembly, Gnu/AT&T syntax

  int a, b, c, d;
  __asm("cpuid" : "=a"(a),"=b"(b),"=c"(c),"=d"(d) : "a"(functionnumber),"c"(0) : );
  output[0] = a;
  output[1] = b;
  output[2] = c;
  output[3] = d;

#else                                                      // unknown platform. try inline assembly with masm/intel syntax

  __asm {
    mov eax, functionnumber
    xor ecx, ecx
    cpuid;
    mov esi, output
    mov [esi],    eax
    mov [esi+4],  ebx
    mov [esi+8],  ecx
    mov [esi+12], edx
  }

#endif
}

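int getNumCores(void) {
  // Assumes an Intel processor with CPUID leaf 11 and threads bound to
  // distinct logical processors (e.g. OMP_PROC_BIND=true).
  int cores = 0;
  #pragma omp parallel reduction(+:cores)
  {
    int regs[4];
    cpuid(regs,11);            // regs[3] = EDX = x2APIC ID of the current logical processor
    if(!(regs[3]&1)) cores++;  // bit 0 clear: count one core (assumes two hardware threads per core)
  }
  return cores;
}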

int main(void) {
  printf("cores %d\n", getNumCores());
}
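
The counting rule inside `getNumCores` can be looked at in isolation. The sketch below is my reading of it, not part of the original answer: `countPhysicalCores` is a made-up name, and the input stands for the EDX values (x2APIC IDs) that each bound thread would read from CPUID leaf 11. It assumes at most two hardware threads per core, so that bit 0 of the ID distinguishes the two siblings.

```cpp
#include <vector>

// Counts entries whose lowest bit is clear. Each entry stands for the x2APIC ID
// (EDX from CPUID leaf 11) observed by one thread pinned to its own logical CPU.
// Assumption: at most two hardware threads per core, so bit 0 is the SMT index.
int countPhysicalCores(const std::vector<unsigned>& x2apicIds) {
    int cores = 0;
    for (unsigned id : x2apicIds) {
        if ((id & 1u) == 0) ++cores;
    }
    return cores;
}
```

With IDs 0 through 7 (four cores, two threads each) this gives 4. If threads are not bound and several of them happen to run on the same logical processor, the same ID is seen more than once and the count is off, which is why the binding step matters.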
