c - Strange multithreading performance

Question

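I'm trying to make sense of some rather disappointing performance results we are getting for our HPC application. I wrote the following benchmark in Visual Studio 2010; it distills the essence of our application (lots of independent, high arithmetic intensity operations):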

#include "stdafx.h"
#include <math.h>
#include <time.h>
#include <Windows.h>
#include <stdio.h>
#include <memory.h>
#include <process.h>

void makework(void *jnk) {
    double tmp = 0;
    for(int j=0; j<10000; j++) {
        for(int i=0; i<1000000; i++) {
            tmp = tmp+(double)i*(double)i;
        }
    }
    *((double *)jnk) = tmp;
    _endthread();
}

void spawnthreads(int num) {
    HANDLE *hThreads = (HANDLE *)malloc(num*sizeof(HANDLE));
    double *junk = (double *)malloc(num*sizeof(double));
    printf("Starting %i threads... ", num);
    for(int i=0; i<num; i++) {
        hThreads[i] = (HANDLE)_beginthread(makework, 0, &junk[i]);
    }
    int start = GetTickCount();
    WaitForMultipleObjects(num, hThreads, TRUE, INFINITE);
    int end = GetTickCount();
    FILE *fp = fopen("makework.log", "a+");
    fprintf(fp, "%i,%.3f\n", num, (double)(end-start)/1000.0);
    fclose(fp);
    printf("Elapsed time: %.3f seconds\n", (double)(end-start)/1000.0);
    free(hThreads);
    free(junk);
}

int _tmain(int argc, _TCHAR* argv[])
{
    for(int i=1; i<=20; i++) {
        spawnthreads(i);
    }
    return 0;
}

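I'm doing exactly the same operations in each thread, so a run should (ideally) take about 11 seconds until I've filled the physical cores, and then perhaps double once I start using the logical hyperthreaded cores. There shouldn't be any cache concerns, since my loop variables and results fit in registers.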
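The expectation stated above can be written down as a toy model. This is only a sketch of that idealized reasoning, not a description of what the hardware does: it assumes threads are spread evenly over the physical cores and that threads sharing a core split it equally with no throughput gain from SMT. The function name ideal_elapsed and its parameters are made up for illustration.

```c
#include <assert.h>

/* Idealized elapsed time for `threads` identical CPU-bound threads.
 * Assumption: threads sharing one physical core split it evenly, so the
 * run takes as long as the most heavily loaded core needs. */
double ideal_elapsed(double single_thread_seconds, int threads, int physical_cores) {
    assert(threads > 0 && physical_cores > 0);
    /* Threads on the busiest core: ceil(threads / physical_cores). */
    int per_core = (threads + physical_cores - 1) / physical_cores;
    return single_thread_seconds * per_core;
}
```

With 12 physical cores and an 11.5 second single-thread time, this model stays at 11.5 seconds through 12 threads and jumps to 23 seconds at 13 threads. The measurements below depart from that curve well before the physical cores are exhausted.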

Here are the results of my experiments on two testbeds, both running Windows Server 2008.

Machine 1: Dual Xeon X5690 @ 3.47 GHz -- 12 physical cores, 24 logical cores, Westmere architecture

Starting 1 threads... Elapsed time: 11.575 seconds
Starting 2 threads... Elapsed time: 11.575 seconds
Starting 3 threads... Elapsed time: 11.591 seconds
Starting 4 threads... Elapsed time: 11.684 seconds
Starting 5 threads... Elapsed time: 11.825 seconds
Starting 6 threads... Elapsed time: 12.324 seconds
Starting 7 threads... Elapsed time: 14.992 seconds
Starting 8 threads... Elapsed time: 15.803 seconds
Starting 9 threads... Elapsed time: 16.520 seconds
Starting 10 threads... Elapsed time: 17.098 seconds
Starting 11 threads... Elapsed time: 17.472 seconds
Starting 12 threads... Elapsed time: 17.519 seconds
Starting 13 threads... Elapsed time: 17.395 seconds
Starting 14 threads... Elapsed time: 17.176 seconds
Starting 15 threads... Elapsed time: 16.973 seconds
Starting 16 threads... Elapsed time: 17.144 seconds
Starting 17 threads... Elapsed time: 17.129 seconds
Starting 18 threads... Elapsed time: 17.581 seconds
Starting 19 threads... Elapsed time: 17.769 seconds
Starting 20 threads... Elapsed time: 18.440 seconds

Machine 2: Dual Xeon E5-2690 @ 2.90 GHz -- 16 physical cores, 32 logical cores, Sandy Bridge architecture

Starting 1 threads... Elapsed time: 10.249 seconds
Starting 2 threads... Elapsed time: 10.562 seconds
Starting 3 threads... Elapsed time: 10.998 seconds
Starting 4 threads... Elapsed time: 11.232 seconds
Starting 5 threads... Elapsed time: 11.497 seconds
Starting 6 threads... Elapsed time: 11.653 seconds
Starting 7 threads... Elapsed time: 11.700 seconds
Starting 8 threads... Elapsed time: 11.888 seconds
Starting 9 threads... Elapsed time: 12.246 seconds
Starting 10 threads... Elapsed time: 12.605 seconds
Starting 11 threads... Elapsed time: 13.026 seconds
Starting 12 threads... Elapsed time: 13.041 seconds
Starting 13 threads... Elapsed time: 13.182 seconds
Starting 14 threads... Elapsed time: 12.885 seconds
Starting 15 threads... Elapsed time: 13.416 seconds
Starting 16 threads... Elapsed time: 13.011 seconds
Starting 17 threads... Elapsed time: 12.949 seconds
Starting 18 threads... Elapsed time: 13.011 seconds
Starting 19 threads... Elapsed time: 13.166 seconds
Starting 20 threads... Elapsed time: 13.182 seconds

Here are the aspects I find puzzling:

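Why does the elapsed time on the Westmere machine stay constant until about 6 cores are in use, then jump suddenly, and then stay basically constant above 10 threads? Is Windows stuffing all the threads onto a single processor before moving on to the second one, so that hyperthreading kicks in nondeterministically once one processor is full?
Why does the elapsed time on the Sandy Bridge machine increase basically linearly with the number of threads until about 12? Given the core counts, twelve doesn't seem like a meaningful number to me.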

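Any thoughts and suggestions on processor counters to measure, or on ways to improve my benchmark, are appreciated. Is this an architecture issue or a Windows issue?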
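One way to read the tables above is in terms of aggregate throughput rather than elapsed time, since every thread does the same fixed amount of work. The helper below is a small sketch for that; the name throughput_speedup is hypothetical and it only restates the arithmetic, it measures nothing by itself.

```c
#include <assert.h>

/* Work completed per second with n threads, relative to one thread.
 * t1 is the elapsed time with 1 thread, tn the elapsed time with n threads;
 * each thread is assumed to perform the same fixed amount of work. */
double throughput_speedup(double t1, int n, double tn) {
    assert(t1 > 0.0 && tn > 0.0 && n > 0);
    return (double)n * t1 / tn;
}
```

For machine 1, throughput_speedup(11.575, 12, 17.519) comes out at roughly 7.9, so 12 threads deliver about 66% of the ideal 12x even though no thread should need to share a physical core.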

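Edit:

As suggested below, the compiler was doing some strange things, so I wrote my own assembly code that does the same thing as above but leaves all FP operations on the FP stack to avoid any memory accesses: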

void makework(void *jnk) {
    register int i, j;
//  register double tmp = 0;
    __asm {
        fldz  // this holds the result on the stack
    }
    for(j=0; j<10000; j++) {
        __asm {
            fldz // push i onto the stack: stack = 0, res
        }
        for(i=0; i<1000000; i++) {
            // tmp += (double)i * (double)i;
            __asm {
                fld st(0)  // stack: i, i, res
                fld st(0)  // stack: i, i, i, res
                fmul       // stack: i*i, i, res
                faddp st(2), st(0) // stack: i, res+i*i
                fld1       // stack: 1, i, res+i*i
                fadd      // stack: i+1, res+i*i
            }
        }
        __asm {
            fstp st(0)   // pop i off the stack leaving only res in st(0)
        }
    }
    __asm {
        mov eax, dword ptr [jnk]
        fstp qword ptr [eax]
    }
//  *((double *)jnk) = tmp;
    _endthread();
}

This assembles to:

013E1002  in          al,dx  
013E1003  fldz  
013E1005  mov         ecx,2710h  
013E100A  lea         ebx,[ebx]  
013E1010  fldz  
013E1012  mov         eax,0F4240h  
013E1017  fld         st(0)  
013E1019  fld         st(0)  
013E101B  fmulp       st(1),st  
013E101D  faddp       st(2),st  
013E101F  fld1  
013E1021  faddp       st(1),st  
013E1023  dec         eax  
013E1024  jne         makework+17h (13E1017h)  
013E1026  fstp        st(0)  
013E1028  dec         ecx  
013E1029  jne         makework+10h (13E1010h)  
013E102B  mov         eax,dword ptr [jnk]  
013E102E  fstp        qword ptr [eax]  
013E1030  pop         ebp  
013E1031  jmp         dword ptr [__imp___endthread (13E20C0h)]

The results for machine 1 above are:

Starting 1 threads... Elapsed time: 12.589 seconds
Starting 2 threads... Elapsed time: 12.574 seconds
Starting 3 threads... Elapsed time: 12.652 seconds
Starting 4 threads... Elapsed time: 12.682 seconds
Starting 5 threads... Elapsed time: 13.011 seconds
Starting 6 threads... Elapsed time: 13.790 seconds
Starting 7 threads... Elapsed time: 16.411 seconds
Starting 8 threads... Elapsed time: 18.003 seconds
Starting 9 threads... Elapsed time: 19.220 seconds
Starting 10 threads... Elapsed time: 20.124 seconds
Starting 11 threads... Elapsed time: 20.764 seconds
Starting 12 threads... Elapsed time: 20.935 seconds
Starting 13 threads... Elapsed time: 20.748 seconds
Starting 14 threads... Elapsed time: 20.717 seconds
Starting 15 threads... Elapsed time: 20.608 seconds
Starting 16 threads... Elapsed time: 20.685 seconds
Starting 17 threads... Elapsed time: 21.107 seconds
Starting 18 threads... Elapsed time: 21.451 seconds
Starting 19 threads... Elapsed time: 22.043 seconds
Starting 20 threads... Elapsed time: 22.745 seconds

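So it is about 9% slower with one thread (perhaps the difference between inc eax versus fld1 and faddp?), and when all the physical cores are filled it is almost twice as slow (which would be expected from hyperthreading). But the puzzling aspect, performance degrading starting at only 6 threads, still remains...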

score 2 · Accepted Answer

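Being totally lame now and answering my own question: it seems to be the scheduler, as @us2012 suggested. I hardcoded the affinity masks to fill up the physical cores first, then switch over to the hyperthreaded cores: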

void spawnthreads(int num) {
    ULONG_PTR masks[] = {  // for my system; YMMV
        0x1, 0x4, 0x10, 0x40, 0x100, 0x400, 0x1000, 0x4000, 0x10000, 0x40000, 
        0x100000, 0x400000, 0x2, 0x8, 0x20, 0x80, 0x200, 0x800, 0x2000, 0x8000};
    HANDLE *hThreads = (HANDLE *)malloc(num*sizeof(HANDLE));
    double *junk = (double *)malloc(num*sizeof(double));
    printf("Starting %i threads... ", num);
    for(int i=0; i<num; i++) {
        hThreads[i] = (HANDLE)_beginthread(makework, 0, &junk[i]);
        SetThreadAffinityMask(hThreads[i], masks[i]);
    }
    int start = GetTickCount();
    WaitForMultipleObjects(num, hThreads, TRUE, INFINITE);
    int end = GetTickCount();
    FILE *fp = fopen("makework.log", "a+");
    fprintf(fp, "%i,%.3f,%f\n", num, (double)(end-start)/1000.0, junk[0]);
    fclose(fp);
    printf("Elapsed time: %.3f seconds\n", (double)(end-start)/1000.0);
    free(hThreads);
    free(junk);
}

and got

Starting 1 threads... Elapsed time: 12.558 seconds
Starting 2 threads... Elapsed time: 12.558 seconds
Starting 3 threads... Elapsed time: 12.589 seconds
Starting 4 threads... Elapsed time: 12.652 seconds
Starting 5 threads... Elapsed time: 12.621 seconds
Starting 6 threads... Elapsed time: 12.777 seconds
Starting 7 threads... Elapsed time: 12.636 seconds
Starting 8 threads... Elapsed time: 12.886 seconds
Starting 9 threads... Elapsed time: 13.057 seconds
Starting 10 threads... Elapsed time: 12.714 seconds
Starting 11 threads... Elapsed time: 12.777 seconds
Starting 12 threads... Elapsed time: 12.668 seconds
Starting 13 threads... Elapsed time: 26.489 seconds
Starting 14 threads... Elapsed time: 26.505 seconds
Starting 15 threads... Elapsed time: 26.505 seconds
Starting 16 threads... Elapsed time: 26.489 seconds
Starting 17 threads... Elapsed time: 26.489 seconds
Starting 18 threads... Elapsed time: 26.676 seconds
Starting 19 threads... Elapsed time: 26.770 seconds
Starting 20 threads... Elapsed time: 26.489 seconds

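which is what's expected. Now the question is which OS settings I can tweak to make the default behavior closer to this, since most of our code is written in MATLAB...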
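The hardcoded table in this answer follows a regular pattern: even bits first, then odd bits. A minimal sketch of generating it is shown below. It assumes, as the table implies for this particular system, that logical processors 2k and 2k+1 are the two hyperthreads of physical core k; that numbering is not guaranteed on other machines, where the topology can be queried with GetLogicalProcessorInformation instead. The function name affinity_mask_for is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Affinity mask for the thread_index-th thread: one thread per physical
 * core first, hyperthread siblings afterwards.
 * Assumption: logical processors 2k and 2k+1 share physical core k. */
uint64_t affinity_mask_for(int thread_index, int physical_cores) {
    assert(physical_cores > 0 && physical_cores <= 32);
    assert(thread_index >= 0 && thread_index < 2 * physical_cores);
    if (thread_index < physical_cores) {
        /* First pass: even-numbered logical processors 0, 2, 4, ... */
        return (uint64_t)1 << (2 * thread_index);
    }
    /* Second pass: odd-numbered siblings 1, 3, 5, ... */
    return (uint64_t)1 << (2 * (thread_index - physical_cores) + 1);
}
```

With physical_cores set to 12 this reproduces the 20 entries of the masks array above. The result would be cast to the mask type used there before being passed to SetThreadAffinityMask; in a 32-bit build that type is only 32 bits wide, so bits 32 and above cannot be expressed.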

score 1 · Accepted Answer

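(Possible explanation) Have you checked the background activity on these machines? The OS may not be able to fully dedicate all of its cores to you. On your machine 1, the significant increase starts once you occupy more than half of the cores. Your threads may be competing for resources with something else.

You may also want to check for limits and domain policies on your machine/account that prevent you from grabbing all the available resources.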

score 1 · Accepted Answer

On my laptop with 2 physical cores and 4 logical cores I get:

Starting 1 threads... Elapsed time: 11.638 seconds
Starting 2 threads... Elapsed time: 12.418 seconds
Starting 3 threads... Elapsed time: 13.556 seconds
Starting 4 threads... Elapsed time: 14.929 seconds
Starting 5 threads... Elapsed time: 20.811 seconds
Starting 6 threads... Elapsed time: 22.776 seconds
Starting 7 threads... Elapsed time: 27.160 seconds
Starting 8 threads... Elapsed time: 30.249 seconds

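This shows degradation as soon as we have more than 1 thread.

I suspect the reason is that the function makework() is doing memory accesses. You can see this in Visual Studio 2010 by setting a breakpoint on the first line of _tmain(). When you hit the breakpoint, press Ctrl-Alt-D to see the disassembler window. Anywhere you see a register name in brackets (e.g. [esp]), it is a memory access. The level 1 memory cache bandwidth on the CPU is getting saturated. You can test this theory with a modified makework():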

void makework(void *jnk) {
    double tmp = 0;
    volatile double *p;
    int i;
    int j;
    p=(double*)jnk;

    for(j=0; j<100000000; j++) {
        for(i=0; i<100; i++) {
            tmp = tmp+(double)i*(double)i;
        }
        *p=tmp;
    }
    *p = tmp;
    _endthread();
}

It does the same number of calculations, but with an additional memory write thrown in every 100 iterations. On my laptop the results are:

Starting 1 threads... Elapsed time: 11.684 seconds
Starting 2 threads... Elapsed time: 13.760 seconds
Starting 3 threads... Elapsed time: 14.445 seconds
Starting 4 threads... Elapsed time: 17.519 seconds
Starting 5 threads... Elapsed time: 23.369 seconds
Starting 6 threads... Elapsed time: 25.491 seconds
Starting 7 threads... Elapsed time: 30.155 seconds
Starting 8 threads... Elapsed time: 34.460 seconds

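This shows the effect that memory access can have on the results. I tried various VS2010 compiler settings to see if I could get makework() to have no memory accesses, but no luck. To really study raw CPU core performance versus the number of active threads, I suspect we'd have to code makework() in assembler.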

score 0 · Accepted Answer

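OK, so now we've ruled out the memory saturation theory (although: x87? Ouch, don't expect much performance there. Try switching to SSE/AVX if you can live with what they provide). Core scaling should still make sense, so let's look at the CPU models you use:

Can you verify that these are the correct models?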

Intel® Xeon® Processor X5690 (12M Cache, 3.46 GHz, 6.40 GT/s Intel® QPI)

http://ark.intel.com/products/52576

Intel® Xeon® Processor E5-2690 (20M Cache, 2.90 GHz, 8.00 GT/s Intel® QPI)

http://ark.intel.com/products/64596/

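If so, then the first one indeed has 6 physical cores (12 logical) per socket and the second has 8 physical cores (16 logical) per socket. Come to think of it, I don't think you could get higher core counts on a single socket in those generations, so this makes sense, and it fits your numbers nicely.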

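Edit: On a multi-socket system, the OS may prefer a single socket while logical cores are still available on it. It may depend on the exact version, but for Win Server 2008 there's an interesting comment here: http://blogs.technet.com/b/matthts/archive/2012/10/14/windows-server-sockets-logical-processors-symmetric-multi-threading.aspx

Quote: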

When the OS boots it starts with socket 1 and enumerates all logical processors:

    on socket 1 it enumerates logical processors 1-20
    on socket 2 it enumerates logical processors 21-40
    on socket 3 it enumerates logical processors 41-60
    on socket 4 it would see 61-64

If this is the order in which your OS wakes up threads, SMT may kick in before it spills over to the second socket.
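On the x87 remark at the top of this answer: below is a rough sketch of what the benchmark's inner loop could look like with SSE2 intrinsics. It is an illustration only, for x86/x86-64 compilers that provide emmintrin.h. The function name sum_squares_sse2 is made up, and nothing here claims that switching instruction sets changes the scaling behavior discussed in this thread.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics, x86/x86-64 only */

/* Sum of i*i for i in [0, n), processing two values of i per iteration.
 * n must be even. All state lives in XMM registers inside the loop. */
double sum_squares_sse2(int n) {
    assert(n >= 0 && n % 2 == 0);
    __m128d acc = _mm_setzero_pd();          /* two partial sums */
    __m128d idx = _mm_set_pd(1.0, 0.0);      /* lanes hold i and i+1 */
    const __m128d step = _mm_set1_pd(2.0);
    for (int i = 0; i < n; i += 2) {
        acc = _mm_add_pd(acc, _mm_mul_pd(idx, idx));
        idx = _mm_add_pd(idx, step);
    }
    double lanes[2];
    _mm_storeu_pd(lanes, acc);               /* single store after the loop */
    return lanes[0] + lanes[1];
}
```

Because the two lanes accumulate separately, the additions happen in a different order than in the scalar loop, so for large n the rounded result may differ slightly from the original makework().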
