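I'm trying to make sense of some rather disappointing performance results we're getting for our HPC applications. I wrote the following benchmark in Visual Studio 2010 that distills the essence of our applications (lots of independent, high-arithmetic-intensity operations):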
#include "stdafx.h"
#include <math.h>
#include <time.h>
#include <Windows.h>
#include <stdio.h>
#include <memory.h>
#include <process.h>

// Worker: pure register arithmetic, then one store of the result.
void makework(void *jnk) {
    double tmp = 0;
    for(int j=0; j<10000; j++) {
        for(int i=0; i<1000000; i++) {
            tmp = tmp+(double)i*(double)i;
        }
    }
    *((double *)jnk) = tmp;
    _endthread();
}

// Start num worker threads, wait for all of them, and log the elapsed time.
void spawnthreads(int num) {
    HANDLE *hThreads = (HANDLE *)malloc(num*sizeof(HANDLE));
    double *junk = (double *)malloc(num*sizeof(double));
    printf("Starting %i threads... ", num);
    for(int i=0; i<num; i++) {
        hThreads[i] = (HANDLE)_beginthread(makework, 0, &junk[i]);
    }
    int start = GetTickCount();
    WaitForMultipleObjects(num, hThreads, TRUE, INFINITE);
    int end = GetTickCount();
    FILE *fp = fopen("makework.log", "a+");
    fprintf(fp, "%i,%.3f\n", num, (double)(end-start)/1000.0);
    fclose(fp);
    printf("Elapsed time: %.3f seconds\n", (double)(end-start)/1000.0);
    free(hThreads);
    free(junk);
}

int _tmain(int argc, _TCHAR* argv[])
{
    // Run the benchmark with 1 through 20 threads.
    for(int i=1; i<=20; i++) {
        spawnthreads(i);
    }
    return 0;
}
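I'm doing exactly the same operations in each thread, so it should (ideally) take a constant ~11 seconds until I fill up the physical cores, and then perhaps double once I start using the logical hyperthreaded cores. There shouldn't be any cache concerns, since my loop variables and results fit in registers.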
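To make that expectation concrete, here is a minimal sketch of the idealized model, so the measured times can be compared against it. The function names are hypothetical, and the "doubles past the physical core count" rule is only my assumption about hyperthreading, not a measured property of either machine.

```cpp
#include <cstdio>

// Idealized expectation (an assumption, not a measurement): elapsed time is
// constant while every thread has its own physical core, and doubles once
// threads have to share physical cores through hyperthreading.
double ideal_elapsed(int threads, int physical_cores, double single_thread_seconds) {
    if (threads <= physical_cores) {
        return single_thread_seconds;
    }
    return 2.0 * single_thread_seconds;
}

// Prints the idealized time for 1..20 threads, matching the benchmark's range.
void print_ideal_table(int physical_cores, double single_thread_seconds) {
    for (int n = 1; n <= 20; ++n) {
        std::printf("%2d threads: ideal %.3f s\n", n,
                    ideal_elapsed(n, physical_cores, single_thread_seconds));
    }
}
```

Calling `print_ideal_table(12, 11.0)` would give the curve I expected for a 12-core machine: flat through 12 threads, then a single step up.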
Here are the results of my experiments on two testbeds, both running Windows Server 2008.
Machine 1: Dual Xeon X5690 @ 3.47 GHz -- 12 physical cores, 24 logical cores, Westmere architecture
Starting 1 threads... Elapsed time: 11.575 seconds
Starting 2 threads... Elapsed time: 11.575 seconds
Starting 3 threads... Elapsed time: 11.591 seconds
Starting 4 threads... Elapsed time: 11.684 seconds
Starting 5 threads... Elapsed time: 11.825 seconds
Starting 6 threads... Elapsed time: 12.324 seconds
Starting 7 threads... Elapsed time: 14.992 seconds
Starting 8 threads... Elapsed time: 15.803 seconds
Starting 9 threads... Elapsed time: 16.520 seconds
Starting 10 threads... Elapsed time: 17.098 seconds
Starting 11 threads... Elapsed time: 17.472 seconds
Starting 12 threads... Elapsed time: 17.519 seconds
Starting 13 threads... Elapsed time: 17.395 seconds
Starting 14 threads... Elapsed time: 17.176 seconds
Starting 15 threads... Elapsed time: 16.973 seconds
Starting 16 threads... Elapsed time: 17.144 seconds
Starting 17 threads... Elapsed time: 17.129 seconds
Starting 18 threads... Elapsed time: 17.581 seconds
Starting 19 threads... Elapsed time: 17.769 seconds
Starting 20 threads... Elapsed time: 18.440 seconds
Machine 2: Dual Xeon E5-2690 @ 2.90 GHz -- 16 physical cores, 32 logical cores, Sandy Bridge architecture
Starting 1 threads... Elapsed time: 10.249 seconds
Starting 2 threads... Elapsed time: 10.562 seconds
Starting 3 threads... Elapsed time: 10.998 seconds
Starting 4 threads... Elapsed time: 11.232 seconds
Starting 5 threads... Elapsed time: 11.497 seconds
Starting 6 threads... Elapsed time: 11.653 seconds
Starting 7 threads... Elapsed time: 11.700 seconds
Starting 8 threads... Elapsed time: 11.888 seconds
Starting 9 threads... Elapsed time: 12.246 seconds
Starting 10 threads... Elapsed time: 12.605 seconds
Starting 11 threads... Elapsed time: 13.026 seconds
Starting 12 threads... Elapsed time: 13.041 seconds
Starting 13 threads... Elapsed time: 13.182 seconds
Starting 14 threads... Elapsed time: 12.885 seconds
Starting 15 threads... Elapsed time: 13.416 seconds
Starting 16 threads... Elapsed time: 13.011 seconds
Starting 17 threads... Elapsed time: 12.949 seconds
Starting 18 threads... Elapsed time: 13.011 seconds
Starting 19 threads... Elapsed time: 13.166 seconds
Starting 20 threads... Elapsed time: 13.182 seconds
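Here are the aspects I find puzzling:
Why does the elapsed time on the Westmere machine stay constant until about 6 threads, then jump suddenly, and then stay basically constant above 10 threads? Is Windows stuffing all the threads onto a single processor before moving on to the second one, so that hyperthreading kicks in nondeterministically once one processor is full?
Why does the elapsed time on the Sandy Bridge machine increase basically linearly with the number of threads until about 12? Twelve doesn't seem like a meaningful number to me, given the number of cores.
Any thoughts and suggestions on processor counters to measure, or on ways to improve my benchmark, are appreciated. Is this an architecture issue or a Windows issue?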
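One possible tightening of the benchmark itself, sketched under assumptions: in the code above the clock is read only after all threads have been created, and `GetTickCount` typically has a resolution of roughly 10 to 16 ms. The version below starts a monotonic clock before the first thread is launched and joins the threads directly. It assumes a compiler with C++11 `<thread>` and `<chrono>`, which is newer than the Visual Studio 2010 toolchain used for the measurements, and the names `sum_of_squares` and `time_threads` are hypothetical. It is not expected to explain the 6-thread jump on its own.

```cpp
#include <chrono>
#include <thread>
#include <vector>

// Same arithmetic as makework(), with the loop bounds passed in so that a
// quick smoke test can use small values. Returns the accumulated sum.
double sum_of_squares(int outer, int inner) {
    double tmp = 0;
    for (int j = 0; j < outer; j++) {
        for (int i = 0; i < inner; i++) {
            tmp = tmp + (double)i * (double)i;
        }
    }
    return tmp;
}

// Hypothetical helper: runs `num` worker threads and returns elapsed seconds.
// The clock starts before the first thread is created and stops after the
// last join, so thread start-up cost is included in the measurement.
double time_threads(int num, int outer, int inner) {
    std::vector<double> results(num, 0.0);  // one result slot per thread
    std::vector<std::thread> workers;
    workers.reserve(num);

    auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < num; t++) {
        double *slot = &results[t];
        workers.emplace_back([slot, outer, inner] {
            *slot = sum_of_squares(outer, inner);
        });
    }
    for (auto &w : workers) {
        w.join();
    }
    auto end = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(end - start).count();
}
```

With the original loop bounds this would be called as `time_threads(n, 10000, 1000000)` for `n` from 1 to 20. An optimizing compiler may simplify the inner loop, so the absolute times are not guaranteed to match the numbers above.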
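Edit:
As suggested below, the compiler was doing some strange things, so I wrote my own assembly code, which does the same thing as above but keeps all FP operations on the FP stack to avoid any memory accesses: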
void makework(void *jnk) {
    register int i, j;
    // register double tmp = 0;
    __asm {
        fldz                    // this holds the result on the stack
    }
    for(j=0; j<10000; j++) {
        __asm {
            fldz                // push i onto the stack: stack = 0, res
        }
        for(i=0; i<1000000; i++) {
            // tmp += (double)i * (double)i;
            __asm {
                fld st(0)           // stack: i, i, res
                fld st(0)           // stack: i, i, i, res
                fmul                // stack: i*i, i, res
                faddp st(2), st(0)  // stack: i, res+i*i
                fld1                // stack: 1, i, res+i*i
                fadd                // stack: i+1, res+i*i
            }
        }
        __asm {
            fstp st(0)          // pop i off the stack leaving only res in st(0)
        }
    }
    __asm {
        mov eax, dword ptr [jnk]
        fstp qword ptr [eax]
    }
    // *((double *)jnk) = tmp;
    _endthread();
}
This assembles to:
013E1002 in al,dx
013E1003 fldz
013E1005 mov ecx,2710h
013E100A lea ebx,[ebx]
013E1010 fldz
013E1012 mov eax,0F4240h
013E1017 fld st(0)
013E1019 fld st(0)
013E101B fmulp st(1),st
013E101D faddp st(2),st
013E101F fld1
013E1021 faddp st(1),st
013E1023 dec eax
013E1024 jne makework+17h (13E1017h)
013E1026 fstp st(0)
013E1028 dec ecx
013E1029 jne makework+10h (13E1010h)
013E102B mov eax,dword ptr [jnk]
013E102E fstp qword ptr [eax]
013E1030 pop ebp
013E1031 jmp dword ptr [__imp___endthread (13E20C0h)]
The results on Machine 1 above are:
Starting 1 threads... Elapsed time: 12.589 seconds
Starting 2 threads... Elapsed time: 12.574 seconds
Starting 3 threads... Elapsed time: 12.652 seconds
Starting 4 threads... Elapsed time: 12.682 seconds
Starting 5 threads... Elapsed time: 13.011 seconds
Starting 6 threads... Elapsed time: 13.790 seconds
Starting 7 threads... Elapsed time: 16.411 seconds
Starting 8 threads... Elapsed time: 18.003 seconds
Starting 9 threads... Elapsed time: 19.220 seconds
Starting 10 threads... Elapsed time: 20.124 seconds
Starting 11 threads... Elapsed time: 20.764 seconds
Starting 12 threads... Elapsed time: 20.935 seconds
Starting 13 threads... Elapsed time: 20.748 seconds
Starting 14 threads... Elapsed time: 20.717 seconds
Starting 15 threads... Elapsed time: 20.608 seconds
Starting 16 threads... Elapsed time: 20.685 seconds
Starting 17 threads... Elapsed time: 21.107 seconds
Starting 18 threads... Elapsed time: 21.451 seconds
Starting 19 threads... Elapsed time: 22.043 seconds
Starting 20 threads... Elapsed time: 22.745 seconds
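So it's about 9% slower with one thread (the difference between inc eax and fld1 plus faddp, perhaps?), and when all the physical cores are filled it's almost twice as slow (which would be expected from hyperthreading). But the puzzling aspect of performance degrading starting at only 6 threads still remains...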
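For reference, the ratios behind those two statements can be recomputed from the logged times. This is only arithmetic on the numbers already posted above; the helper names are hypothetical.

```cpp
#include <cstdio>

// Ratio of a measured time to a baseline time (both in seconds).
double slowdown_ratio(double baseline_seconds, double measured_seconds) {
    return measured_seconds / baseline_seconds;
}

// Prints the two ratios discussed above, using times copied from the
// Machine 1 logs in this post.
void print_ratios() {
    const double c_one_thread   = 11.575;  // compiled C version, 1 thread
    const double asm_one_thread = 12.589;  // inline-asm version, 1 thread
    const double asm_twelve     = 20.935;  // inline-asm version, 12 threads

    std::printf("asm vs C, 1 thread: %.1f%% slower\n",
                (slowdown_ratio(c_one_thread, asm_one_thread) - 1.0) * 100.0);
    std::printf("asm, 12 threads vs 1 thread: %.2fx\n",
                slowdown_ratio(asm_one_thread, asm_twelve));
}
```

By this arithmetic the single-thread penalty is about 8.8%, and the 12-thread run takes about 1.66 times as long as the single-thread run.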