I am running a benchmark that looks like this:
CHECK( context = clCreateContext(props, 1, &device, NULL, NULL, &_err); );
CHECK( queue = clCreateCommandQueue(context, device, 0, &_err); );

#define SYNC() clFinish(queue)
#define LAUNCH(glob, loc, kernel) OCL(clEnqueueNDRangeKernel(queue, kernel, 2,\
                                                             NULL, glob, loc,\
                                                             0, NULL, NULL))

/* Build program, set arguments over here */

START;
for (int i = 0; i < iter; i++) {
    LAUNCH(global, local, plus_kernel);
}
SYNC();
STOP;
printf("Time taken (plus) : %lf\n", uSec / iter);

START;
for (int i = 0; i < iter; i++) {
    LAUNCH(global, local, minus_kernel);
}
SYNC();
STOP;
printf("Time taken (minus): %lf\n", uSec / iter);

START;
for (int i = 0; i < iter; i++) {
    LAUNCH(global, local, plus_kernel);
    LAUNCH(global, local, minus_kernel);
}
SYNC();
STOP;
printf("Time taken (both) : %lf\n", uSec / iter);
The results look strange:
Time taken (plus) : 31.450000
Time taken (minus): 28.120000
Time taken (both) : 2256.380000
START and STOP are just macros that start and stop a timer. Here are the relevant macros.
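The actual macro definitions are not reproduced in this post, so the following is only a rough sketch of what such timer macros might look like. It assumes a POSIX clock_gettime timer and a global uSec holding the elapsed microseconds; t_start, t_stop, and time_busy_loop are hypothetical names used for illustration, and a CPU busy loop stands in for the kernel launches.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict C modes */
#endif
#include <assert.h>
#include <time.h>

/* Hypothetical stand-ins for the START/STOP macros; the real ones may differ. */
static struct timespec t_start, t_stop;
static double uSec;  /* elapsed wall-clock time in microseconds */

#define START clock_gettime(CLOCK_MONOTONIC, &t_start)
#define STOP                                                  \
    do {                                                      \
        clock_gettime(CLOCK_MONOTONIC, &t_stop);              \
        uSec = (t_stop.tv_sec - t_start.tv_sec) * 1e6 +       \
               (t_stop.tv_nsec - t_start.tv_nsec) / 1e3;      \
    } while (0)

/* Times `iter` rounds of a dummy CPU workload and returns the average
 * microseconds per round, mirroring the `uSec / iter` in the benchmark. */
static double time_busy_loop(int iter)
{
    volatile unsigned long sink = 0;  /* volatile keeps the loop from being optimized away */
    assert(iter > 0);

    START;
    for (int i = 0; i < iter; i++) {
        for (int j = 0; j < 1000; j++) {
            sink += (unsigned long)j;
        }
    }
    STOP;

    return uSec / iter;
}
```

With macros of this shape, the measured interval covers everything between START and STOP, which in the benchmark includes both the enqueue calls and the wait inside SYNC().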
I am not sure why enqueueing the two kernels together slows them down (and only on AMD GPUs)!
Edit: I am using a Radeon 7970.
Edit: Both kernels operate on independent memory. Here is the system information as well.
OS: Ubuntu 11.10
fglrx info:
display: :0 screen: 0
OpenGL vendor string: Advanced Micro Devices, Inc.
OpenGL renderer string: AMD Radeon HD 7900 Series
OpenGL version string: 4.2.11762 Compatibility Profile Context