I am running this code (full code here: http://codepad.org/5OJBLqIA) to time repeated calls to a daxpy function, with and without flushing the operands from cache beforehand:
#include <cstdio>
#include <cmath>
#include <boost/math/distributions/students_t.hpp>
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics.hpp>

using boost::math::students_t;
using namespace boost::accumulators;

// fill, wall_time, daxpy and flush_array are defined in the full listing
#define KB 1024

int main()
{
    int cache_size = 32*KB;
    double alpha = 42.5;
    int operand_size = cache_size/(sizeof(double)*2);
    double* X = new double[operand_size];
    double* Y = new double[operand_size];

    // 95% confidence interval
    double max_risk = 0.05;
    // Interval half width
    double w;
    int n_iterations = 50000;
    students_t dist(n_iterations-1);
    double T = boost::math::quantile(complement(dist,max_risk/2));

    accumulator_set<double, stats<tag::mean,tag::variance> > unflushed_acc;
    for(int i = 0; i < n_iterations; ++i)
    {
        fill(X,operand_size);
        fill(Y,operand_size);
        double seconds = wall_time();
        daxpy(alpha,X,Y,operand_size);
        seconds = wall_time() - seconds;
        unflushed_acc(seconds);
    }
    w = T*sqrt(variance(unflushed_acc))/sqrt(count(unflushed_acc));
    printf("Without flush: time=%g +/- %g ns\n",mean(unflushed_acc)*1e9,w*1e9);

    // Using the clflush instruction
    // We need to put the operands back in cache first
    accumulator_set<double, stats<tag::mean,tag::variance> > clflush_acc;
    for(int i = 0; i < n_iterations; ++i)
    {
        fill(X,operand_size);
        fill(Y,operand_size);
        flush_array(X,operand_size);
        flush_array(Y,operand_size);
        double seconds = wall_time();
        daxpy(alpha,X,Y,operand_size);
        seconds = wall_time() - seconds;
        clflush_acc(seconds);
    }
    w = T*sqrt(variance(clflush_acc))/sqrt(count(clflush_acc));
    printf("With clflush: time=%g +/- %g ns\n",mean(clflush_acc)*1e9,w*1e9);

    return 0;
}
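For completeness, the helpers used above are defined in the linked codepad listing. What follows is only a rough sketch of what they are assumed to look like; the 64-byte cache-line stride, the gettimeofday-based timer, and the fill values are simplifications, not necessarily the exact definitions from the full code:

#include <emmintrin.h>   // _mm_clflush, _mm_mfence
#include <sys/time.h>    // gettimeofday

// y <- alpha*x + y (plain reference implementation)
void daxpy(double alpha, const double* x, double* y, int n)
{
    for(int i = 0; i < n; ++i)
        y[i] += alpha*x[i];
}

// Write arbitrary values so both arrays are touched (and cached)
void fill(double* a, int n)
{
    for(int i = 0; i < n; ++i)
        a[i] = 1.0 + i;
}

// Evict every cache line occupied by the array, assuming 64-byte lines
void flush_array(const double* a, int n)
{
    const char* p = (const char*)a;
    const char* end = (const char*)(a + n);
    for(; p < end; p += 64)
        _mm_clflush(p);
    _mm_mfence(); // make sure the flushes complete before timing starts
}

// Wall-clock time in seconds
double wall_time()
{
    timeval t;
    gettimeofday(&t, 0);
    return t.tv_sec + t.tv_usec*1e-6;
}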
This code measures the mean time per call and its uncertainty over the given number of iterations. Averaging over many iterations successfully minimizes the variance caused by contention for memory access from other cores (discussed in my previous question), but the means obtained this way vary wildly between separate invocations of the same executable:
$ ./variance
Without flush: time=3107.76 +/- 0.268198 ns
With clflush: time=5862.33 +/- 9.84313 ns
$ ./variance
Without flush: time=3105.71 +/- 0.237823 ns
With clflush: time=7802.66 +/- 12.3163 ns
These runs were performed one right after the other. Why does the timing for the flushed case (but not the unflushed one) vary so much between processes, yet so little within a given process?
Addendum
The code is run under Mac OS X 10.8 on an Intel Xeon 5650.