我正在尝试优化最接近的配对蛮力算法并将其与非缓存程序进行比较,但我被卡住了。
主要问题是,当我使用 for 循环缓存计算时,性能会变得更差cached time = 2 x non-cached time
。如果我改变块的大小,它就像什么都没有发生......我使用一个结构点数组 P 作为 x,y 坐标
这是非缓存代码:
void compare_points_BF(int *N, point *P){
int i, j, p1, p2;
float dx, dy, distance=0, min_dist=inf();
long calc = 0;
for (i=0; i<(*N-1) ; i++){
for (j=i+1; j<*N; j++){
dx = P[i].x - P[j].x;
dy = P[i].y - P[j].y;
//calculate distance of current points
distance = (dx * dx) + (dy * dy);
calc++;
if (distance < min_dist){
min_dist = distance;
p1 = i;
p2 = j;
}
}
}
printf("%ld calculations\t", calc);
}
这是缓存的版本:
void compare_points_BF(int *N, int *B, point *P){
int i, j, ib, jb, p1, p2, num_blocks = (*N + (*B-1)) / (*B);
float dist=0, min_dist=inf();
long calc=0;
//break array data in N/B blocks
for (i = 0; i < num_blocks; i++){
for (j = i; j < num_blocks; j++){
for (jb = j * (*B); jb < min((j+1) * (*B), *N); jb++){
//avoid double comparisons that occur when i block = j block
for (i == j ? (ib = jb + 1) : (ib = i*(*B)); ib < min((i+1) * (*B), *N); ib++){
calc++;
//calculate distance of current points
if((dist = (P[ib].x - P[jb].x) * (P[ib].x - P[jb].x) +
(P[ib].y - P[jb].y) * (P[ib].y - P[jb].y)) < min_dist){
min_dist = dist;
p1 = ib;
p2 = jb;
}
}
}
}
}
printf("%ld calculations\t", calc);
}
例如非缓存程序的输出是:
N = 8192 Run time: 0,080 sec
N = 16384 Run time: 0,330 sec
N = 32768 Run time: 1,280 sec
N = 65.536 Run time: 5,290 sec
N = 131.072 Run time: 21,290sec
N = 262.144 Run time: 81,880sec
N = 524.288 Run time: 327,460 sec
但是通过缓存的示例,我得到:
33550336 calculations Block_size = 128 N = 8192 Run time: 0.402 sec
33550336 calculations Block_size = 256 N = 8192 Run time: 0.383 sec
33550336 calculations Block_size = 512 N = 8192 Run time: 0.384 sec
33550336 calculations Block_size = 1024 N = 8192 Run time: 0.381 sec
33550336 calculations Block_size = 2048 N = 8192 Run time: 0.398 sec
33550336 calculations Block_size = 4096 N = 8192 Run time: 0.400 sec
33550336 calculations Block_size = 8192 N = 8192 Run time: 0.401 sec
33550336 calculations Block_size = 16384 N = 8192 Run time: 0.383 sec
134209536 calculations Block_size = 128 N = 16384 Run time: 1.579 sec
134209536 calculations Block_size = 256 N = 16384 Run time: 1.610 sec
134209536 calculations Block_size = 512 N = 16384 Run time: 1.630 sec
134209536 calculations Block_size = 1024 N = 16384 Run time: 1.530 sec
134209536 calculations Block_size = 2048 N = 16384 Run time: 1.537 sec
134209536 calculations Block_size = 4096 N = 16384 Run time: 1.562 sec
134209536 calculations Block_size = 8192 N = 16384 Run time: 1.520 sec
134209536 calculations Block_size = 16384 N = 16384 Run time: 1.626 sec
536854528 calculations Block_size = 128 N = 32768 Run time: 6.170 sec
536854528 calculations Block_size = 256 N = 32768 Run time: 6.207 sec
536854528 calculations Block_size = 512 N = 32768 Run time: 6.219 sec
536854528 calculations Block_size = 1024 N = 32768 Run time: 6.131 sec
536854528 calculations Block_size = 2048 N = 32768 Run time: 6.077 sec
536854528 calculations Block_size = 4096 N = 32768 Run time: 6.216 sec
536854528 calculations Block_size = 8192 N = 32768 Run time: 6.130 sec
536854528 calculations Block_size = 16384 N = 32768 Run time: 6.181 sec
我一遍又一遍地检查了代码,它似乎是正确的。我在这里想念什么?编译器是否优化代码以实现比我想要实现的更好的缓存使用?提前致谢!