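As a follow-up to this topic, in order to calculate the memory miss latency, I wrote the following code using _mm_clflush, __rdtsc and _mm_lfence (based on the code from this question/answer).
As you can see in the code, I first load the array into the cache. Then I flush one element, so its cache line is evicted from all cache levels. I put in _mm_lfence in order to preserve the instruction order under -O3.
Next, I use the time stamp counter to calculate the latency of reading array[0]. As you can see, there are three instructions between the two time stamps: two lfence and one read. So I have to subtract the lfence overhead. The last section of the code calculates that overhead.
At the end of the code, the overhead and the miss latency are printed. However, the result is not valid!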
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>
int main()
{
    int array[ 100 ];
    for ( int i = 0; i < 100; i++ )
        array[ i ] = i;
    uint64_t t1, t2, ov, diff;
    _mm_lfence();
    _mm_clflush( &array[ 0 ] );
    _mm_lfence();
    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    int tmp = array[ 0 ];
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    diff = t2 - t1;
    printf( "diff is %lu\n", diff );
    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    ov = t2 - t1;
    printf( "lfence overhead is %lu\n", ov );
    printf( "miss cycles is %lu\n", diff-ov );
    return 0;
}
However, the output is not valid:
$ gcc -O3 -o flush1 flush1.c
$ taskset -c 0 ./flush1
diff is 161
lfence overhead is 147
miss cycles is 14
$ taskset -c 0 ./flush1
diff is 161
lfence overhead is 154
miss cycles is 7
$ taskset -c 0 ./flush1
diff is 147
lfence overhead is 154
miss cycles is 18446744073709551609
Any idea?
Next, I tried the clock_gettime function to calculate the miss latency, as follows:
_mm_lfence();
_mm_clflush( &array[ 0 ] );
_mm_lfence();
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
_mm_lfence();
int tmp = array[ 0 ];
_mm_lfence();
clock_gettime(CLOCK_MONOTONIC, &end);
diff = 1000000000 * (end.tv_sec - start.tv_sec) + end.tv_nsec - start.tv_nsec;
printf("miss elapsed time = %lu nanoseconds\n", diff);
The output is miss elapsed time = 578 nanoseconds. Is that reliable?
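Update 1:
Thanks to Peter and Hadi, to summarize the responses so far, I found out that
1- Unused variables are omitted in the optimization phase, and that was the reason for the weird values I saw in the output. Thanks to Peter's reply, there are some ways to fix that.
2- clock_gettime is not suitable for such a resolution; that function is meant for larger delays.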
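As a workaround, I tried to bring the array into the cache and then flush all elements, to be sure that every element is evicted from all cache levels. Then I measured the latency of array[0] and then of array[20]. Since each element is 4 bytes, the distance is 80 bytes. I expected to get two cache misses. However, the latency of array[20] is similar to a cache hit. A safe guess is that the cache line is not 80 bytes. So maybe array[20] is prefetched by the hardware. Not always, though; I also see some odd results again: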
for ( int i = 0; i < 100; i++ ) {
    _mm_lfence();
    _mm_clflush( &array[ i ] );
    _mm_lfence();
}
_mm_lfence();
t1 = __rdtsc();
_mm_lfence();
int tmp = array[ 0 ];
_mm_lfence();
t2 = __rdtsc();
_mm_lfence();
diff1 = t2 - t1;
printf( "tmp is %d\ndiff1 is %lu\n", tmp, diff1 );
_mm_lfence();
t1 = __rdtsc();
tmp = array[ 20 ];
_mm_lfence();
t2 = __rdtsc();
_mm_lfence();
diff2 = t2 - t1;
printf( "tmp is %d\ndiff2 is %lu\n", tmp, diff2 );
_mm_lfence();
t1 = __rdtsc();
_mm_lfence();
_mm_lfence();
t2 = __rdtsc();
_mm_lfence();
ov = t2 - t1;
printf( "lfence overhead is %lu\n", ov );
printf( "TSC1 is %lu\n", diff1-ov );
printf( "TSC2 is %lu\n", diff2-ov );
The output is:
$ ./flush1
tmp is 0
diff1 is 371
tmp is 20
diff2 is 280
lfence overhead is 147
TSC1 is 224
TSC2 is 133
$ ./flush1
tmp is 0
diff1 is 399
tmp is 20
diff2 is 280
lfence overhead is 154
TSC1 is 245
TSC2 is 126
$ ./flush1
tmp is 0
diff1 is 392
tmp is 20
diff2 is 840
lfence overhead is 147
TSC1 is 245
TSC2 is 693
$ ./flush1
tmp is 0
diff1 is 364
tmp is 20
diff2 is 140
lfence overhead is 154
TSC1 is 210
TSC2 is 18446744073709551602
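The statement "the hardware prefetcher brings in other blocks" is correct about 80% of the time. So what is going on? Is there a more accurate statement?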