I am trying to understand how hardware caches work by writing and running a test program:
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINE_SIZE 64
#define L1_WAYS 8
#define L1_SETS 64
#define L1_LINES 512

// 32K memory for filling in L1 cache
uint8_t data[L1_LINES*LINE_SIZE];

int main()
{
    volatile uint8_t *addr;
    register uint64_t i;
    unsigned int junk = 0;  // __rdtscp() takes an unsigned int *
    register uint64_t t1, t2;

    printf("data: %p\n", (void *)data);
    //_mm_clflush(data);
    printf("accessing 16 bytes in a cache line:\n");
    for (i = 0; i < 16; i++) {
        t1 = __rdtscp(&junk);
        addr = &data[i];
        junk = *addr;
        t2 = __rdtscp(&junk) - t1;
        // i and t2 are uint64_t, so cast them to match the format specifiers
        printf("i = %2llu, cycles: %llu\n", (unsigned long long)i, (unsigned long long)t2);
    }
    return 0;
}
I ran the code with and without _mm_clflush, and the results show that the first memory access is faster with _mm_clflush.
With _mm_clflush:
$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i = 0, cycles: 280
i = 1, cycles: 84
i = 2, cycles: 91
i = 3, cycles: 77
i = 4, cycles: 91
Without _mm_clflush:
$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i = 0, cycles: 3899
i = 1, cycles: 91
i = 2, cycles: 105
i = 3, cycles: 77
i = 4, cycles: 84
It makes no sense that flushing the cache line would make the access faster, yet that is what happens. Can anyone explain why? Thanks
----------------Further experiment-------------------
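Let's assume the 3899 cycles are caused by a TLB miss. To check my understanding of cache hits and misses, I slightly modified the code to compare the memory access time for an L1 cache hit with that for an L1 cache miss.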
This time, the code steps by the cache line size (64 bytes) to reach the next memory address on each access.
    *data = 1;
    _mm_clflush(data);
    printf("accessing 16 bytes in a cache line:\n");
    for (i = 0; i < 16; i++) {
        t1 = __rdtscp(&junk);
        addr = &data[i];
        junk = *addr;
        t2 = __rdtscp(&junk) - t1;
        printf("i = %2llu, cycles: %llu\n", (unsigned long long)i, (unsigned long long)t2);
    }
    // Invalidate and flush the cache line that contains data[0] from all levels of the cache hierarchy.
    _mm_clflush(data);
    printf("accessing 16 bytes in different cache lines:\n");
    for (i = 0; i < 16; i++) {
        t1 = __rdtscp(&junk);
        addr = &data[i*LINE_SIZE];
        junk = *addr;
        t2 = __rdtscp(&junk) - t1;
        printf("i = %2llu, cycles: %llu\n", (unsigned long long)i, (unsigned long long)t2);
    }
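My computer has an 8-way set-associative L1 data cache with 64 sets, 32 KB in total. If I access memory every 64 bytes, every access should be a cache miss. But it seems that many of the cache lines are already cached: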
$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i = 0, cycles: 273
i = 1, cycles: 70
i = 2, cycles: 70
i = 3, cycles: 70
i = 4, cycles: 70
i = 5, cycles: 70
i = 6, cycles: 70
i = 7, cycles: 70
i = 8, cycles: 70
i = 9, cycles: 70
i = 10, cycles: 77
i = 11, cycles: 70
i = 12, cycles: 70
i = 13, cycles: 70
i = 14, cycles: 70
i = 15, cycles: 140
accessing 16 bytes in different cache lines:
i = 0, cycles: 301
i = 1, cycles: 133
i = 2, cycles: 70
i = 3, cycles: 70
i = 4, cycles: 147
i = 5, cycles: 56
i = 6, cycles: 70
i = 7, cycles: 63
i = 8, cycles: 70
i = 9, cycles: 63
i = 10, cycles: 70
i = 11, cycles: 112
i = 12, cycles: 147
i = 13, cycles: 119
i = 14, cycles: 56
i = 15, cycles: 105
Is this caused by prefetching? Or is there something wrong with my understanding? Thanks