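I have disabled the hardware prefetcher on my systems (a Core 2 Duo and a Core i7). I followed this linked question to do it: How do I programmatically disable hardware prefetching?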

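In addition, I turned off gcc optimization by compiling with -O0. With the hardware prefetcher disabled, I access consecutive cache sets (by accessing array indices that map to consecutive sets in the cache), but I still get the same results as when hardware prefetching is enabled.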

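As I understand it, when the H/W prefetcher is enabled and sees a stride pattern, it prefetches two consecutive cache lines (128 bytes) from the higher-level cache or main memory into the lower-level cache. So when an access misses on a cache line, that line is loaded from the higher-level cache, and the next cache line is preloaded as well by the hardware prefetcher. We therefore see a longer access time for the first cache line, which comes from the higher-level cache, but a shorter access time for the next line, because the hardware prefetcher has already brought it into the L1 cache.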

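Now, if the H/W prefetcher is disabled, then even though a stride pattern is present, the prefetcher does not load the next cache line from the higher-level cache while the preceding adjacent line is being accessed. The next cache line therefore misses and has to be loaded from the next cache level, so its access time is expected to be longer.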

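In practice, however, even after disabling the H/W prefetcher I do not see a longer access time for the consecutive cache line, which suggests that the H/W prefetcher is not actually disabled on my machine.

Am I right?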

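There is also the L2 streamer (adjacent cache line) prefetcher, which is disabled by default (bit 19 in the MSR).

How can I check whether the hardware prefetcher is really disabled? Is there any way to verify it?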

Here is my code:

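#include <stdio.h>
#include <time.h>

/* Compile with gcc -O0 so the timed loads are not optimized away
   (older glibc may also need -lrt for clock_gettime). */
int main(void)
{
    /* Not initialized here, so the timed loads are the first accesses
       to these lines (the values read are indeterminate). */
    int cacheArray[10000];
    volatile int temp;
    int i, block = 12;
    unsigned long long t1, t2, total;
    struct timespec tim1, tim2;

    for (i = 0; i < 5; i++)
    {
        /* Time one load from cache line `block` (16 ints = 64 bytes). */
        clock_gettime(CLOCK_REALTIME, &tim1);
        temp = cacheArray[block * 16];
        clock_gettime(CLOCK_REALTIME, &tim2);

        /* 64-bit arithmetic so tv_sec * 1e9 cannot overflow on 32-bit builds. */
        t1 = tim1.tv_sec * 1000000000ULL + tim1.tv_nsec;
        t2 = tim2.tv_sec * 1000000000ULL + tim2.tv_nsec;
        total = t2 - t1;
        printf("Accessing %d th block took %llu nanosec \n", block, total);

        /* Time the adjacent line, which a prefetcher would already have fetched. */
        block = block + 1;
        clock_gettime(CLOCK_REALTIME, &tim1);
        temp = cacheArray[block * 16];
        clock_gettime(CLOCK_REALTIME, &tim2);

        t1 = tim1.tv_sec * 1000000000ULL + tim1.tv_nsec;
        t2 = tim2.tv_sec * 1000000000ULL + tim2.tv_nsec;
        total = t2 - t1;
        printf("Accessing %d th block took %llu nanosec \n", block, total);

        /* Jump 20 lines ahead for the next pair. */
        block = block + 20;
    }

    return 0;
}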

Here is my sample output:

Accessing 12 th block took 137 nanosec 
Accessing 13 th block took 54 nanosec 
Accessing 33 th block took 39 nanosec 
Accessing 34 th block took 37 nanosec 
Accessing 54 th block took 687 nanosec 
Accessing 55 th block took 93 nanosec 
Accessing 75 th block took 108 nanosec 
Accessing 76 th block took 107 nanosec 
Accessing 96 th block took 109 nanosec 
Accessing 97 th block took 106 nanosec 

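I expected the access time for the consecutive cache line/block to be the same or higher. Why is the next cache block/line being loaded into the cache even though the hardware prefetcher is disabled? In theory, the next cache line should not be brought into the cache ahead of time when it has not been accessed.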

Any suggestions or links would be greatly appreciated. Thanks in advance.

1 Answer

Updated program that gives the expected results after disabling the hardware prefetcher

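Here I access the same element at index = i many times and average over those accesses to get the mean access time at that index. This way I get the expected results for all indices i*16 and (i+1)*16. Since the hardware prefetcher is disabled, I should see a longer access time for both cache line i and cache line (i+1), and my results show this.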

Note: the cache block size is 64 B and I am using an int array. Since an int takes 4 bytes, index*16 and (index+1)*16 fall in different, consecutive cache lines.

#include <stdio.h>
#include <stdint.h>

/* Read the x86 time-stamp counter; the value is in TSC ticks, not nanoseconds. */
static inline uint64_t rdtsc(void)
{
    uint32_t a, d;
    __asm__ volatile ("rdtsc" : "=a" (a), "=d" (d));
    return a | ((uint64_t)d << 32);
}

int main(void)
{
    volatile uint64_t start, end, total;

    /* Not initialized here; the values read are indeterminate and unused. */
    int cacheArray[10000];
    volatile int temp;
    int i, j, index;

    /* Accumulated ticks per block: first line of each pair, then its neighbour. */
    unsigned long long access_time1[100] = {0};
    unsigned long long access_time2[100] = {0};

    for (j = 0; j < 10000; j++)
    {
        for (i = 10; i < 100; i += 20)
        {
            index = i;

            /* Time one load from cache line `index` (16 ints = 64 bytes). */
            start = rdtsc();
            temp = cacheArray[index * 16];
            end = rdtsc();

            total = end - start;
            access_time1[index] += total;

            /* Time the adjacent cache line. */
            index = index + 1;

            start = rdtsc();
            temp = cacheArray[index * 16];
            end = rdtsc();

            total = end - start;
            access_time2[index] += total;
        }
    }

    /* Print the average over the 10000 repetitions. The label says "nanosec"
       to match the output shown below, but the values are TSC ticks. */
    for (i = 10; i < 100; i += 20)
    {
        printf("Accessing %d th block took %llu nanosec \n", i, access_time1[i] / 10000);
        printf("Accessing %d th block took %llu nanosec \n\n", i + 1, access_time2[i + 1] / 10000);
    }

    return 0;
}

Accessing 10 th block took 57 nanosec 
Accessing 11 th block took 63 nanosec 

Accessing 30 th block took 62 nanosec 
Accessing 31 th block took 66 nanosec 

Accessing 50 th block took 59 nanosec 
Accessing 51 th block took 62 nanosec 

Accessing 70 th block took 62 nanosec 
Accessing 71 th block took 65 nanosec 

Accessing 90 th block took 66 nanosec 
Accessing 91 th block took 71 nanosec 
answered 2013-10-17T00:05:53.303