x86 - Understanding the performance and behavior of the clwb instruction

Question

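I am trying to understand the read/write performance of the clwb instruction and to test how it varies when the cache line has been written versus when I have only read it. I expect the time taken in the write case to be higher than in the read case. To test this, here is a small code snippet that I am running on an Intel Xeon CPU (Skylake), using non-volatile memory (NVM) as the store for the reads and writes: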

/* nvm_alloc allocates memory on NVM */
uint64_t *array = (uint64_t *) nvm_alloc(pool, 512);
uint64_t *p = &array[0];
/* separated p & q by the size of write unit in Optane (256B) */
uint64_t *q = &array[32];

uint64_t time_1 = 0;
uint64_t time_2 = 0;
uint64_t start;

volatile uint64_t x;
for(int i = 0; i < 1000000; i++)
{
        /* issues an mfence instruction */
        mfence();
        /* this is for the read case, bring p into cache */
        /* commented read case */
        //x = *p;
        /* this is for the write case, update cacheline containing p */
        *p = *p + 1;
        *q = *q + 1;
        /* rdtscp here: it waits for all earlier instructions to execute before reading the TSC */
        start = rdtscp();
        /* issue clwb on cacheline containing p */
        clwb(p);
        time_1 += rdtsc() - start;

        start = rdtsc();
        clwb(q);
        time_2 += rdtsc() - start;
}

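Since clwb does not explicitly evict the cache line, the next iteration of the read case may be served from the cache itself. In the write case, the cache line is modified in every iteration and clwb is then issued to write it back. However, the time taken in the write case is almost equal to that of the read case, which I am unable to understand. Shouldn't the time for the write case include the time to write the dirty cache line back to memory (or to the memory controller)?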
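One way to probe this is to repeat the measurement with a store fence between clwb and the second timestamp. Intel documents clwb as ordered only by store-fencing operations, so without a fence the timed region may cover little more than issuing the instruction. The following is only a sketch under stated assumptions, not the code from the question: `cpu_has_clwb`, `clwb` and `time_clwb` are hypothetical helpers written for illustration, the `clwb` wrapper uses GCC/Clang inline assembly on x86-64, and the caller supplies the buffer, so ordinary DRAM can stand in for `nvm_alloc`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <cpuid.h>      /* __get_cpuid_count (GCC/Clang) */
#include <x86intrin.h>  /* __rdtscp, _mm_mfence, _mm_sfence */

/* CPUID.(EAX=7,ECX=0):EBX bit 24 reports CLWB support. */
static int cpu_has_clwb(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx >> 24) & 1;
}

/* Hypothetical stand-in for the clwb() helper used in the question. */
static inline void clwb(volatile void *addr)
{
    __asm__ volatile("clwb %0" : "+m"(*(volatile char *)addr));
}

/*
 * Returns the total TSC ticks spent in the timed region over `iters`
 * iterations.
 *   dirty != 0: modify the line before each clwb (write case)
 *   dirty == 0: only read the line before each clwb (read case)
 *   fence != 0: issue sfence after clwb, before the second timestamp
 */
static uint64_t time_clwb(volatile uint64_t *p, int iters, int dirty, int fence)
{
    uint64_t total = 0;
    unsigned int aux;

    assert(iters > 0);
    for (int i = 0; i < iters; i++) {
        _mm_mfence();
        if (dirty) {
            *p = *p + 1;          /* make the cache line dirty */
        } else {
            uint64_t tmp = *p;    /* bring the line into the cache */
            (void)tmp;
        }
        uint64_t start = __rdtscp(&aux);
        clwb(p);
        if (fence)
            _mm_sfence();         /* store fence orders the preceding clwb */
        total += __rdtscp(&aux) - start;
    }
    return total;
}
```

After checking `cpu_has_clwb()`, comparing `time_clwb(p, n, 1, 0)` with `time_clwb(p, n, 1, 1)`, and each of those with its `dirty == 0` counterpart, separates the cost of issuing clwb from the cost of waiting on the store fence. What the fenced numbers look like on Optane is something the experiment has to show; no result is implied here.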
