benchmarking - How to use rdtscp correctly?

Question

Following the Intel white paper "How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures", I use the following code:

static inline uint64_t bench_start(void)
{
    unsigned cycles_low, cycles_high;
    asm volatile("CPUID\n\t"
        "RDTSCP\n\t"
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        : "=r" (cycles_high), "=r" (cycles_low)
        ::"%rax", "%rbx", "%rcx", "%rdx");

    return (uint64_t) cycles_high << 32 | cycles_low;
}

static inline uint64_t bench_end(void)
{
     unsigned cycles_low, cycles_high;
     asm volatile("RDTSCP\n\t"
         "mov %%edx, %0\n\t"
         "mov %%eax, %1\n\t"
         "CPUID\n\t"
         : "=r" (cycles_high), "=r" (cycles_low)
         ::"%rax", "%rbx", "%rcx", "%rdx");
     return (uint64_t) cycles_high << 32 | cycles_low;
}

In practice, however, I have also seen people use the following code:

static inline uint64_t bench_start(void)
{
   unsigned cycles_low, cycles_high;
   /* RDTSCP also writes IA32_TSC_AUX into ECX, so rcx is listed as clobbered. */
   asm volatile("RDTSCP\n\t"
                : "=d" (cycles_high), "=a" (cycles_low)
                :: "%rcx");
   return (uint64_t) cycles_high << 32 | cycles_low;
}

static inline uint64_t bench_end(void)
{
   unsigned cycles_low, cycles_high;
   /* Same read as bench_start: no CPUID before or after RDTSCP. */
   asm volatile("RDTSCP\n\t"
                : "=d" (cycles_high), "=a" (cycles_low)
                :: "%rcx");
   return (uint64_t) cycles_high << 32 | cycles_low;
}

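As you know, RDTSCP is only pseudo-serializing: it waits for earlier instructions to execute, but later instructions may begin before the counter is read. So why do some people use the second version? I can think of two possible reasons: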

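Perhaps, in most cases, RDTSCP alone is enough to guarantee fully in-order execution?
Perhaps they simply want to avoid the cost of CPUID to make the measurement cheaper?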
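
For comparison, a third variant that often comes up is to fence RDTSCP with LFENCE instead of CPUID. The sketch below is only an illustration and is not taken from the white paper. It assumes GCC or Clang on x86-64 (for x86intrin.h), and the names tsc_start, tsc_stop, and time_once are hypothetical. It also assumes LFENCE blocks later instructions until it completes, which Intel documents for its CPUs and which on AMD depends on a model-specific register setting.

```c
#include <stdint.h>
#include <x86intrin.h>   /* GCC/Clang: __rdtscp, _mm_lfence */

/* Hypothetical helper: read the TSC at the start of a timed region. */
static inline uint64_t tsc_start(void)
{
    unsigned aux;                 /* receives IA32_TSC_AUX, unused here */
    _mm_lfence();                 /* let earlier instructions finish first */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();                 /* keep the timed code from starting before the read */
    return t;
}

/* Hypothetical helper: read the TSC at the end of a timed region. */
static inline uint64_t tsc_stop(void)
{
    unsigned aux;
    uint64_t t = __rdtscp(&aux);  /* waits for earlier instructions to execute */
    _mm_lfence();                 /* keep later code from starting before the read */
    return t;
}

/* Time a single call of fn, in TSC ticks (not necessarily core clock cycles). */
static uint64_t time_once(void (*fn)(void))
{
    uint64_t t0 = tsc_start();
    fn();
    uint64_t t1 = tsc_stop();
    return t1 - t0;
}
```

Calling time_once(some_function) many times and taking the minimum or median is the usual way to reduce noise from interrupts and frequency changes. Whether this is an acceptable replacement for the CPUID version is exactly what the question asks.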
