c++ - 如何在 GCC x86 中使用 RDTSC 计算时钟周期？

Question

使用 Visual Studio，我可以从处理器读取时钟周期计数，如下所示。我如何对 GCC 做同样的事情？

#ifdef _MSC_VER             // Compiler: Microsoft Visual Studio

    #ifdef _M_IX86                      // Processor: x86

        inline uint64_t clockCycleCount()
        {
            uint64_t c;
            __asm {
                cpuid       // serialize processor
                rdtsc       // read time stamp counter
                mov dword ptr [c + 0], eax
                mov dword ptr [c + 4], edx
            }
            return c;
        }

    #elif defined(_M_X64)               // Processor: x64

        extern "C" unsigned __int64 __rdtsc();
        #pragma intrinsic(__rdtsc)
        inline uint64_t clockCycleCount()
        {
            return __rdtsc();
        }

    #endif

#endif

score 29 · Accepted Answer

其他答案有效，但您可以通过使用 GCC 的__rdtsc内在函数来避免内联汇编，包括x86intrin.h.

它定义在gcc/config/i386/ia32intrin.h：

/* rdtsc */
extern __inline unsigned long long
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
__rdtsc (void)
{
  return __builtin_ia32_rdtsc ();
}

score 28 · Accepted Answer

在 Linux 的最新版本中，gettimeofday 将包含纳秒计时。

如果你真的想调用 RDTSC 你可以使用下面的内联汇编：

http://www.mcs.anl.gov/~kazutomo/rdtsc.html

#if defined(__i386__)

static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}

#elif defined(__x86_64__)

static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

#endif

score 18 · Accepted Answer

更新：在更规范的问题上重新发布并更新了此答案。一旦我们整理出将哪个问题用作关闭所有类似rdtsc问题的重复目标，我可能会在某个时候删除它。

您不需要也不应该为此使用内联 asm。没有任何好处；编译器内置了rdtscand rdtscp，并且（至少现在）__rdtsc如果您包含正确的标头，则它们都定义了一个内在函数。 https://gcc.gnu.org/wiki/DontUseInlineAsm

不幸的是，MSVC 对于非 SIMD 内部函数使用哪个标头不同意其他所有人。（英特尔的 intriniscs 指南 #include <immintrin.h>对此进行了说明，但使用 gcc 和 clang，非 SIMD 内在函数主要位于x86intrin.h.）

#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
unsigned long long readTSC() {
    // _mm_lfence();  // optionally wait for earlier insns to retire before reading the clock
    return __rdtsc();
    // _mm_lfence();  // optionally block later instructions until rdtsc retires
}

使用所有 4 个主要编译器进行编译：gcc/clang/ICC/MSVC，用于 32 位或 64 位。在 Godbolt 编译器资源管理器上查看结果。

有关lfence用于提高可重复性的更多信息rdtsc，请参阅@HadiBrais 对 clflush 的回答，以通过 C 函数使缓存行无效。

另请参阅LFENCE 是否在 AMD 处理器上进行序列化？（TL：DR 是的，启用 Spectre 缓解，否则内核会保留相关的 MSR 未设置。）

`rdtsc`计算参考周期，而不是 CPU 核心时钟周期

无论涡轮/省电如何，它都以固定频率计数，因此如果您想要按时钟进行 uops-per-clock 分析，请使用性能计数器。 rdtsc与挂钟时间完全相关（系统时钟调整除外，因此基本上是steady_clock）。它以 CPU 的额定频率滴答作响，即标榜的标签频率。

如果您将其用于微基准测试，请先包含一个预热期，以确保您的 CPU 在开始计时之前已经处于最大时钟速度。或者更好的是，如果您的定时区域足够长，您可以附加一个perf stat -p PID. 不过，您通常仍希望在微基准测试期间避免 CPU 频率偏移。

也不保证所有内核的 TSC 都是同步的。因此，如果您的线程迁移到之间的另一个 CPU 内核__rdtsc()，可能会有额外的倾斜。（不过，大多数操作系统都尝试同步所有内核的 TSC。）如果您rdtsc直接使用，您可能希望将您的程序或线程固定到一个内核，例如taskset -c 0 ./myprogram在 Linux 上。

使用内在函数的 asm 有多好？

它至少和内联汇编一样好。

它的非内联版本为 x86-64 编译 MSVC，如下所示：

unsigned __int64 readTSC(void) PROC                             ; readTSC
    rdtsc
    shl     rdx, 32                             ; 00000020H
    or      rax, rdx
    ret     0
  ; return in RAX

对于在中返回 64 位整数的 32 位调用约定edx:eax，它只是rdtsc/ ret。没关系，你总是希望它内联。

在使用它两次并减去时间间隔的测试调用者中：

uint64_t time_something() {
    uint64_t start = readTSC();
    // even when empty, back-to-back __rdtsc() don't optimize away
    return readTSC() - start;
}

所有 4 个编译器都编写了非常相似的代码。这是 GCC 的 32 位输出：

# gcc8.2 -O3 -m32
time_something():
    push    ebx               # save a call-preserved reg: 32-bit only has 3 scratch regs
    rdtsc
    mov     ecx, eax
    mov     ebx, edx          # start in ebx:ecx
      # timed region (empty)

    rdtsc
    sub     eax, ecx
    sbb     edx, ebx          # edx:eax -= ebx:ecx

    pop     ebx
    ret                       # return value in edx:eax

这是 MSVC 的 x86-64 输出（应用了名称分解）。gcc/clang/ICC 都发出相同的代码。

# MSVC 19  2017  -Ox
unsigned __int64 time_something(void) PROC                            ; time_something
    rdtsc
    shl     rdx, 32                  ; high <<= 32
    or      rax, rdx
    mov     rcx, rax                 ; missed optimization: lea rcx, [rdx+rax]
                                     ; rcx = start
     ;; timed region (empty)

    rdtsc
    shl     rdx, 32
    or      rax, rdx                 ; rax = end

    sub     rax, rcx                 ; end -= start
    ret     0
unsigned __int64 time_something(void) ENDP                            ; time_something

所有 4 个编译器都使用or+mov而不是lea将低半部分和高半部分组合到不同的寄存器中。我猜这是他们未能优化的固定序列。

但是你自己在 inline asm 中编写它也好不到哪里去。如果您的时间间隔如此之短以至于您只保留 32 位结果，那么您将剥夺编译器忽略 EDX 中结果的高 32 位的机会。或者如果编译器决定将开始时间存储到内存中，它可以只使用两个 32 位存储而不是 shift/或/mov。如果 1 个额外的 uop 作为计时的一部分让您感到困扰，您最好用纯 asm 编写整个微基准测试。

score 5 · Accepted Answer

在带有的 Linux 上gcc，我使用以下内容：

/* define this somewhere */
#ifdef __i386
__inline__ uint64_t rdtsc() {
  uint64_t x;
  __asm__ volatile ("rdtsc" : "=A" (x));
  return x;
}
#elif __amd64
__inline__ uint64_t rdtsc() {
  uint64_t a, d;
  __asm__ volatile ("rdtsc" : "=a" (a), "=d" (d));
  return (d<<32) | a;
}
#endif

/* now, in your function, do the following */
uint64_t t;
t = rdtsc();
// ... the stuff that you want to time ...
t = rdtsc() - t;
// t now contains the number of cycles elapsed

c++ - 如何在 GCC x86 中使用 RDTSC 计算时钟周期？

4 回答 4

rdtsc计算参考周期，而不是 CPU 核心时钟周期

使用内在函数的 asm 有多好？

Related

Reference

`rdtsc`计算参考周期，而不是 CPU 核心时钟周期