c - Using rdmsr/rdpmc for branch prediction accuracy

Question

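I am trying to understand how the branch prediction unit works in a CPU.

I have used papi and also Linux perf-events, but neither of them gives accurate results (at least in my case).

This is my code: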

void func(int* arr, int sequence_len){
  for(int i = 0; i < sequence_len; i++){
      // region starts
      if(arr[i]){
          do_sth();
      }
      // region ends
  }
}

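My array consists of 0s and 1s. It holds a pattern of length sequence_len. For example, if the length is 8, the pattern is something like 0 1 0 1 0 0 1 1.

Trial 1:

I am trying to understand how the CPU predicts those branches. So I used papi and set up a performance counter for mispredicted branches (I know that it also counts indirect branches).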

int func(){
  papi_read(r1);
  for(){
    //... same as above
  }
  papi_read(r2);
  return r2-r1;
}

int main(){
   init_papi();
   for(int i = 0; i < 10; i++)
     res[i] = func();

   print(res[i]);
}

The output I see is (for a sequence length of 200):

100 #iter1
40  #iter2
10  #iter3
3
0
0
#...

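So, at first, the CPU predicts the sequence blindly and is right only about half of the time. In the following iterations it predicts better and better. After a few iterations, the CPU predicts the sequence perfectly.

Trial 2

I would like to see at which array index the CPU mispredicts.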

int* func(){
  int* results;
  for(){
    papi_read(r1);
    if(arr[i])
        do_sth();   
    papi_read(r2);
    res[i] = r2-r1;
  }
  return res;
}

int main(){
   init_papi();
   for(int i = 0; i < 10; i++)
     res[i] = func();

   print(res[i]);
}

Expected result:

#1st iteration, 0 means no mispred, 1 means mispred
1 0 0 1 1 0 0 0 1 1 0... # total of 200 results
Mispred: 100/200
#2nd iteration
0 0 0 0 1 0 0 0 1 0 0... # total of 200 results
Mispred: 40/200 # it learned from previous iteration
#3rd iteration
0 0 0 0 0 0 0 0 1 0 0... # total of 200 results
Mispred: 10/200 # continues to learn
#...

Received result:

#1st iteration
1 0 0 1 1 0 0 0 1 1 0... # total of 200 results
Mispred: 100/200
#2nd iteration
1 0 0 0 1 1 0 1 0 0 0... # total of 200 results
Mispred: 100/200 # it DID NOT learn from previous iteration
#3rd iteration
0 1 0 1 0 1 0 1 1 0 0... # total of 200 results
Mispred: 100/200 # NO LEARNING
#...

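My observation

When I measure mispredictions outside of the for loop, I can see that the CPU learns from its mispredictions. However, when I try to measure the mispredictions of a single branch instruction, either the CPU cannot learn or I am measuring it wrongly.

My explanation

I give 200 as the sequence length. The CPU has one small branch predictor, like the 2-3 bit saturating counters in Intel CPUs, and one large global branch predictor. When I measure outside of the loop, I introduce less noise into the measurement. By less noise, I mean fewer papi calls.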
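To make the "small predictor" part concrete, here is a minimal sketch of a single 2-bit saturating counter. It is a textbook model, not a description of what any particular Intel or AMD core implements, and `count_mispredictions` is a hypothetical helper name. On its own such a counter only tracks the recent bias of a branch, so it cannot learn an irregular 200-entry pattern; any learning seen across iterations has to come from a history-based predictor.

```c
#include <stddef.h>

/* Hypothetical helper: simulate one 2-bit saturating counter.
 * States 0-1 predict "not taken", states 2-3 predict "taken".
 * *state is updated in place so training carries over between calls. */
static int count_mispredictions(const int *outcomes, size_t n, int *state)
{
    int mispredicted = 0;

    for (size_t i = 0; i < n; i++) {
        int predicted_taken = (*state >= 2);
        int taken = (outcomes[i] != 0);

        if (predicted_taken != taken)
            mispredicted++;

        /* Move one step toward the actual outcome, saturating at 0 and 3. */
        if (taken && *state < 3)
            (*state)++;
        else if (!taken && *state > 0)
            (*state)--;
    }
    return mispredicted;
}
```

Fed an always-taken branch, the counter mispredicts only while it warms up. Fed a strictly alternating 0 1 0 1 ... sequence, it never stops mispredicting one direction, no matter how often the sequence is replayed.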

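Think about it: measurement outside of the loop

The global history is: papi_start, branch_outcome1, branch_outcome2, branch_outcome3, ..., papi_end, papi_start (2nd loop of main iteration), branch_outcome1, ...

So the branch predictor somehow finds the pattern within the same branch.

However, if I try to measure a single branch instruction, the global history is: papi_start, branch_outcome1, papi_end, papi_start, branch_outcome2, papi_end, ...

So I am introducing more and more branches into the global history. I assume the global history cannot hold many branch entries, and therefore it cannot find any correlation/pattern for the if statement (branch) I care about.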
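This hypothesis can be illustrated with a toy model. The sketch below is not how a real predictor is built. It assumes an 8-bit global history that directly indexes a table of 2-bit counters, it models the probe code as a fixed number of always-taken branches, and it lets only the measured branch train the table. `last_pass_mispredictions`, `HISTORY_BITS` and the `noise` parameter are all made up for the illustration.

```c
#include <stddef.h>
#include <string.h>

#define HISTORY_BITS 8                      /* assumed history length */
#define TABLE_SIZE   (1u << HISTORY_BITS)

/* Toy model: 2-bit counters indexed only by the last HISTORY_BITS outcomes.
 * After every measured branch, `noise` always-taken branches are shifted
 * into the history, standing in for branches inside the measurement code.
 * Returns the mispredictions of the measured branch in the final pass. */
static int last_pass_mispredictions(const int *pattern, size_t len,
                                    int passes, int noise)
{
    unsigned char table[TABLE_SIZE];
    unsigned history = 0;
    int mispredicted = 0;

    memset(table, 1, sizeof table);         /* start weakly not-taken */
    for (int pass = 0; pass < passes; pass++) {
        mispredicted = 0;                   /* keep only the last pass */
        for (size_t i = 0; i < len; i++) {
            unsigned char *ctr = &table[history];
            int taken = (pattern[i] != 0);

            if ((*ctr >= 2) != taken)
                mispredicted++;
            if (taken && *ctr < 3)
                (*ctr)++;
            else if (!taken && *ctr > 0)
                (*ctr)--;

            history = ((history << 1) | (unsigned)taken) & (TABLE_SIZE - 1);
            for (int k = 0; k < noise; k++) /* pollution from the probes */
                history = ((history << 1) | 1u) & (TABLE_SIZE - 1);
        }
    }
    return mispredicted;
}
```

With the 8-entry pattern 0 1 0 1 0 0 1 1 and `noise = 0`, the history holds a full period and the last pass has no mispredictions. With `noise = 3`, only two useful outcomes fit in the history, which is not enough to disambiguate the pattern, so the misprediction count never reaches zero.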

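As a result

I need to measure the prediction outcome of a single branch. I know that the CPU can learn the 200-entry pattern if I don't introduce too many papi calls. I have looked at the papi calls and I have seen lots of for loops and if conditions in them.

That is why I need a better measurement. I have tried Linux perf-event, but it makes ioctl calls, which are system calls, so I am polluting the global history with system calls, and therefore it is not a good measurement either.

I have read about the rdpmc and rdmsr instructions, and I assume that since they are only instructions, I will not pollute the global history and I can measure a single branch instruction at a time.

However, I have no clue how to do that. I have an AMD 3600 CPU. These are the links I found online, but I couldn't figure out how to use them. Besides that, am I missing something?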

Intel rdpmc

AMD Performance manual

score 5 · Accepted Answer

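You have assumed that the PAPI and/or perf_events code has a relatively light footprint. This is incorrect. If you change the performance counter event to something like "Instructions Retired" or "CPU Cycles Not Halted", you will be able to see how much overhead this operation carries in your software environment. The details will depend on your OS version, but I expect the overhead to be in the hundreds of instructions/thousands of cycles because of the kernel crossing required to read the counters in perf_events (which is used by PAPI). The code path will certainly include its own branches.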
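As a rough way to check this on your own system, the sketch below counts instructions retired between two back-to-back read() calls on a perf_events counter. The helper name `read_overhead_instructions` and its `include_kernel` parameter are made up for illustration, and counting kernel-mode instructions may be refused depending on the perf_event_paranoid setting, in which case the helper returns -1.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: instructions retired between two consecutive read()
 * calls on the same counter, or -1 if the counter cannot be opened or read. */
static long long read_overhead_instructions(int include_kernel)
{
    struct perf_event_attr attr;
    uint64_t before = 0, after = 0;
    long long result = -1;
    int fd;

    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;   /* "instructions retired" */
    attr.exclude_kernel = include_kernel ? 0 : 1;
    attr.exclude_hv = 1;

    /* pid = 0 (this thread), cpu = -1 (any), no group, no flags */
    fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return -1;

    if (read(fd, &before, sizeof before) == (ssize_t)sizeof before &&
        read(fd, &after, sizeof after) == (ssize_t)sizeof after)
        result = (long long)(after - before);

    close(fd);
    return result;
}
```

Calling it once with `include_kernel = 0` and once with `include_kernel = 1` gives a feel for how much of the cost sits on each side of the kernel crossing. Swapping the event for PERF_COUNT_HW_BRANCH_INSTRUCTIONS would show how many branches each probe adds.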

If your kernel supports "User-Mode RDPMC" (CR4.PCE=1), you can read a performance counter with a single instruction. Examples are available at https://github.com/jdmccalpin/low-overhead-timers.
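On Linux, one way to see whether user-mode RDPMC is permitted is the `rdpmc` sysfs attribute described in the perf_event_open man page. The sketch below assumes the core PMU is exposed under the name `cpu`, which may not hold on every system, and the two helper names are hypothetical. As I understand the man page, 0 means user-space access is disabled, 1 means it is allowed for processes that have a perf event mapped, and 2 means it is always allowed.

```c
#include <stdio.h>

/* Parses the text of the sysfs "rdpmc" attribute.
 * Returns 0, 1 or 2, or -1 if the text is not one of those values. */
static int parse_rdpmc_mode(const char *text)
{
    int mode;

    if (text == NULL || sscanf(text, "%d", &mode) != 1 || mode < 0 || mode > 2)
        return -1;
    return mode;
}

/* Reads the attribute from sysfs; returns -1 if it is missing or malformed.
 * The path assumes the PMU is registered as "cpu". */
static int read_rdpmc_mode(void)
{
    char buf[16];
    char *ok;
    FILE *f = fopen("/sys/bus/event_source/devices/cpu/rdpmc", "r");

    if (f == NULL)
        return -1;
    ok = fgets(buf, sizeof buf, f);
    fclose(f);
    return ok ? parse_rdpmc_mode(buf) : -1;
}
```

If `read_rdpmc_mode()` returns 0 or -1, executing RDPMC from user space will most likely fault, so it is worth checking before adding inline assembly to the measured loop.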

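Even when the measurement code is limited to the native RDPMC instruction (and the surrounding code that saves the results), measurements are disruptive to the processor pipeline. RDPMC is a microcoded instruction. On the Ryzen core, the instruction executes 20 micro-ops and has a throughput of one instruction per 20 cycles. (Ref: https://www.agner.org/optimize/instruction_tables.pdf)

Any measurement at fine granularity is challenging because the out-of-order capabilities of modern processors interact with user code in ways that are poorly documented and difficult to anticipate. More notes on this topic (also relevant to AMD processors) are at http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/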

score 5 · Accepted Answer

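The perf_event_open() documentation describes how to correctly use rdpmc with events created through that interface. The approach described in @JohnDMcCalpin's answer also works, but it is based on programming the event control registers directly. Given a set of hardware events, figuring out how to schedule them on the available hardware performance counters can be difficult. The perf_event subsystem handles this for you, which is a major advantage.

The perf_event subsystem has supported rdpmc since Linux 3.4.

Starting with <linux/perf_event.h>, the following works:

Call perf_event_open() to prepare to read a counter with type = PERF_TYPE_HARDWARE and config = PERF_COUNT_HW_BRANCH_MISSES: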

struct perf_event_attr attr ;
int perf_fd ;

memset(&attr, 0, sizeof(attr)) ;

attr.type   = PERF_TYPE_HARDWARE ;
attr.config = PERF_COUNT_HW_BRANCH_MISSES;
attr.size = sizeof(attr) ;        // for completeness
attr.exclude_kernel = 1 ;         // count user-land events

perf_fd = (int)sys_perf_event_open(&attr, 0, -1, -1, PERF_FLAG_FD_CLOEXEC) ;
                                  // this pid, any cpu, no group_fd

where:

static long
sys_perf_event_open(struct perf_event_attr* attr,
                              pid_t pid, int cpu, int group_fd, ulong flags)
{
  return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags) ;
}

Associate perf_fd with an mmap page:
```
struct perf_event_mmap_page* perf_mm ;

perf_mm = mmap(NULL, page_size, PROT_READ, MAP_SHARED, perf_fd, 0) ;
```
page_size can be 4096, for example. This buffer is used to store samples. See the "Overflow handling" section of the documentation.
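Rather than hard-coding 4096, the page size can be queried at run time, and the mmap() result should be checked against MAP_FAILED. A minimal sketch, where `map_perf_page` is a hypothetical helper name:

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: maps the first page of a perf_events fd read-only.
 * Returns NULL on failure instead of MAP_FAILED. */
static struct perf_event_mmap_page *map_perf_page(int perf_fd)
{
    long page_size = sysconf(_SC_PAGESIZE);
    void *p;

    if (page_size <= 0)
        return NULL;

    p = mmap(NULL, (size_t)page_size, PROT_READ, MAP_SHARED, perf_fd, 0);
    return (p == MAP_FAILED) ? NULL : (struct perf_event_mmap_page *)p;
}
```

A NULL return here usually means perf_fd is invalid or the kernel refused the mapping, and it is cheaper to catch that once up front than to debug a fault inside the measured loop.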

To read the counter, you need to combine some of the information in perf_mm with what you read using the RDPMC instruction, like this:

uint64_t  offset, count ;
uint32_t  lock, check, a = 0, d = 0, idx ;   // a, d stay 0 if rdpmc is skipped

lock = perf_mm->lock ;
do
  {
    check = lock ;
    __asm__ volatile("":::"memory") ;
    idx = perf_mm->index ;
    // Check that you're allowed to execute rdpmc. You can do this check once.
    // Check also that the event is currently active (index != 0).
    // Starting with Linux 3.12, use cap_user_rdpmc.
    if (perf_mm->cap_user_rdpmc && idx) {
       // cap_user_rdpmc cannot change at this point because no code
       // that executes here changes it. So it's safe.
       // The hardware counter number passed to rdpmc is index - 1.
       __asm__ volatile("\t rdpmc\n" : "=a" (a), "=d" (d) : "c" (idx - 1)) ;
    }
    // In case of signed event counts, you have to use also pmc_width.
    // See the docs.
     offset = perf_mm->offset ;
    __asm__ volatile("":::"memory") ;
    lock = perf_mm->lock ;
  }
while (lock != check) ;

count = ((uint64_t)d << 32) + a ;
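if (perf_mm->pmc_width != 64)
  {
    // Sign-extend the low perf_mm->pmc_width bits of count
    // (relies on an arithmetic right shift of signed values, as GCC does).
    int shift = 64 - perf_mm->pmc_width ;
    count = (uint64_t)((int64_t)(count << shift) >> shift) ;
  } ;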
count += offset ;

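If the thread is not interrupted between the "start" and "end" reads, then I think we can assume that the perf_mm fields will not change. But if it is interrupted, the kernel may update the perf_mm fields to account for any changes that affect this timing.

Note: the overhead around the RDPMC instruction is not large, but I am experimenting with stripping all of this back to see whether I can use the RDPMC results directly, provided that perf_mm->lock does not change.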
