caching - How can I create a spectre gadget in practice?

Question

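I'm developing (NASM + GCC targeting ELF64) a PoC that uses a spectre gadget to measure the time needed to access a set of cache lines (FLUSH+RELOAD).

How can I make a reliable spectre gadget?

I believe I understand the theory behind the FLUSH+RELOAD technique; in practice, however, even allowing for some noise, I'm unable to produce a working PoC.

Since I'm using the timestamp counter and the loads are very regular, I use this script to disable the prefetchers and turbo boost and to fix/stabilize the CPU frequency: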

#!/bin/bash

sudo modprobe msr

#Disable turbo
sudo wrmsr -a 0x1a0 0x4000850089

#Disable prefetchers
sudo wrmsr -a 0x1a4 0xf

#Set performance governor
sudo cpupower frequency-set -g performance

#Minimum freq
sudo cpupower frequency-set -d 2.2GHz

#Maximum freq
sudo cpupower frequency-set -u 2.2GHz

I have a contiguous buffer, aligned on 4 KiB, large enough to span 256 cache lines separated by an integral number GAP of lines.

SECTION .bss ALIGN=4096

 buffer:    resb 256 * (1 + GAP) * 64

I use this function to flush the 256 lines.

flush_all:
 lea rdi, [buffer]              ;Start pointer
 mov esi, 256                   ;How many lines to flush

.flush_loop:
  lfence                        ;Prevent the previous clflush to be reordered after the load
  mov eax, [rdi]                ;Touch the page
  lfence                        ;Prevent the current clflush to be reordered before the load

  clflush  [rdi]                ;Flush a line
  add rdi, (1 + GAP)*64         ;Move to the next line

  dec esi
 jnz .flush_loop                ;Repeat

 lfence                         ;clflush are ordered with respect of fences ..
                                ;.. and lfence is ordered (locally) with respect of all instructions
 ret

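The function loops through all the lines, touching every page in between (each page more than once) and flushing each line.

Then I use this function to profile the accesses.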

profile:
 lea rdi, [buffer]           ;Pointer to the buffer
 mov esi, 256                ;How many lines to test
 lea r8, [timings_data]      ;Pointer to timings results

 mfence                      ;I'm pretty sure this is useless, but I included it to rule out ..
                             ;.. silly, hard to debug, scenarios

.profile: 
  mfence
  rdtscp
  lfence                     ;Read the TSC in-order (ignoring stores global visibility)

  mov ebp, eax               ;Read the low DWORD only (this is a short delay)

  ;PERFORM THE LOADING
  mov eax, DWORD [rdi]

  rdtscp
  lfence                     ;Again, read the TSC in-order

  sub eax, ebp               ;Compute the delta

  mov DWORD [r8], eax        ;Save it

  ;Advance the loop

  add r8, 4                  ;Move the results pointer
  add rdi, (1 + GAP)*64      ;Move to the next line

  dec esi                    ;Advance the loop
 jnz .profile

 ret

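An MCVE is given in the appendix, and a repository is available to clone.

When the program is assembled with GAP set to 0, linked, and executed with taskset -c 0, the cycles needed to fetch each line are as shown below (output image not included).

Only 64 lines are loaded from memory.

The output is stable across different runs. If I set GAP to 1, only 32 lines are fetched from memory. Of course 64 * (1+0) * 64 = 32 * (1+1) * 64 = 4096, so this may be related to paging?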
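The arithmetic behind that guess can be checked with a small C sketch. The helper names `line_offset` and `lines_in_first_page` and the constants are illustrative assumptions, not part of the MCVE. The sketch only counts how many of the 256 probed lines start inside the first 4 KiB of the buffer; it says nothing about why those are the slow ones.

```c
#include <stddef.h>

/* Sizes assumed from the question: 64-byte lines, 4 KiB pages, 256 probes. */
enum { LINE_BYTES = 64, PAGE_BYTES = 4096, PROBED_LINES = 256 };

/* Byte offset of the i-th probed line: i * (1 + GAP) * 64, as in the asm. */
static size_t line_offset(size_t i, size_t gap)
{
    return i * (1 + gap) * LINE_BYTES;
}

/* Count the probed lines whose offset falls inside the first page. */
static size_t lines_in_first_page(size_t gap)
{
    size_t count = 0;
    for (size_t i = 0; i < PROBED_LINES; i++) {
        if (line_offset(i, gap) < PAGE_BYTES)
            count++;
    }
    return count;
}
```

Calling `lines_in_first_page` with GAP values of 0 and 1 gives the same 64 and 32 that show up in the measurements, which is what makes the paging hypothesis plausible.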

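If a store to one of the first 64 lines is executed before the profiling (but after the flush), the output changes (output image not included).

A store to any of the other lines gives the first type of output.

I suspect the math is broken somewhere, but I need another pair of eyes to find out where.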

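EDIT

Hadi Brais pointed out a misuse of a volatile register; after fixing that, the output is now inconsistent.
I mostly see runs where the timings are low (~50 cycles) and sometimes runs where the timings are higher (~130 cycles).
I don't know where the 130-cycle figure comes from (too low for memory, too high for the cache?).

The code is fixed in the MCVE (and the repository).

If a store to any of the first lines is executed before the profiling, no change is reflected in the output.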

APPENDIX - MCVE

BITS 64
DEFAULT REL

GLOBAL main

EXTERN printf
EXTERN exit

;Space between lines in the buffer
%define GAP 0

SECTION .bss ALIGN=4096



 buffer:    resb 256 * (1 + GAP) * 64   


SECTION .data

 timings_data:  TIMES 256 dd 0


 strNewLine db `\n0x%02x: `, 0
 strHalfLine    db "  ", 0
 strTiming  db `\e[48;5;16`,
  .importance   db "0",
        db `m\e[38;5;15m%03u\e[0m `, 0  

 strEnd     db `\n\n`, 0

SECTION .text

;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .' 
;   '     '     '     '     '     '     '     '     '     '     '   
; _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \ 
;/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \
;
;
;FLUSH ALL THE LINES OF A BUFFER FROM THE CACHES
;
;

flush_all:
 lea rdi, [buffer]  ;Start pointer
 mov esi, 256       ;How many lines to flush

.flush_loop:
  lfence        ;Prevent the previous clflush to be reordered after the load
  mov eax, [rdi]    ;Touch the page
  lfence        ;Prevent the current clflush to be reordered before the load

  clflush  [rdi]    ;Flush a line
  add rdi, (1 + GAP)*64 ;Move to the next line

  dec esi
 jnz .flush_loop    ;Repeat

 lfence         ;clflush are ordered with respect of fences ..
            ;.. and lfence is ordered (locally) with respect of all instructions
 ret


;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .' 
;   '     '     '     '     '     '     '     '     '     '     '   
; _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \ 
;/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \
;
;
;PROFILE THE ACCESS TO EVERY LINE OF THE BUFFER
;
;


profile:
 lea rdi, [buffer]      ;Pointer to the buffer
 mov esi, 256           ;How many lines to test
 lea r8, [timings_data]     ;Pointer to timings results


 mfence             ;I'm pretty sure this is useless, but I included it to rule out ..
                ;.. silly, hard to debug, scenarios

.profile: 
  mfence
  rdtscp
  lfence            ;Read the TSC in-order (ignoring stores global visibility)

  mov ebp, eax          ;Read the low DWORD only (this is a short delay)

  ;PERFORM THE LOADING
  mov eax, DWORD [rdi]

  rdtscp
  lfence            ;Again, read the TSC in-order

  sub eax, ebp          ;Compute the delta

  mov DWORD [r8], eax       ;Save it

  ;Advance the loop

  add r8, 4         ;Move the results pointer
  add rdi, (1 + GAP)*64     ;Move to the next line

  dec esi           ;Advance the loop
 jnz .profile

 ret

;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .' 
;   '     '     '     '     '     '     '     '     '     '     '   
; _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \ 
;/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \
;
;
;SHOW THE RESULTS
;
;

show_results:
 lea rbx, [timings_data]    ;Pointer to the timings
 xor r12, r12           ;Counter (up to 256)

.print_line:

 ;Format the output

 xor eax, eax
 mov esi, r12d
 lea rdi, [strNewLine]      ;Setup for a call to printf

 test r12d, 0fh
 jz .print          ;Test if counter is a multiple of 16

 lea rdi, [strHalfLine]     ;Setup for a call to printf

 test r12d, 07h         ;Test if counter is a multiple of 8
 jz .print

.print_timing:

  ;Print
  mov esi, DWORD [rbx]      ;Timing value

  ;Compute the color
  mov r10d, 60          ;Used to compute the color 
  mov eax, esi
  xor edx, edx
  div r10d          ;eax = Timing value / 60

  ;Update the color 


  add al, '0'
  mov edx, '5'
  cmp eax, edx
  cmova eax, edx
  mov BYTE [strTiming.importance], al

  xor eax, eax
  lea rdi, [strTiming]
  call printf WRT ..plt     ;Print a 3-digits number

  ;Advance the loop 

  inc r12d          ;Increment the counter
  add rbx, 4            ;Move to the next timing
  cmp r12d, 256
 jb .print_line         ;Advance the loop

  xor eax, eax
  lea rdi, [strEnd]
  call printf WRT ..plt     ;Print a new line

  ret

.print:

  call printf WRT ..plt     ;Print a string

jmp .print_timing

;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .' 
;   '     '     '     '     '     '     '     '     '     '     '   
; _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \ 
;/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \
;
;
;E N T R Y   P O I N T
;
;
;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .' 
;   '     '     '     '     '     '     '     '     '     '     '   
; _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \ 
;/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \

main:

 ;Flush all the lines of the buffer
 call flush_all

 ;Test the access times
 call profile

 ;Show the results
 call show_results

 ;Exit
 xor edi, edi
 call exit WRT ..plt

score 2 · Accepted Answer

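The buffer is allocated from the bss section, so when the program is loaded, the OS maps all of the buffer cache lines to the same CoW physical page. After flushing all of the lines, only the accesses to the first 64 lines in the virtual address space miss in all cache levels¹, because all² later accesses are to the same 4K page. That's why, when GAP is zero, the latencies of the first 64 accesses fall in the range of the main memory latency and the latencies of all later accesses are equal to the L1 hit latency³.

When GAP is 1, every other line of the same physical page is accessed, so the number of main memory accesses (L3 misses) is 32 (half of 64). That is, the first 32 latencies will be in the range of the main memory latency and all later latencies will be L1 hits. Similarly, when GAP is 63, all accesses are to the same line. Therefore, only the first access will miss all caches.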
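These counts can be reproduced with a short C sketch. It assumes, as described above, that every virtual page of the buffer is backed by the same physical page, so the physical line a probe touches depends only on its offset within its page. The function name `distinct_physical_lines` and the constants are hypothetical and not taken from the MCVE.

```c
#include <stdbool.h>
#include <stddef.h>

enum { LINE_BYTES = 64, PAGE_BYTES = 4096, PROBED_LINES = 256 };
enum { LINES_PER_PAGE = PAGE_BYTES / LINE_BYTES };

/* Assumption: all virtual pages of the buffer alias one physical page.
   Returns how many different physical lines the 256 probes touch, which
   is also the number of probes expected to miss every cache level. */
static size_t distinct_physical_lines(size_t gap)
{
    bool seen[LINES_PER_PAGE] = { false };
    size_t distinct = 0;

    for (size_t i = 0; i < PROBED_LINES; i++) {
        size_t offset = i * (1 + gap) * LINE_BYTES;        /* virtual offset */
        size_t line = (offset % PAGE_BYTES) / LINE_BYTES;  /* line in the shared page */
        if (!seen[line]) {
            seen[line] = true;
            distinct++;
        }
    }
    return distinct;
}
```

For GAP values of 0, 1 and 63 this yields 64, 32 and 1, matching the number of slow accesses described above.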

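The solution is to change mov eax, [rdi] in flush_all to mov dword [rdi], 0 to ensure that the buffer is allocated in unique physical pages. (The lfence instructions in flush_all can be removed because the Intel manual states that clflush cannot be reordered with writes⁴.) This guarantees that, after initializing and flushing all lines, all accesses will miss all cache levels (but not the TLB, see: Does clflush also remove TLB entries?).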
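For readers who drive the experiment from the GCC side, the same idea can be sketched in C. This is only an illustration under assumptions: the names `buffer` and `touch_with_stores` are hypothetical, it relies on the compiler placing the zero-initialized array in .bss like the NASM buffer, and it covers only the store step (the clflush loop stays as in flush_all). A byte store is used here in place of the dword store, since any write to a page should be enough to trigger the copy.

```c
#include <stddef.h>
#include <stdint.h>

#ifndef GAP
#define GAP 0   /* same meaning as in the MCVE: lines skipped between probes */
#endif

enum { LINE_BYTES = 64, PROBED_LINES = 256 };

/* Zero-initialized static storage, 4 KiB aligned like the MCVE's buffer. */
static _Alignas(4096) uint8_t buffer[PROBED_LINES * (1 + GAP) * LINE_BYTES];

/* Store to every probed line before any flushing or timing, so that each
   page is written at least once. Returns the number of lines written. */
static size_t touch_with_stores(void)
{
    size_t written = 0;
    for (size_t i = 0; i < PROBED_LINES; i++) {
        /* volatile keeps the compiler from dropping the "useless" store */
        volatile uint8_t *p = buffer + i * (1 + GAP) * LINE_BYTES;
        *p = 0;   /* counterpart of: mov dword [rdi], 0 */
        written++;
    }
    return written;
}
```

The point of the design is that the pages are written, not merely read, before the lines are flushed and timed.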

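You can refer to Why are the user-mode L1 store miss events only counted when there is a store initialization loop? for another example where CoW pages can be deceiving.

In the previous version of this answer I suggested removing the call to flush_all and using a GAP value of 63. With these changes, all of the access latencies appeared to be very high, and I incorrectly concluded that all of the accesses were missing all cache levels. As I said above, with a GAP value of 63 all of the accesses go to the same cache line, which is actually resident in the L1 cache. The reason all of the latencies were high is that every access was to a different virtual page, and the TLB didn't hold a mapping for any of these virtual pages (to the same physical page) because, with the call to flush_all removed, none of the virtual pages had been touched before. So the measured latencies represent the TLB miss latency, even though the line being accessed is in the L1 cache.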
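The number of virtual pages involved can be worked out with a small C sketch. The function name `distinct_virtual_pages` and the constants are assumptions made for illustration; the sketch only counts pages and does not model the TLB itself.

```c
#include <stddef.h>

enum { LINE_BYTES = 64, PAGE_BYTES = 4096, PROBED_LINES = 256 };

/* Count how many different virtual pages the 256 probes fall into.
   Offsets grow monotonically, so counting page changes is enough. */
static size_t distinct_virtual_pages(size_t gap)
{
    size_t pages = 0;
    size_t last_page = (size_t)-1;   /* sentinel: no page seen yet */

    for (size_t i = 0; i < PROBED_LINES; i++) {
        size_t page = (i * (1 + gap) * LINE_BYTES) / PAGE_BYTES;
        if (page != last_page) {
            pages++;
            last_page = page;
        }
    }
    return pages;
}
```

With a GAP of 63 the stride is exactly one page, so each of the 256 probes lands in its own virtual page, whereas with a GAP of 0 the probes share just 4 pages.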

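I also incorrectly claimed in the previous version of this answer that there is L3 prefetching logic that cannot be disabled through MSR 0x1A4. If a particular prefetcher is turned off by setting its flag in MSR 0x1A4, then it does indeed get fully switched off. Also, there are no data prefetchers other than the ones documented by Intel.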

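Footnotes:

(1) If you don't disable the DCU IP prefetcher, it will actually prefetch all the lines back into the L1 after flushing them, so all accesses will still hit in the L1.

(2) In rare cases, the execution of interrupt handlers or the scheduling of other threads on the same core may cause some of the lines to be evicted from the L1 and potentially from other levels of the cache hierarchy.

(3) Remember that you need to subtract the overhead of the rdtscp instructions. Note that the measurement method you used doesn't actually let you reliably distinguish between an L1 hit and an L2 hit. See: Memory latency measurement with time stamp counter.

(4) The Intel manual doesn't seem to specify whether clflush is ordered with reads, but it appears to me that it is.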
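As a rough sketch of the correction mentioned in footnote (3), the C helpers below subtract a baseline from raw timings. They assume the baseline samples come from timing an empty rdtscp/rdtscp pair, and that taking the minimum of those samples is an acceptable overhead estimate; both are assumptions of this sketch, and the names `min_sample` and `subtract_overhead` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Smallest value in a non-empty array of cycle counts. */
static uint32_t min_sample(const uint32_t *samples, size_t n)
{
    uint32_t best = samples[0];
    for (size_t i = 1; i < n; i++) {
        if (samples[i] < best)
            best = samples[i];
    }
    return best;
}

/* Subtract the estimated measurement overhead from each timing in place,
   clamping at zero so a noisy sample cannot wrap around. */
static void subtract_overhead(uint32_t *timings, size_t n, uint32_t overhead)
{
    for (size_t i = 0; i < n; i++)
        timings[i] = (timings[i] > overhead) ? timings[i] - overhead : 0;
}
```

In the MCVE this would correspond to filling a second array with empty measurements, passing it to `min_sample`, and then applying `subtract_overhead` to timings_data before printing.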
