我正在开发(NASM + GCC 针对 ELF64)一个PoC,它使用一个幽灵小工具来测量访问一组缓存行(FLUSH+RELOAD)的时间。
如何制作可靠的幽灵小工具?
我相信我理解 FLUSH+RELOAD 技术背后的理论,但是在实践中,尽管有一些噪音,我无法产生一个有效的 PoC。
由于我使用的是时间戳计数器并且负载非常规律,因此我使用此脚本来禁用预取器、涡轮增压器并修复/稳定 CPU 频率:
#!/bin/bash
sudo modprobe msr
#Disable turbo
sudo wrmsr -a 0x1a0 0x4000850089
#Disable prefetchers
sudo wrmsr -a 0x1a4 0xf
#Set performance governor
sudo cpupower frequency-set -g performance
#Minimum freq
sudo cpupower frequency-set -d 2.2GHz
#Maximum freq
sudo cpupower frequency-set -u 2.2GHz
我有一个连续的缓冲区,在 4KiB 上对齐,大到足以跨越 256 个缓存行,由整数GAP行分隔。
SECTION .bss ALIGN=4096
buffer: resb 256 * (1 + GAP) * 64
我使用这个函数来刷新 256 行。
flush_all:
lea rdi, [buffer] ;Start pointer
mov esi, 256 ;How many lines to flush
.flush_loop:
lfence ;Prevent the previous clflush to be reordered after the load
mov eax, [rdi] ;Touch the page
lfence ;Prevent the current clflush to be reordered before the load
clflush [rdi] ;Flush a line
add rdi, (1 + GAP)*64 ;Move to the next line
dec esi
jnz .flush_loop ;Repeat
lfence ;clflush are ordered with respect of fences ..
;.. and lfence is ordered (locally) with respect of all instructions
ret
该函数循环遍历所有行,触摸中间的每一页(每页不止一次)并刷新每一行。
然后我使用这个函数来分析访问。
profile:
lea rdi, [buffer] ;Pointer to the buffer
mov esi, 256 ;How many lines to test
lea r8, [timings_data] ;Pointer to timings results
mfence ;I'm pretty sure this is useless, but I included it to rule out ..
;.. silly, hard to debug, scenarios
.profile:
mfence
rdtscp
lfence ;Read the TSC in-order (ignoring stores global visibility)
mov ebp, eax ;Read the low DWORD only (this is a short delay)
;PERFORM THE LOADING
mov eax, DWORD [rdi]
rdtscp
lfence ;Again, read the TSC in-order
sub eax, ebp ;Compute the delta
mov DWORD [r8], eax ;Save it
;Advance the loop
add r8, 4 ;Move the results pointer
add rdi, (1 + GAP)*64 ;Move to the next line
dec esi ;Advance the loop
jnz .profile
ret
附录中提供了 MCVE,并且可以克隆存储库。
当汇编GAP
设置为 0 时,链接并执行taskset -c 0
获取每一行所需的周期如下所示。
从内存中只加载了 64 行。
在不同的运行中输出是稳定的。如果我设置GAP
为 1 只从内存中取出 32 行,当然是 64 * (1+0) * 64 = 32 * (1+1) * 64 = 4096,所以这可能与分页有关?
如果在分析之前(但在刷新之后)执行存储到前 64 行之一,则输出将更改为
其他行的任何存储都给出了第一种类型的输出。
我怀疑里面的数学是坏的,但我需要另一双眼睛找出哪里。
编辑
Hadi Brais在修复了输出现在不一致的问题后指出了对易失性寄存器的滥用。
我看到通常在时间较低(~50 个周期)的地方运行,有时在时间较高的地方(~130 个周期)运行。
我不知道 130 个周期的数字来自哪里(内存太低,缓存太高?)。
代码在 MCVE(和存储库)中是固定的。
如果在分析之前执行任何第一行的存储,则输出中不会反映任何更改。
附录 - MCVE
BITS 64
DEFAULT REL
GLOBAL main
EXTERN printf
EXTERN exit
;Space between lines in the buffer
%define GAP 0
SECTION .bss ALIGN=4096
buffer: resb 256 * (1 + GAP) * 64
SECTION .data
timings_data: TIMES 256 dd 0
strNewLine db `\n0x%02x: `, 0
strHalfLine db " ", 0
strTiming db `\e[48;5;16`,
.importance db "0",
db `m\e[38;5;15m%03u\e[0m `, 0
strEnd db `\n\n`, 0
SECTION .text
;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .'
; ' ' ' ' ' ' ' ' ' ' '
; _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \
;/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \
;
;
;FLUSH ALL THE LINES OF A BUFFER FROM THE CACHES
;
;
flush_all:
lea rdi, [buffer] ;Start pointer
mov esi, 256 ;How many lines to flush
.flush_loop:
lfence ;Prevent the previous clflush to be reordered after the load
mov eax, [rdi] ;Touch the page
lfence ;Prevent the current clflush to be reordered before the load
clflush [rdi] ;Flush a line
add rdi, (1 + GAP)*64 ;Move to the next line
dec esi
jnz .flush_loop ;Repeat
lfence ;clflush are ordered with respect of fences ..
;.. and lfence is ordered (locally) with respect of all instructions
ret
;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .'
; ' ' ' ' ' ' ' ' ' ' '
; _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \
;/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \
;
;
;PROFILE THE ACCESS TO EVERY LINE OF THE BUFFER
;
;
profile:
lea rdi, [buffer] ;Pointer to the buffer
mov esi, 256 ;How many lines to test
lea r8, [timings_data] ;Pointer to timings results
mfence ;I'm pretty sure this is useless, but I included it to rule out ..
;.. silly, hard to debug, scenarios
.profile:
mfence
rdtscp
lfence ;Read the TSC in-order (ignoring stores global visibility)
mov ebp, eax ;Read the low DWORD only (this is a short delay)
;PERFORM THE LOADING
mov eax, DWORD [rdi]
rdtscp
lfence ;Again, read the TSC in-order
sub eax, ebp ;Compute the delta
mov DWORD [r8], eax ;Save it
;Advance the loop
add r8, 4 ;Move the results pointer
add rdi, (1 + GAP)*64 ;Move to the next line
dec esi ;Advance the loop
jnz .profile
ret
;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .'
; ' ' ' ' ' ' ' ' ' ' '
; _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \
;/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \
;
;
;SHOW THE RESULTS
;
;
show_results:
lea rbx, [timings_data] ;Pointer to the timings
xor r12, r12 ;Counter (up to 256)
.print_line:
;Format the output
xor eax, eax
mov esi, r12d
lea rdi, [strNewLine] ;Setup for a call to printf
test r12d, 0fh
jz .print ;Test if counter is a multiple of 16
lea rdi, [strHalfLine] ;Setup for a call to printf
test r12d, 07h ;Test if counter is a multiple of 8
jz .print
.print_timing:
;Print
mov esi, DWORD [rbx] ;Timing value
;Compute the color
mov r10d, 60 ;Used to compute the color
mov eax, esi
xor edx, edx
div r10d ;eax = Timing value / 78
;Update the color
add al, '0'
mov edx, '5'
cmp eax, edx
cmova eax, edx
mov BYTE [strTiming.importance], al
xor eax, eax
lea rdi, [strTiming]
call printf WRT ..plt ;Print a 3-digits number
;Advance the loop
inc r12d ;Increment the counter
add rbx, 4 ;Move to the next timing
cmp r12d, 256
jb .print_line ;Advance the loop
xor eax, eax
lea rdi, [strEnd]
call printf WRT ..plt ;Print a new line
ret
.print:
call printf WRT ..plt ;Print a string
jmp .print_timing
;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .'
; ' ' ' ' ' ' ' ' ' ' '
; _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \
;/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \
;
;
;E N T R Y P O I N T
;
;
;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .'
; ' ' ' ' ' ' ' ' ' ' '
; _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \
;/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \
main:
;Flush all the lines of the buffer
call flush_all
;Test the access times
call profile
;Show the results
call show_results
;Exit
xor edi, edi
call exit WRT ..plt