assembly - How many ways are there to set a register to zero?

Question

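I'm curious how many ways there are to set a register to zero in x86 assembly, using one instruction. Someone told me that he managed to find at least 10 ways to do it.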

The ones I can think of are:

xor ax,ax
mov ax, 0
and ax, 0

score 18 · Accepted Answer

There are a lot of ways to move 0 into ax under IA32...

    lea eax, [0]
    mov eax, 0FFFF0000h         //Any constant from 0..0FFFFh << 16 (low 16 bits are 0)
    shr  ax, 16                 //Any count from 16..31
    shl eax, 16                 //Any count from 16..31

Maybe the strangest one... :)

@movzx:
    movzx eax, byte ptr[@movzx + 6]   //Because the last byte of this instruction is 0

And also in 32-bit mode (the longer instruction puts the final (most significant) address byte later)...

  @movzx:
    movzx ax, byte ptr[@movzx + 7]
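A small Python sketch (written for this edit, not part of the answer) of why +6 and +7 are the right offsets in 32-bit mode. It builds both encodings, assuming the absolute disp32 form 0F B6 05 disp32 (plus a 66 operand-size prefix for the ax version) and a hypothetical label address below 16 MiB, so the most significant address byte, which is the instruction's last byte, is 0.

```python
import struct

def movzx_r32_abs(addr: int) -> bytes:
    """Encode `movzx eax, byte [addr]` for 32-bit mode.

    0F B6 /r is MOVZX r32, r/m8; ModRM 0x05 = mod 00, reg eax, rm 101
    (a bare disp32 with no base or index); the address follows little-endian.
    """
    return bytes([0x0F, 0xB6, 0x05]) + struct.pack("<I", addr)

def movzx_r16_abs(addr: int) -> bytes:
    """Same with a 66 operand-size prefix: `movzx ax, byte [addr]`."""
    return bytes([0x66]) + movzx_r32_abs(addr)

def self_loaded_byte(code: bytes, offset: int) -> int:
    """The byte the instruction reads when it points at (its own address + offset)."""
    return code[offset]

label = 0x00401000  # hypothetical address of @movzx
insn = movzx_r32_abs(label + 6)
print(insn.hex(" "), "-> loads", self_loaded_byte(insn, 6))
```

If the label sits at or above 0x01000000, the last byte is nonzero and the instruction loads that byte instead of 0.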

EDIT:

For 16-bit x86 CPU mode, untested...:

    lea  ax, [0]

and...

  @movzx:
    movzx ax, byte ptr cs:[@movzx + 5]   //2E 0F B6 06 disp16: the last (high) address byte is at +5, and it is 0 only if @movzx < 100h

The cs: prefix is only required if the DS segment register is not equal to CS; if DS = CS, it can be omitted.

score 13 · Accepted Answer

See this answer for the best way to zero registers: xor eax,eax (performance advantages and smaller encoding).

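I'm only going to consider ways a single instruction can zero a register. There are far too many ways if you allow loading a zero from memory, so we'll mostly exclude instructions that load from memory.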

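I found 10 different single instructions that zero a 32-bit register (and thus the full 64-bit register in long mode), with no pre-conditions or loads from any other memory. This is not counting different encodings of the same insn, or the different forms of mov. If you count loading from memory that's known to hold a zero, or from segment registers or whatever, there are a boatload of ways. There are also a zillion ways to zero vector registers.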

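For most of these, the eax and rax versions are separate encodings of the same functionality, both zeroing the full 64-bit register, either by implicitly zeroing the upper half or by explicitly writing the full register with a REX.W prefix.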
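To illustrate the implicit zeroing of the upper half, here is a Python model (an illustration only, not CPU code) of how writing a destination of each width affects a 64-bit register in long mode: 32-bit writes zero-extend, while 16-bit and low-byte writes merge into the old value. High-byte (ah-style) writes are not modeled.

```python
MASK64 = (1 << 64) - 1

def write_reg(old: int, value: int, width: int) -> int:
    """Return the new 64-bit register after writing `value` to its low `width` bits."""
    if width == 64:
        return value & MASK64
    if width == 32:
        # e.g. xor eax, eax: the 32-bit result is zero-extended into rax
        return value & 0xFFFFFFFF
    if width in (8, 16):
        # e.g. xor ax, ax / xor al, al: bits above `width` keep their old value
        mask = (1 << width) - 1
        return (old & ~mask & MASK64) | (value & mask)
    raise ValueError("width must be 8, 16, 32 or 64")

old_rax = 0xDEADBEEFCAFEF00D
for w in (64, 32, 16, 8):
    print(w, hex(write_reg(old_rax, 0, w)))
```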

Integer registers (NASM syntax):

# Works on any reg unless noted, usually of any size.  eax/ax/al as placeholders
and    eax, 0         ; three encodings: imm8, imm32, and eax-only imm32
andn   eax, eax,eax   ; BMI1 instruction set: dest = ~s1 & s2
imul   eax, any,0     ; eax = something * 0.  two encodings: imm8, imm32
lea    eax, [0]       ; absolute encoding (disp32 with no base or index).  Use [abs 0] in NASM if you used DEFAULT REL
lea    eax, [rel 0]   ; YASM supports this, but NASM doesn't: use a RIP-relative encoding to address a specific absolute address, making position-dependent code

mov    eax, 0         ; 5 bytes to encode (B8 imm32)
mov    rax, strict dword 0   ; 7 bytes: REX mov r/m64, sign-extended-imm32.    NASM optimizes mov rax,0 to the 5B version, but dword or strict dword stops it for some reason
mov    rax, strict qword 0   ; 10 bytes to encode (REX B8 imm64).  movabs mnemonic for AT&T.  normally assemblers choose smaller encodings if the operand fits, but strict qword forces the imm64.

sub    eax, eax       ; recognized as a zeroing idiom on some but maybe not all CPUs
xor    eax, eax       ; Preferred idiom: recognized on all CPUs
                      ; 2 same-size encodings each: r/m, r  vs.  r, r/m

@movzx:
  movzx eax, byte ptr[@movzx + 6]   //Assuming the high byte of the absolute address is 0.  Not position-independent, and x86-64 RIP+rel32 would load 0xFF

.l: loop .l             ; clears e/rcx... eventually.  from I. J. Kennedy's answer.  To operate on only ECX, use an address-size prefix.
; rep lodsb             ; not counted because it's not safe (potential segfaults), but also zeros ecx

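Instructions like xor reg,reg can be encoded two different ways. In GAS AT&T syntax, we can request which opcode the assembler chooses. This only works for reg,reg integer instructions that allow both forms, i.e. that date back to 8086. So not SSE/AVX.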

  {load}  xor %eax, %eax           # 33 c0
  {store} xor %eax, %eax           # 31 c0
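A minimal Python sketch of the two encodings, based on Intel's opcode map: 31 /r is XOR r/m32, r32 and 33 /r is XOR r32, r/m32, which differ only in opcode bit 1 (the direction bit). Register numbers (eax=0, ecx=1, ebx=3, ...) are used directly; prefixes and memory operands are not handled.

```python
def encode_xor_r32(dst: int, src: int, form: str) -> bytes:
    """Encode register-register `xor dst, src` (Intel operand order).

    "store" form: 31 /r = XOR r/m32, r32 -> ModRM.rm = dst, ModRM.reg = src
    "load"  form: 33 /r = XOR r32, r/m32 -> ModRM.reg = dst, ModRM.rm = src
    """
    if form == "store":
        opcode, reg, rm = 0x31, src, dst
    elif form == "load":
        opcode, reg, rm = 0x33, dst, src
    else:
        raise ValueError(form)
    modrm = 0xC0 | (reg << 3) | rm  # mod = 11: both operands are registers
    return bytes([opcode, modrm])

EAX, ECX, EBX = 0, 1, 3
for form in ("store", "load"):
    print(form, encode_xor_r32(EAX, EAX, form).hex(" "))
```

With dst == src the ModRM byte is the same either way (C0 for eax), so only the opcode byte differs.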

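"Shifting all the bits out one end" isn't possible for regular-size GP registers, only for partial registers. shl and shr shift counts are masked (on 286 and later): count & 31, i.e. mod 32 (count & 63 with 64-bit operand size).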

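(Immediate-count shifts were new in 186 (previously only CL and implicit-1), so there are CPUs with unmasked immediate shifts (which also include NEC V30). Also, 286 and earlier are 16-bit only, so ax is a "full" register. There were CPUs where a shift can zero a full integer register.)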

Also note that shift counts for vector shifts saturate instead of wrapping.
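A Python model (a sketch of the rules described above, not a full emulator) contrasting GP-register shifts, whose count is masked, with an SSE2 element shift like PSLLW, whose count saturates. Flags are ignored.

```python
def _mask_count(count: int, width: int) -> int:
    # 286 and later: the count is taken mod 32, or mod 64 for 64-bit operands
    return count & (63 if width == 64 else 31)

def gp_shl(value: int, count: int, width: int) -> int:
    """SHL r/m(width), count on a general-purpose register."""
    return (value << _mask_count(count, width)) & ((1 << width) - 1)

def gp_shr(value: int, count: int, width: int) -> int:
    """SHR r/m(width), count on a general-purpose register."""
    return (value & ((1 << width) - 1)) >> _mask_count(count, width)

def psllw_element(word: int, count: int) -> int:
    """One 16-bit element of PSLLW: counts above 15 give 0 instead of wrapping."""
    if count > 15:
        return 0
    return (word << count) & 0xFFFF

print(hex(gp_shl(0x1234, 16, 16)))      # shl ax, 16
print(hex(gp_shl(0x12345678, 32, 32)))  # shl eax, 32 is a no-op
print(hex(psllw_element(0x1234, 16)))   # psllw with count 16
```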

# Zeroing methods that only work on 16bit or 8bit regs:
shl    ax, 16           ; shift count is still masked to 0x1F for any operand size less than 64b.  i.e. count %= 32
shr    al, 16           ; so 8b and 16b shifts can zero registers.

# zeroing ah/bh/ch/dh:  the low byte of the reg gets whatever garbage was in ah (the old high-8 reg)
movzx  eax, ah          ; From Jerry Coffin's answer

Depending on other existing conditions (other than having a zero in another reg):

bextr  eax,  any, eax  ; if al >= 32, or ah = 0.  BMI1
BLSR   eax,  src       ; if src has at most one set bit
CDQ                    ; edx = sign-extend(eax): zero if eax is non-negative
sbb    eax, eax        ; if CF=0.  (Only recognized on AMD CPUs as dependent only on flags (not eax))
setcc  al              ; with a condition that will produce a zero based on known state of flags

PSHUFB   xmm0, all-ones  ; xmm0 bytes are cleared when the mask bytes have their high bit set
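The conditions in the list above can be checked with a small Python model of each instruction's 32-bit result (a sketch; flags and 64-bit forms are ignored, and the function names are made up for this example).

```python
M32 = 0xFFFFFFFF

def bextr32(src: int, control: int) -> int:
    """BEXTR r32, r/m32, r32: start = control[7:0], length = control[15:8]."""
    start = control & 0xFF
    length = (control >> 8) & 0xFF
    # Bits above bit 31 of src read as zero, so start >= 32 or length == 0 gives 0
    return ((src & M32) >> start) & ((1 << length) - 1) & M32

def blsr32(src: int) -> int:
    """BLSR: reset the lowest set bit, i.e. src & (src - 1)."""
    return src & (src - 1) & M32

def cdq_edx(eax: int) -> int:
    """CDQ: EDX becomes 32 copies of EAX's sign bit."""
    return M32 if eax & 0x80000000 else 0

def sbb_self(cf: int) -> int:
    """sbb eax, eax computes eax - eax - CF."""
    return (0 - cf) & M32
```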

Vector registers:

Some of these SSE2 integer instructions can also be used on MMX registers (mm0 - mm7). I'm not going to show those separately.

Again, the best choice is some form of xor: either PXOR / VPXOR, or XORPS / VXORPS. See What is the best way to set a register to zero in x86 assembly: xor, mov or and? for details.

AVX vxorps xmm0,xmm0,xmm0 zeros the full ymm0/zmm0, and is better than vxorps ymm0,ymm0,ymm0 on AMD CPUs.

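These zeroing instructions each have three encodings: legacy SSE, AVX (VEX prefix), and AVX512 (EVEX prefix), although the SSE version only zeros the bottom 128 bits, which isn't the full register on CPUs that support AVX or AVX512. Anyway, depending on how you count, each entry can be three different instructions (same opcode, though, just different prefixes). The exception is vzeroall, which AVX512 didn't change (and which doesn't zero zmm16-31).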

PXOR       xmm0, xmm0     ;; recommended
XORPS      xmm0, xmm0     ;; or this
XORPD      xmm0, xmm0     ;; longer encoding for zero benefit
PXOR       mm0, mm0     ;; MMX, not shown for the rest of the integer insns

ANDNPD    xmm0, xmm0
ANDNPS    xmm0, xmm0
PANDN     xmm0, xmm0     ; dest = ~dest & src

PCMPGTB   xmm0, xmm0     ; n > n is always false.
PCMPGTW   xmm0, xmm0     ; similarly, pcmpeqd is a good way to do _mm_set1_epi32(-1)
PCMPGTD   xmm0, xmm0
PCMPGTQ   xmm0, xmm0     ; SSE4.2, and slower than byte/word/dword

PSADBW    xmm0, xmm0     ; sum of absolute differences
MPSADBW   xmm0, xmm0, 0  ; SSE4.1.  Caution: not a general zeroing idiom, unlike PSADBW.  Only word 0 compares a block with itself; words 1..7 use a sliding window (bytes i..i+3 vs. 0..3)

  ; shift-counts saturate and zero the reg, unlike for GP-register shifts
PSLLDQ    xmm0, 16       ;  left-shift the bytes in xmm0
PSRLDQ    xmm0, 16       ; right-shift the bytes in xmm0
PSLLW     xmm0, 16       ; left-shift the bits in each word
PSLLD     xmm0, 32       ;           double-word
PSLLQ     xmm0, 64       ;             quad-word
PSRLW/PSRLD/PSRLQ  ; same but right shift

PSUBB/W/D/Q   xmm0, xmm0     ; subtract packed elements, byte/word/dword/qword
PSUBSB/W   xmm0, xmm0     ; sub with signed saturation
PSUBUSB/W  xmm0, xmm0     ; sub with unsigned saturation

;; SSE4.1
INSERTPS   xmm0, xmm1, 0x0F   ; imm[3:0] = zmask = all elements zeroed.
DPPS       xmm0, xmm1, 0x00   ; imm[7:4] => inputs = treat as zero -> no FP exceptions.  imm[3:0] => outputs = 0 as well, for good measure
DPPD       xmm0, xmm1, 0x00   ; inputs = all zeroed -> no FP exceptions.  outputs = 0

VZEROALL                      ; AVX1  x/y/zmm0..15 not zmm16..31
VPERM2I/F128  ymm0, ymm1, ymm2, 0x88   ; imm[3] and [7] zero that output lane

# Can raise an exception on SNaN, so only usable if you know exceptions are masked
CMPLTPD    xmm0, xmm0         # exception on QNaN or SNaN, or denormal
VCMPLT_OQPD xmm0, xmm0,xmm0   # exception only on SNaN or denormal
VCMPLT_OQPS ditto

VCMPFALSE_OQPD xmm0, xmm0, xmm0   # This is really just another imm8 predicate value for the same VCMPPD xmm,xmm,xmm, imm8 instruction.  Same exception behaviour as LT_OQ.

SUBPS xmm0, xmm0 and similar won't work, because NaN - NaN = NaN (and Inf - Inf = NaN), not zero.
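Python floats are IEEE 754 doubles, so they can stand in for one lane of SUBPD / CMPLTPD (the x87 fsub case is analogous). This sketch shows why x - x fails as a zeroing idiom while x < x is always false; it does not model MXCSR exception flags.

```python
import math

def sub_self(x: float) -> float:
    """One lane of SUBPD xmm0, xmm0."""
    return x - x

def cmplt_self(x: float) -> bool:
    """One lane of CMPLTPD xmm0, xmm0 (False means the lane becomes all-zero bits)."""
    return x < x

for x in (1.5, -0.0, math.inf, math.nan):
    print(x, sub_self(x), cmplt_self(x))
```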

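Also, FP instructions can raise exceptions on NaN arguments, so even CMPPS/PD is only safe if you know exceptions are masked and you don't care about possibly setting the exception bits in MXCSR. Even the AVX version, with its expanded choice of predicates, will raise #IA on SNaN. The "quiet" predicates only suppress #IA for QNaN. CMPPS/PD can also raise the Denormal exception. (The AVX512 EVEX encodings can suppress FP exceptions for 512-bit vectors, along with overriding the rounding mode.)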

(See the table in the insn set ref entry for CMPPD, or preferably in Intel's original PDF, since the HTML extract mangles that table.)

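AVX1/2 and AVX512 EVEX forms of the above, just for PXOR: these all zero the full ZMM destination. PXOR has two EVEX versions, VPXORD and VPXORQ, allowing masking with dword or qword elements. (XORPS/PD already distinguishes element size in the mnemonic, so AVX512 didn't change that. In the legacy SSE encoding, XORPD is always a pointless waste of code size (larger opcode) vs. XORPS on all CPUs.)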

VPXOR      xmm15, xmm0, xmm0      ; AVX1 VEX
VPXOR      ymm15, ymm0, ymm0      ; AVX2 VEX, less efficient on some CPUs
VPXORD     xmm31, xmm0, xmm0      ; AVX512VL EVEX
VPXORD     ymm31, ymm0, ymm0      ; AVX512VL EVEX 256-bit
VPXORD     zmm31, zmm0, zmm0      ; AVX512F EVEX 512-bit

VPXORQ     xmm31, xmm0, xmm0      ; AVX512VL EVEX
VPXORQ     ymm31, ymm0, ymm0      ; AVX512VL EVEX 256-bit
VPXORQ     zmm31, zmm0, zmm0      ; AVX512F EVEX 512-bit

Different vector widths are listed as separate entries in Intel's PXOR manual entry.

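You can use zero-masking (but not merge-masking) with any mask register you like; it doesn't matter whether you get a zero from the mask or a zero from the vector instruction's normal output. But that's not a different instruction. e.g.: VPXORD xmm16{k1}{z}, xmm0, xmm0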

AVX512:

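There are probably several options here, but I'm not curious enough right now to go digging through the instruction-set list looking for all of them.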

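There is one interesting one worth mentioning, though: VPTERNLOGD/Q can set a register to all-ones instead, with imm8 = 0xFF. (But it has a false dependency on the old value, on current implementations.) Since the compare instructions all compare into a mask, VPTERNLOGD seems to be the best way to set a vector to all-ones on Skylake-AVX512 in my testing, although it doesn't special-case imm8=0xFF to avoid the false dependency.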

VPTERNLOGD zmm0, zmm0,zmm0, 0     ; inputs can be any registers you like.
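A Python sketch of how the VPTERNLOGD immediate works, following Intel's description: for each bit position, the bits of the destination, first source and second source form a 3-bit index into imm8. So imm8 = 0 gives all-zeros and imm8 = 0xFF all-ones, regardless of the inputs. This models one 32-bit element and ignores masking.

```python
def vpternlog_bit(imm8: int, a: int, b: int, c: int) -> int:
    """Bits a (dest), b (src1), c (src2) select bit (a<<2 | b<<1 | c) of imm8."""
    return (imm8 >> ((a << 2) | (b << 1) | c)) & 1

def vpternlogd_element(imm8: int, dest: int, src1: int, src2: int) -> int:
    """Apply the imm8 truth table to every bit of one 32-bit element."""
    out = 0
    for i in range(32):
        bit = vpternlog_bit(imm8, (dest >> i) & 1, (src1 >> i) & 1, (src2 >> i) & 1)
        out |= bit << i
    return out

a, b, c = 0x12345678, 0x9ABCDEF0, 0x0F0F0F0F
print(hex(vpternlogd_element(0x00, a, b, c)), hex(vpternlogd_element(0xFF, a, b, c)))
```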

Zeroing mask registers (k0..k7): mask instructions, and vector compare-into-mask

kxorB/W/D/Q     k0, k0, k0     ; narrow versions zero extend to max_kl
kshiftlB/W/D/Q  k0, k0, 100    ; kshifts don't mask/wrap the 8-bit count
kshiftrB/W/D/Q  k0, k0, 100
kandnB/W/D/Q    k0, k0, k0     ; x & ~x

; compare into mask
vpcmpB/W/D/Q    k0, x/y/zmm0, x/y/zmm0, 3    ; predicate #3 = always false; other predicates are false on equal as well
vpcmpuB/W/D/Q   k0, x/y/zmm0, x/y/zmm0, 3    ; unsigned version

vptestnmB/W/D/Q k0, x/y/zmm0, x/y/zmm0       ; x & ~x test into mask

x87 FP:

Only one choice (because sub doesn't work if the old value was infinity or NaN).

FLDZ    ; push +0.0

score 4 · Accepted Answer

A couple more possibilities:

sub ax, ax

movzx eax, ah

Edit: I should note that movzx doesn't zero all of eax; it just zeros ah (plus the top 16 bits that aren't accessible as a register in themselves).

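As for being fastest, if memory serves, sub and xor are equivalent. They're faster than (most) others because they're common enough that CPU designers added special optimizations for them. Specifically, with a normal sub or xor the result depends on the previous value in the register. The CPU recognizes xor-with-self and subtract-from-self specially, so it knows the dependency chain is broken there. Any instructions after that won't depend on any previous value, so it can execute previous and subsequent instructions in parallel using rename registers.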

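Especially with older processors, we'd expect 'mov reg, 0' to be slower simply because it has an extra 16 bits of data, and most early processors (especially the 8088) were limited primarily by their ability to load the instruction stream from memory. In fact, on an 8088 you can estimate run time pretty accurately without any reference sheets at all, just by paying attention to the number of bytes involved. That does break down for the div and idiv instructions, but that's about it. OTOH, I should probably shut up, since the 8088 really is of little interest to much of anybody (for at least a decade now).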
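A back-of-the-envelope Python sketch of that 8088 rule of thumb, assuming each instruction byte costs one 4-clock bus cycle to fetch and ignoring execution time, the prefetch queue, and data accesses. The byte counts are the encodings from the 8086 table elsewhere on this page.

```python
# Assumption: ~4 clocks (one bus cycle) to fetch each instruction byte on an 8088.
CLOCKS_PER_BYTE = 4

# Encodings as listed in the 8086 table on this page.
ENCODINGS = {
    "xor ax, ax": bytes.fromhex("31C0"),
    "sub ax, ax": bytes.fromhex("29C0"),
    "and ax, 0":  bytes.fromhex("250000"),
    "mov ax, 0":  bytes.fromhex("B80000"),
}

def fetch_clocks(insn: str) -> int:
    """Rough fetch cost: instruction length times the assumed clocks per byte."""
    return len(ENCODINGS[insn]) * CLOCKS_PER_BYTE

for insn in ENCODINGS:
    print(f"{insn:12} {len(ENCODINGS[insn])} bytes ~{fetch_clocks(insn)} clocks to fetch")
```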

score 3 · Accepted Answer

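Of course, specific cases have additional ways to set a register to 0: e.g. if you have eax set to a non-negative integer, you can set edx to 0 with cdq/cltd (this trick is used in a famous 24-byte shellcode, which appears in "Insecure Programming by example").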

score 3 · Accepted Answer

You can set register CX to 0 with LOOP $ (it jumps to itself, decrementing CX each time, until CX reaches 0).

answered 2011-01-31T17:32:23.453
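A Python model of LOOP $ in 16-bit code (a sketch): LOOP decrements CX without touching flags and jumps if the result is nonzero, so the loop always exits with CX = 0. Starting from CX = 0, it wraps around and runs 65536 times.

```python
def loop_dollar(cx: int) -> tuple:
    """Model `loop $` in 16-bit code.

    Each iteration: CX = CX - 1 (flags unchanged); jump back to itself if CX != 0.
    Returns (final CX, number of times LOOP executed).
    """
    executed = 0
    while True:
        cx = (cx - 1) & 0xFFFF
        executed += 1
        if cx == 0:
            return cx, executed

print(loop_dollar(5), loop_dollar(0)[1])
```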

score 3 · Accepted Answer

This thread is old, but here are a few other examples. Simple ones:

xor eax,eax

sub eax,eax

and eax,0

lea eax,[0] ; it doesn't look "natural" in the binary

More complex combinations (two instructions each):

; flip all those 1111... bits to 0000
or  eax,-1  ;  eax = 0FFFFFFFFh
not eax     ; ~eax = 0

; XOR EAX,-1 works the same as NOT EAX instruction in this case, flipping 1 bits to 0
or  eax,-1  ;  eax = 0FFFFFFFFh
xor eax,-1  ; ~eax = 0

; -1 + 1 = 0
or  eax,-1 ;  eax = 0FFFFFFFFh or signed int = -1
inc eax    ;++eax = 0
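A quick Python check over a few sample values (a sketch, with 32-bit wraparound modeled by masking) that each two-instruction combination above ends with eax = 0 whatever eax held before. or eax,-1 uses a sign-extended immediate, so it always produces 0FFFFFFFFh.

```python
M32 = 0xFFFFFFFF

def or_minus1(eax: int) -> int:
    """or eax, -1: the immediate -1 is sign-extended to 0FFFFFFFFh."""
    return (eax | M32) & M32

def or_not(eax: int) -> int:   # or eax,-1 / not eax
    return ~or_minus1(eax) & M32

def or_xor(eax: int) -> int:   # or eax,-1 / xor eax,-1
    return or_minus1(eax) ^ M32

def or_inc(eax: int) -> int:   # or eax,-1 / inc eax (wraps around to 0)
    return (or_minus1(eax) + 1) & M32

samples = (0, 1, 0x7FFFFFFF, 0x80000000, 0xDEADBEEF, M32)
print(all(f(v) == 0 for f in (or_not, or_xor, or_inc) for v in samples))
```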

score 1 · Accepted Answer

According to DEF CON 25 - XlogicX - Assembly Language is Too High Level:

AAD with an immediate base of 0 will always zero AH and leave AL unmodified. From Intel's pseudocode:
AL ← (oldAL + (oldAH ∗ imm8)) AND FFH; AH ← 0;

In asm source:

AAD 0         ; assemblers like NASM accept this

db 0xd5,0x00  ; others may need you to encode it manually
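A Python model of AAD following Intel's pseudocode (AL ← (AL + AH × imm8) AND FFH; AH ← 0), showing that imm8 = 0 keeps AL and clears AH. Flags are not modeled.

```python
def aad(ax: int, imm8: int = 10) -> int:
    """AAD imm8 (D5 ib): returns the new AX; flags are not modeled."""
    al = ax & 0xFF
    ah = (ax >> 8) & 0xFF
    new_al = (al + ah * imm8) & 0xFF
    new_ah = 0
    return (new_ah << 8) | new_al

print(hex(aad(0x1234, 0)))  # AAD 0
```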

Apparently (on some CPUs at least), a 66 operand-size prefix in front of bswap eax (i.e. 66 0F C8, as an attempt to encode bswap ax) zeros AX.

score 1 · Accepted Answer

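In the comments, the OP writes that shifts can't use immediate counts (introduced with the 80186/80286). Therefore the target x86 CPU has to be an 8086/8088. (10 years ago, this question would definitely have been better tagged [8086] than with the recently (5 years?) introduced [x86-16].)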

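The 8086 architecture provides 14 basic program execution registers for use in general system and application programming. These registers can be grouped as follows: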

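• General-purpose registers AX, BX, CX, DX, SI, DI, BP, and SP. These eight registers are available for storing operands and pointers.
• Segment registers CS, DS, ES, and SS. These registers allow addressing more than 64KB of memory.
• FLAGS register. This register reports on the status of the program being executed and allows limited (application-program level) control of the processor.
• IP register. The instruction pointer register contains a 16-bit pointer to the next instruction to be executed.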

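So an answer to the question about clearing a register on x86 can deal with zeroing any of the above registers, except of course the FLAGS register, which is architecturally defined to always hold a 1 in its second bit position (bit 1).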

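What follows is a list of single instructions that can clear a register on the 8086 without relying on any pre-existing condition. The list is in alphabetical order (by mnemonic, ignoring the rep prefix):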

encoding         instruction                register cleared           displacement
--------------   ---------------            -----------------------    ------------
25 00 00         and     ax, 0              AX
83 E0 00         and     ax, 0              AX BX CX DX SI DI BP SP
81 E0 00 00      and     ax, 0              AX BX CX DX SI DI BP SP
E8 -- --         call    0000h              IP                         -($+3)
9A 00 00 xx yy   call    yyxxh:0000h        IP
9A xx yy 00 00   call    0000h:yyxxh        CS
9A 00 00 00 00   call    0000h:0000h  (*)   IP and CS
E9 -- --         jmp     0000h              IP                         -($+3)
EA 00 00 xx yy   jmp     yyxxh:0000h        IP
EA xx yy 00 00   jmp     0000h:yyxxh        CS
EA 00 00 00 00   jmp     0000h:0000h  (*)   IP and CS
8D 06 00 00      lea     ax, [0000h]        AX BX CX DX SI DI BP SP
F3 AC            rep lodsb                  CX
F3 AD            rep lodsw                  CX
E2 FE            loop    $                  CX
B8 00 00         mov     ax, 0              AX BX CX DX SI DI BP SP
C7 C0 00 00      mov     ax, 0              AX BX CX DX SI DI BP SP
F3 A4            rep movsb            (*)   CX
F3 A5            rep movsw            (*)   CX
F3 AA            rep stosb            (*)   CX
F3 AB            rep stosw            (*)   CX
29 C0            sub     ax, ax             AX BX CX DX SI DI BP SP
2B C0            sub     ax, ax             AX BX CX DX SI DI BP SP
31 C0            xor     ax, ax             AX BX CX DX SI DI BP SP
33 C0            xor     ax, ax             AX BX CX DX SI DI BP SP

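This list shows what is technically possible, certainly not what you should use. The instructions marked with (*) are very dangerous or can only be used with care.
It goes without saying that for call and jmp to work, you need executable code at the target location.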

The best way to clear a general-purpose register is xor reg, reg; if you don't want to change any flags, use mov reg, 0.
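The table above uses ax as the placeholder register. A Python sketch (based on the standard 16-bit register numbering and the ModRM layout, with mod=11 for register operands) of how three of the encodings change for the other registers:

```python
# 16-bit register numbers used in ModRM and in B8+rw
REG16 = {"ax": 0, "cx": 1, "dx": 2, "bx": 3, "sp": 4, "bp": 5, "si": 6, "di": 7}

def modrm_regs(reg_field: int, rm_reg: str) -> int:
    """ModRM with mod = 11, i.e. the r/m operand is a register."""
    return 0xC0 | (reg_field << 3) | REG16[rm_reg]

def xor_self(r: str) -> bytes:        # 31 /r with reg = r/m = r
    return bytes([0x31, modrm_regs(REG16[r], r)])

def and_imm8_zero(r: str) -> bytes:   # 83 /4 ib: the reg field holds the /4 opcode extension
    return bytes([0x83, modrm_regs(4, r), 0x00])

def mov_imm16_zero(r: str) -> bytes:  # B8+rw iw: the register is encoded in the opcode byte
    return bytes([0xB8 + REG16[r], 0x00, 0x00])

for r in ("ax", "bx", "di"):
    print(r, xor_self(r).hex(" "), and_imm8_zero(r).hex(" "), mov_imm16_zero(r).hex(" "))
```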

score -2 · Accepted Answer

mov eax,0  
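shl eax,32  ; WRONG: the count is masked to 32 & 31 = 0, so eax is unchanged
shr eax,32  ; WRONG: same masking, also a no-op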
imul eax,0 
sub eax,eax 
xor eax,eax   
and eax,0  
andn eax,eax,eax 

loop $ ;ecx only  
pause  ; WRONG: PAUSE (F3 90, "rep nop") is not a loop and does not modify ecx

;twogether:  
push dword 0    
pop eax

or eax,0xFFFFFFFF  
not eax

xor al,al ;("mov al,0","sub al,al",...)  
movzx eax,al
...
