assembly - x86 程序集 - 将 rax 优化为 [0 .. 限制）

Question

我正在编写一个简单的汇编程序，当然，它旨在尽可能快。然而，位于最嵌套循环中的某个部分似乎并不“正确”，我相信有可能提出更聪明、更快的实现，甚至可能不使用条件跳转。该代码实现了一个简单的事情：

if rax < 0 then rax := 0 else if rax >= r12 then rax := r12 - 1

这是我天真的实现：

cmp rax, 0
jge offsetXGE
   mov rax, 0
   jmp offsetXReady
offsetXGE:
   cmp rax, r12
   jl offsetXReady
   mov rax, r12
   dec rax
offsetXReady:

欢迎任何想法，即使是那些使用 MMX 和一些掩蔽技巧的想法。

编辑：回答评论中的一些问题 - 是的，我们可以假设 r12 > 0 但 rax 可以是负数。

score 5 · Accepted Answer

Branchy：一个比较分支，一个 LEA + cmov。

无分支：1 个 CMP，2 个单 uop CMOV

（或者在@Martijn 的答案中使用相同的技巧完全无分支，如果超出范围的值很常见但不可预测，这可能会很好。）

将标量数据移动到向量 regs 以执行一两条指令，然后将其移回是不值得的。如果您可以一次有效地处理整个向量，那么您可以使用PMINSD/PMAXSD将值钳制在有符号范围内。（或 2x packssdw+packuswb将 4 个输入向量压缩成一个 8 位元素向量，最终无符号饱和。）

在您的原作中，有几件事显然不是最理想的。大多数情况下，前两个只对代码大小很重要，但LEA对于非破坏性添加来说，这是一个小而明显的胜利：

cmp eax, 0 应该test eax, eax
mov rax, 0 应该是xor eax, eax。不，eax不是rax.
mov rax, r12 / dec rax 应该是lea rax, [r12 - 1]。

请参阅x86 wiki中的链接，尤其是。阿格纳雾的指南。

如果钳位不常见，则完全需要它：

您需要一个寄存器（或内存位置）来保存0，否则需要一条额外的指令到mov reg, 0.

    ...
    cmp  rax, r12
    jae  .clamp      ; not-taken fast-path = no clamping
.clamp_finished:

    ...
    ret

.clamp:   
    ; flags still set from the cmp rax, r12
    ; we only get here if rax is >= r12 (`ge` signed compare), or negative (so `l` rax < r12 indicates rax<0)

    ; mov r15d, 0        ; or zero it outside the loop so it can be used when needed.  Can't xor-zero because we need to preserve flags

    lea    rax, [r12-1]  ; still doesn't modify flags
    cmovl  rax, r15      ; rax=0 if  orig_rax<r12 (signed), which means we got here because orig_rax<0
    jmp  .clamp_finished

英特尔 Skylake 的快速性能分析：

快速路径：一个未采用的比较和分支微指令。rax 的延迟：0 个周期。
需要钳位的情况：一个采用比较和分支微指令，再加上 3 个微指令（lea，1 个用于 cmov，1 个用于 jmp 返回。） rax 的延迟：RAX 或 R12 准备就绪后的 2 个周期（cmp + lea 并行，然后 cmov 从 cmp 读取 FLAGS）。

在 Intel Haswell 或更早版本上，cmovl 为 2 微秒，并且在关键路径上花费了一个额外的延迟周期，因此总共 3 个。

显然，您可以使用jb而不是jae跳过夹紧lea/cmov，而不是将它们拉出主流。请参阅下面的部分以了解其动机。（和/或查看 Anatolyg 的出色答案，其中涵盖了这一点。我也得到了一个很酷的技巧，即使用Anatolyg 的答案中的一个分支来完成）jb。[0 .. limit]

我认为使用 jae/cmov 的版本是最好的选择，尽管 cmov 有很多缺点并且并不总是更快。已经需要它的输入操作数，因此即使需要钳位也不会增加太多延迟。

.clamp不需要归零寄存器的代码块的替代分支实现是：

.clamp:
    lea    rax, [r12-1]
    jge  .clamp_finished
    xor    eax, eax
    jmp  .clamp_finished

它仍然计算它可能丢弃的结果，cmov 样式。但是，下面的 xor 会启动一个新的依赖链，因此如果 xor-zeroing 执行，它不必等待lea写入。rax

一个重要的问题是您希望多久使用一次这些分支。 如果有一个常见的情况（例如，没有钳位的情况），请让它成为通过代码的快速路径（尽可能少的指令和尽可能少的分支）。根据不经常使用分支的频率，将不常见情况的代码放在函数末尾可能是值得的。

func:
    ...
    test
    jcc .unlikely
    ...        
.ret_from_unlikely:
    ...
    ... ;; lots of code
    ret

.unlikely:
    xor   eax,eax
    jmp .ret_from_unlikely   ;; this extra jump makes the slow path slower, but that's worth it to make the fast path faster.

Gcc 这样做，我认为当它决定不太可能采用分支时。因此，不是让典型案例采用跳过一些指令的分支，而是让常见案例失败。通常，默认分支预测不会用于前向跳转，因此在看到不太可能的情况之前，它甚至不需要分支预测器条目。

随机想法：代码

if (eax < 0) { eax = 0; }
else if (eax >= r12) { eax := r12 - 1 }  // If r12 can be zero, the else matters

相当于

eax = min(eax, r12-1);
eax = max(eax, 0);

r12不能为负，但 OP 并没有说它不能为零。这种排序保留了 if/else 语义。（编辑：实际上 OP 确实说过你可以假设 r12>0，而不是 >=0。）如果我们在 asm 中有一个快速的 min/max，我们可以在这里使用它。vector-max 是单指令，但标量需要更多代码。

相关代码审查：将整数限制在 0-255 范围内的最快方法。但那是在 C 中，编译器生成的 asm 版本都不是最佳的。尽管如此，它还是提供了一些初步的灵感。

同样相关的是，当前的 clang 使 std::clamp 悲观于存储、选择指针和重新加载。 https://bugs.llvm.org/show_bug.cgi?id=47271

TODO：使用此钳位窥视孔归档错过的优化错误报告，以便编译器可以查找它。

目前，我的版本可以无分支地钳制到[0, limit](两端的封闭范围，因此要限制而不是 limit-1。这需要一些技巧来完成它，同时避免cmova/ cmovbe(与大多数 CMOV 谓词不同，在 Intel 上仍然是 2 uops只读取 CF或一些 SPAZO 标志。）

# gcc -nostdlib -static testloop.S -o testloop && 
# taskset -c 3 perf stat --all-user -etask-clock:u,context-switches:u,cpu-migrations:u,page-faults:u,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u,idq.dsb_uops:u -r1 ./testloop

# or idq.mite_uops to make sure it's low

.intel_syntax noprefix
.global _start
_start:

    mov     edi, 34
    mov     esi, 100
    xor     ecx, ecx

    mov     ebp, 100000000

.p2align 6
.loop:

# ~3.10 cycles latency, with unroll or an imul to give OoO scheduling an easier time. Can be 3.3c in worse cases.
.macro  clamp0n  dst, n, high
    xor    \dst, \dst
    cmp    \n, \high
    cmovg  \dst, \high       # prepare clamped value: n>high [signed] ? high : 0
    cmovbe \dst, \n          # copy original if n <= high  [unsigned], i.e. in range
.endm

# ~4.00 cycles latency, no ILP for 2-uop cmovbe
.macro  clamp0n_rev  dst, n, high
    xor    \dst, \dst
    cmp    \n, \high
    cmovbe \dst, \n          # copy original if n <= high [unsigned]; no clamping
    cmovg  \dst, \high       # high if n>high (replacing 0), else leave orig
.endm

# ~3.00 cycles latency, only single-uop CMOV
.macro  clamp0n_rev_intel  dst, n, high
    xor    \dst, \dst
    cmp    \n, \high
    cmovb  \dst, \n          # copy original if n < high [unsigned]; no clamping.  (cmovbe is 2 uops on Intel, let next insn handle that case)
    cmovge \dst, \high       # high if n>=high (replacing 0), else leave orig.  copy on equal restores the value destroyed by cmovb
.endm

# ~3.1 to 3.3 cycle latency
.macro  clamp0n_inplace_destroy_zero  n, high, tmp
    xor    \tmp, \tmp
    cmp    \n, \high
    cmovg  \tmp, \high       # prepare clamped value: 0 or high, per signed compare.
    cmova  \n, \tmp         # if clamping needed at all, apply clamped value
.endm

# 4.0 cycles latency.
.macro  clamp0n_inplace  n, high, zero
    cmp    \n, \high
    cmova  \n, \zero        # if clamping needed at all, apply 0.  2 uops on Intel.  could be 2nd if we destroy \high?
    cmovg  \n, \high        # if signed greater than limit, clamp to high
.endm

# 3.0 cycles latency, only single uop CMOV.
.macro  clamp0n_inplace_intel  n, high, zero
    cmp    \n, \high
    cmovae  \n, \zero        # if clamping needed at all, apply 0.  (or on equal, to avoid 2-uop cmov)
    cmovge  \n, \high        # if signed greater than limit, clamp to high. (or on equal, to restore the correct value)
.endm

#define CLAMP_INPLACE clamp0n_inplace_intel
#define CLAMP_COPY    clamp0n_rev_intel

   CLAMP_INPLACE  edi, esi, ecx
   CLAMP_INPLACE  edi, esi, ecx
   CLAMP_INPLACE  edi, esi, ecx
   CLAMP_INPLACE  edi, esi, ecx
#   imul     edi, edi

#if 0
//#define clamp0n clamp0n_rev        // use the slow version
   CLAMP_COPY   eax, edi, esi
   and      eax, edi
   imul     edi, eax, 123
#endif

#if 0
   CLAMP_COPY  edi, eax, esi

   CLAMP_COPY  eax, edi, esi
   CLAMP_COPY  edi, eax, esi

   CLAMP_COPY  rax, rdi, rsi    # 64-bit for REX prefixes, keep dec/jnz off a 32-byte boundary so uop cache works (JCC erratum mitigation)
   CLAMP_COPY  rdi, rax, rsi
#endif

#nop  # pad the loop up to 32 total uops.  Tiny benefit on skylake in this artifical fully latency-bound case.
    dec ebp
    jnz .loop

    xor edi,edi
    mov eax,231   # __NR_exit_group  from /usr/include/asm/unistd_64.h
    syscall       # sys_exit_group(0)

score 3 · Accepted Answer

一个常见的技巧（编译器使用它）是进行无符号比较：

    cmp rax, r12
    jb done
    ...
    ...
done:

在这里，如果rax是负数，当解释为无符号数时（通过jb，“如果低于”，则它看起来像一个大数（大于 2 ⁶³），因此无符号比较将两种“异常”情况（小于 0 和太大）。

如果异常情况非常罕见，则表示的代码的性能...无关紧要，通常情况下包含一个条件分支，通常采用。如果你想进一步改进它，你可以像这样重新排列代码：

    cmp rax, r12
    jb work_needed
done:
    (your code continued here)

work_needed:
    jl upper_limit_done
    lea rax, [r12 - 1]
upper_limit_done:
    test rax, rax
    jns lower_limit_done
    xor rax, rax
lower_limit_done:
    jmp done

在这里，通常的路径包含一个通常不采用的分支。这可能会提供一些小的改进，但会以较慢的例外情况为代价。

score 1 · Accepted Answer

编辑：正如 Peter Cordes 所指出的，CMOVxx 使事情变得更短。我在下面包含了我的原始答案，以表明它确实可以在没有 CMOVxx 的情况下完成。

它可以在没有分支的情况下使用条件移动来完成：

lea   rbx, [r12-1]
cmp   rax, r12
cmovge rax, rbx       ; clamp to r12-1 if it was too high
xor   rbx, rbx
test  rax, rax
cmovl rax, rbx        ; clamp to 0 if it's <= 0

或者优化为只比较一次，使用彼得答案中相同的有符号+无符号比较技巧：

xor    edx, edx
lea    rcx, [r12-1]
cmp    rax, r12
cmovae rax, rdx        ; clamp to 0 if unsigned >= r12
cmovge rax, rcx        ; clamp to r12-1 if signed >= r12, replacing 0. Else leave orig RAX

如果rax >=(unsigned) r12，则该值肯定需要被限制为 0 或 r12-1。

对于负 RAX 输入，只有ae条件为真，而不是ge，因此最终结果为 0。但对于过高的有符号正输入，ge将为真。这也没关系，ae第二个cmov将覆盖 0。否则（对于负 RAX），它将单独留下 0。

r12 已知为正数，因此 ge 为真意味着 ae 为真。

在具有 1 周期延迟cmovae/ ge（Broadwell 及更高版本，以及 AMD：https ://uops.info/ ）的 CPU 上，从 RAX 输入到 RAX 输出具有 3 周期延迟。（cmova在现代英特尔上读取 ZF 和 CF 是 2 uops，2 个周期延迟，但幸运的是我们不需要它。）

使用老式（1990 年代）技术的原始答案：

有一个没有任何分支的答案。它涉及更多一点，因为即使不使用它，也会计算每个可能的结果。我不知道它如何将性能方面与分支代码进行比较。但是没有分支惩罚。

避免分支的技巧是通过使用寄存器自身的减法借位将进位标志 (CF) 转换为位掩码。之后，您可以使用该寄存器作为和掩码来选择所需的结果。

; trick #1: rbx made 0 when rax negative, else 0xFFFFFFFFFFFFFFFF

mov rbx, rax
shl rbx, 1      ; negative => CF
cmc             ; CF 0 when rax negative
sbb rbx, rbx    ; this sets CF into all bits of rbx

; trick #2: rcx made 0xFFFFFFFFFFFFFFFF if less than r12, or else 0
; note that r12 should be positive for this to make sense

cmp rax, r12    ; CF 1 when rax < r12
sbb rcx, rcx    ; this sets CF into all bits of rcx

and rax, rcx    ; use rcx mask on rax

; calculate highest allowed value
mov rdx, r12
dec rdx
not rcx         ; invert rcx mask
and rdx, rcx    ; apply to replacement value

or rax, rdx
and rax, rbx

score 1 · Accepted Answer

我认为更少的跳跃，更干净：

xor  rdx, rdx
test rax, rax
js   OK
lea  rdx, [r12 - 1]
cmp  rax, r12
jge  OK
mov  rdx, rax
OK:
mov  rax, rdx

assembly - x86 程序集 - 将 rax 优化为 [0 .. 限制）

4 回答 4

Branchy：一个比较分支，一个 LEA + cmov。

无分支：1 个 CMP，2 个单 uop CMOV

如果钳位不常见，则完全需要它：

Related

Reference