performance - 将 BCD 打包到 DPD：如何改进这个 amd64 汇编程序？

Question

我正在编写一个例程来在BCD（每个十进制数字 4 位）和密集压缩十进制（DPD）（每 3 个十进制数字 10 位）之间进行转换。在Mike Cowlishaw 的网站上进一步记录了 DPD（建议软件使用查找表）。

这个例程只需要它使用的寄存器的低 16 位，但为了更短的指令编码，我尽可能使用 32 位指令。是与以下代码相关的速度惩罚：

mov data,%eax # high 16 bit of data are cleared
...
shl %al
shr %eax

或者

and $0x888,%edi         #   = 0000 a000 e000 i000
imul $0x0490,%di        #   = aei0 0000 0000 0000

其中 16 位的替代方案imul将是 32 位imul和后续and或一系列lea指令和最终and.

我的例程中的整个代码可以在下面找到。由于我混合了 word 和 dword 指令，其中有什么性能比它可能更差的地方吗？

        .section .text
        .type bcd2dpd_mul,@function
        .globl bcd2dpd_mul

        # convert BCD to DPD with multiplication tricks
        # input abcd efgh iklm in edi
        .align 8
bcd2dpd_mul:
        mov %edi,%eax           #   = 0000 abcd efgh iklm
        shl %al                 #   = 0000 abcd fghi klm0
        shr %eax                #   = 0000 0abc dfgh iklm
        test $0x880,%edi        # fast path for a = e = 0
        jz 1f

        and $0x888,%edi         #   = 0000 a000 e000 i000
        imul $0x0490,%di        #   = aei0 0000 0000 0000
        mov %eax,%esi
        and $0x66,%esi          # q = 0000 0000 0fg0 0kl0
        shr $13,%edi            # u = 0000 0000 0000 0aei
        imul tab-8(,%rdi,4),%si # v = q * tab[u-2][0]
        and $0x397,%eax         # r = 0000 00bc d00h 0klm
        xor %esi,%eax           # w = r ^ v
        or tab-6(,%rdi,4),%ax   # x = w | tab[u-2][1]
        and $0x3ff,%eax         #   = 0000 00xx xxxx xxxx
1:      ret

        .size bcd2dpd_mul,.-bcd2dpd_mul

        .section .rodata
        .align 4
tab:
        .short 0x0011 ; .short 0x000a
        .short 0x0000 ; .short 0x004e
        .short 0x0081 ; .short 0x000c
        .short 0x0008 ; .short 0x002e
        .short 0x0081 ; .short 0x000e
        .short 0x0000 ; .short 0x006e
        .size tab,.-tab

改进的代码

在应用了答案和评论中的一些建议以及其他一些技巧之后，这是我改进的代码。

        .section .text
        .type bcd2dpd_mul,@function
        .globl bcd2dpd_mul

        # convert BCD to DPD with multiplication tricks
        # input abcd efgh iklm in edi
        .align 8
bcd2dpd_mul:
        mov %edi,%eax           #   = 0000 abcd efgh iklm
        shl %al                 #   = 0000 abcd fghi klm0
        shr %eax                #   = 0000 0abc dfgh iklm
        test $0x880,%edi        # fast path for a = e = 0
        jnz 1f
        ret

        .align 8
1:      and $0x888,%edi         #   = 0000 a000 e000 i000
        imul $0x49,%edi         #   = 0ae0 aei0 ei00 i000
        mov %eax,%esi
        and $0x66,%esi          # q = 0000 0000 0fg0 0kl0
        shr $8,%edi             #   = 0000 0000 0ae0 aei0
        and $0xe,%edi           #   = 0000 0000 0000 aei0
        movzwl lookup-4(%rdi),%edx
        movzbl %dl,%edi
        imul %edi,%esi          # v = q * tab[u-2][0]
        and $0x397,%eax         # r = 0000 00bc d00h 0klm
        xor %esi,%eax           # w = r ^ v
        or %dh,%al              #   = w | tab[u-2][1]
        and $0x3ff,%eax         #   = 0000 00xx xxxx xxxx
        ret

        .size bcd2dpd_mul,.-bcd2dpd_mul

        .section .rodata
        .align 4
lookup:
        .byte 0x11
        .byte 0x0a
        .byte 0x00
        .byte 0x4e
        .byte 0x81
        .byte 0x0c
        .byte 0x08
        .byte 0x2e
        .byte 0x81
        .byte 0x0e
        .byte 0x00
        .byte 0x6e
        .size lookup,.-lookup

score 8 · Accepted Answer

TYVM 用于清楚地注释代码，顺便说一句。它使弄清楚发生了什么以及位在哪里变得非常容易。我以前从未听说过DPD，所以从未注释的代码和维基百科的文章中将它弄糊涂会很糟糕。

但是，您`imul`的直接操作数是有问题的。

在 Intel 上解码时会导致 LCP 停顿，它会写入寄存器的低 16 位，稍后以全宽读取。如果不屏蔽它的upper16将是一个问题（因为它被用作表索引）。它的操作数足够大，它们会将垃圾放入upper16，因此确实需要丢弃它。

我认为你的做法对于某些架构来说是最佳的，但事实证明imul r16,r16,imm16它本身比imul r32,r32,imm32除 VIA Nano、AMD K7（它比 imul32 快）和 Intel P6（从 32 位/64 位使用它）之外的所有架构都慢模式将 LCP 失速，并且部分注册减速是一个问题）。

在英特尔 SnB 系列 CPU 上，imul r16,r16,imm16两个微指令在哪里，imul32/movzx 会更好，除了代码大小之外没有任何缺点。在 P6 系列 CPU（即 PPro 到 Nehalem）上，imul r16,r16,imm16是一个 uop，但那些 CPU 没有 uop 缓存，所以 LCP 停顿可能很关键（除了 Nehalem 在一个紧密的循环中调用它，适合 28 uop循环缓冲区）。对于那些 CPU，movzx从部分注册失速的角度来看，显式可能更好。Agner Fog 说在 CPU 插入合并 uop 时有一个额外的循环，这可能意味着一个单独发出额外 uop 的循环。

在 AMD K8-Steamroller 上，imul imm16是 2 个 m-ops 而不是 1 个imul imm32，所以imul32/movzx大约等于imul16那里。他们不会遭受 LCP 停顿或部分注册问题的困扰。

在 Intel Silvermont 上，imul imm16为 2 微指令（每 4 个时钟吞吐量 1 个），imul imm32而 1 微指令（每 1 个时钟吞吐量 1 个）。在 Atom（Silvermont 的有序前身）上也是如此：imul16是一个额外的 uop，而且速度要慢得多。在大多数其他微架构上，吞吐量并不差，只是延迟。

因此，如果您愿意以字节为单位增加代码大小以提高速度，您应该使用 32bitimul和movzwl %di, %edi. imul imm16在某些架构上，这将与. 在 AMD 推土机系列上可能会稍差一些，显然，它不太擅长同时使用两个整数执行单元，因此 EX1 的 2 m-op 指令可能比两个 1 m-op 指令更好，其中一个它们仍然是仅 EX1 指令。如果您在乎，请对此进行基准测试。

tab至少对齐32B 边界，因此您的 32imul位or可以从其中的任何 2B 对齐条目进行 4B 加载，而不会越过缓存线边界。未对齐的访问对所有最近的 CPU（Nehalem 和更高版本，以及最近的 AMD）没有任何惩罚，只要它们不跨越两个高速缓存行。

使从表中读取的操作为 32 位避免了 Intel CPU 的部分寄存器损失。AMD CPU 和 Silvermont 不会单独跟踪部分寄存器，因此即使是只写到 low16 的指令也必须等待其余寄存器中的结果。这会阻止 16 位 insns 破坏依赖链。英特尔 P6 和 SnB 微架构系列跟踪部分 regs。Haswell 会做完整的双重簿记之类的，因为在需要合并时没有惩罚，比如在你转移 al 之后，然后转移 eax。SnB 将在此处插入一个额外的微指令，并且在执行此操作时可能会受到一两个周期的惩罚。我不确定，也没有测试过。但是，我没有看到避免这种情况的好方法。

shl %al可以替换为add %al, %al. 这可以在更多端口上运行。可能没有区别，因为 port0/5（或 Haswell 及更高版本上的 port0/6）可能没有饱和。它们对位具有相同的效果，但设置标志的方式不同。否则它们可以被解码为相同的微指令。

更改：将 pext/pdep / vectorize 版本拆分为单独的答案，部分原因是它可以拥有自己的评论线程。

score 4 · Accepted Answer

（我将 BMI2 版本拆分为单独的答案，因为它最终可能完全不同）

在看到你用它imul/shr来获取表索引之后，我可以看到你可以在哪里使用 BMI2pextr来替换and/imul/shr，或者 BMI1bextr来替换shr（允许使用 imul32，而不是 imul16，因为你只需提取位你想要的，而不是需要从 upper16 移零）。有 BMI1 的 AMD CPU，但即使是压路机也没有 BMI2。Intel与Haswell同时推出了BMI1和BMI2。

您可以使用 64 位 pextr 一次处理两个或四个 16 位字。但不适用于整个算法：您不能进行 4 次并行表查找。（AVX2 VPGATHERDD 在这里不值得使用。）实际上，您可以使用它pshufb来实现索引高达 4 位的 LUT，见下文。

小改进版本：

.section .rodata
  # This won't won't assemble, written this way for humans to line up with comments.

extmask_lobits:     .long           0b0000 0111 0111 0111
extmask_hibits:     .long           0b0000 1000 1000 1000

# pext doesn't have an immediate-operand form, but it can take the mask from a memory operand.
# Load these into regs if running in a tight loop.

#### TOTALLY UNTESTED #####
.text
.p2align 4,,10
bcd2dpd_bmi2:
#       mov   %edi,%eax           #   = 0000 abcd efgh iklm
#       shl   %al                 #   = 0000 abcd fghi klm0
#       shr   %eax                #   = 0000 0abc dfgh iklm

        pext  extmask_lobits, %edi, %eax
                                #   = 0000 0abc dfgh iklm
        mov   %eax, %esi        # insn scheduling for 4-issue front-end: Fast-path is 4 fused-domain uops
          # And doesn't waste issue capacity when we're taking the slow path.  CPUs with mov-elimination won't waste execution units from issuing an extra mov
        test  $0x880, %edi        # fast path for a = e = 0
        jnz .Lslow_path
        ret

.p2align 4
.Lslow_path:
        # 8 uops, including the `ret`: can issue in 2 clocks.

        # replaces and/imul/shr
        pext  extmask_hibits, %edi, %edi #u= 0000 0000 0000 0aei
        and   $0x66, %esi                # q = 0000 0000 0fg0 0kl0
        imul  tab-8(,%rdi,4), %esi       # v = q * tab[u-2][0]
        and   $0x397, %eax               # r = 0000 00bc d00h 0klm
        xor   %esi, %eax                 # w = r ^ v
        or    tab-6(,%rdi,4), %eax       # x = w | tab[u-2][1]
        and   $0x3ff, %eax               #   = 0000 00xx xxxx xxxx
        ret

当然，如果使它成为一个内联汇编，而不是一个独立的函数，你会改回快速路径分支到最后，而慢速路径则通过。而且您也不会在对齐填充中间功能上浪费空间。

对于函数的其余部分，使用 pextr 和/或 pdep 的范围可能更大。

我正在考虑如何使用 BMI2 做得更好。我认为我们可以aei从打包到 64b 中的四个短裤中获得多个选择器，然后使用pdep它们将它们存放在不同字节的低位中。然后movq到一个向量寄存器，在那里你用它作为一个洗牌控制掩码pshufb来做多个 4 位 LUT 查找。

所以我们一次可以从 60 个 BCD 位增加到 50 个 DPD 位。（用于shrd在寄存器之间移动位以处理对字节可寻址存储器的加载/存储。）

实际上，48 BCD 位（每组 4 组 12 位）-> 40 DPD 位可能要容易得多，因为您可以使用 pdep 将其解压缩为 64b 整数寄存器中的 4 组 16 位。处理 5 个组的选择器很好，您可以使用解包pmovzx，但处理其余数据需要在向量寄存器中进行位混洗。即使是缓慢的 AVX2 可变移位 insns 也无法轻松做到这一点。（尽管考虑如何使用 BMI2 实现这一点可能很有趣，但对于仅使用 SSSE3（即每个相关 CPU）或 SSE4.1 的 CPU 的大幅加速。）

这也意味着我们可以将 4 组的两个集群放入 128b 寄存器的低半部分和高半部分，以获得更高的并行性。

作为奖励，48bits 是一个整数字节，因此从 BCD 数字缓冲区读取不需要任何shrdinsns 将剩余的 4 位从最后 64b 获取到下一个低 4 位。或者当 4 个被忽略的位是 64b 的低 4 位或高 4 位时，两个偏移 pextr 掩码起作用。无论如何，我认为一次做 5 个组是不值得考虑的。

完整的 BMI2 / AVX pshufb LUT 版本（可矢量化）

数据移动可能是：

ignored | group 3        | group 2        | group 1        |  group 0
16bits  | abcd efgh iklm | abcd efgh iklm | abcd efgh iklm | abcd efgh iklm

         3   2   1 | 0
pext -> aei|aei|aei|aei  # packed together in the low bits

          2  |      1            |        0
pdep ->  ... |0000 0000 0000 0aei|0000 0000 0000 0aei  # each in a separate 16b word

movq -> xmm vector register.
 (Then pinsrq another group of 4 selectors into the upper 64b of the vector reg).  So the vector part can handle 2 (or AVX2: 4) of this at once

vpshufb xmm2 -> map each byte to another byte (IMUL table)
vpshufb xmm3 -> map each byte to another byte (OR table)


Get the bits other than `aei` from each group of 3 BCD digits unpacked from 48b to 64b, into separate 16b words:

                  group 3       | group 2             | group 1             |  group 0
pdep(src)-> 0000 abcd efgh iklm | 0000 abcd efgh iklm | 0000 abcd efgh iklm | 0000 abcd efgh iklm

 movq this into a vector reg (xmm1).  (And repeat for the next 48b and pinsrq that to the upper64)

VPAND  xmm1, mask  (to zero aei in each group)

 Then use the vector-LUT results:
VPMULLW xmm1, xmm2 -> packed 16b multiply, keeping only the low16 of the result

VPAND   xmm1,  mask
VPXOR   xmm1, something
VPOR    xmm1, xmm3

movq / pextrq back to integer regs

pext to pack the bits back together
  You don't need the AND 0x3ff or equivalent:
  Those bits go away when you pext to pack each 16b down to 10b

shrd or something to pack the 40b results of this into 64b chunks for store to memory.
  Or: 32b store, then shift and store the last 8b, but that seems lame
  Or: just do 64b stores, overlapping with the previous.  So you write 24b of garbage every time.  Take care at the very end of the buffer.

使用 128b SSE 指令的 AVX 3 操作数版本，以避免需要movdqa不覆盖pshufb. 只要你从不运行 256b AVX 指令，你就不需要弄乱vzeroupper. v但是，如果您使用任何向量指令，您也可以使用所有向量指令的(VEX) 版本。在 VM 中，您可能在具有 BMI2 但不支持 AVX 的虚拟 CPU 上运行，因此很有可能。检查两个 CPU 功能标志仍然是一个好主意，而不是在看到 BMI2 时假设 AVX（即使这对于当前存在的所有物理硬件都是安全的）。

这开始看起来非常有效。即使您没有 BMI2 pext/pdep 来进行位打包/解包，也可能值得在向量 regs 中执行 mul/xor/ 和东西。我想您可以使用现有的非 BMI 标量路由之类的代码来获取选择器，并掩码/移位/或可以将非选择器数据构建成 16b 块。或者也许shrd是为了将数据从一个 reg 转移到另一个？

performance - 将 BCD 打包到 DPD：如何改进这个 amd64 汇编程序？

改进的代码

2 回答 2

但是，您imul的直接操作数是有问题的。

小改进版本：

完整的 BMI2 / AVX pshufb LUT 版本（可矢量化）

Related

Reference

但是，您`imul`的直接操作数是有问题的。