
I am writing a library in x86-64 assembly language that provides all the conventional bitwise, shift, logical, compare, arithmetic, and math functions for the signed integer types s0128, s0256, s0512, s1024, s2048, and s4096, and the floating-point types f0128, f0256, f0512, f1024, f2048, and f4096.

Now I am writing some type-conversion routines, and I have run into something that should be trivial but takes far more instructions than I expected. I feel like I must be missing something (some instruction) that would make this easier, but no luck so far.

The low 128 bits of the s0256 result are simply a copy of the s0128 input argument, and every bit in the high 128 bits of the s0256 result must be set to the most-significant bit of the s0128 input argument.
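In C terms the conversion is one arithmetic shift plus a few stores; a minimal sketch of the intended semantics (the struct layouts and the function name here are assumptions for illustration, not the library's actual definitions):

```c
#include <stdint.h>

typedef struct { uint64_t w[2]; } s0128;  /* w[0] = low 64 bits, w[1] = high 64 bits */
typedef struct { uint64_t w[4]; } s0256;  /* little-endian word order */

/* Hypothetical reference version: copy the low 128 bits, then fill the
   high 128 bits with copies of the input's most-significant (sign) bit.
   Right-shifting a negative int64_t is arithmetic on all mainstream
   compilers, though strictly implementation-defined in C. */
void s0256_from_s0128(s0256 *dst, const s0128 *src) {
    uint64_t sign = (uint64_t)((int64_t)src->w[1] >> 63); /* 0 or all-ones */
    dst->w[0] = src->w[0];
    dst->w[1] = src->w[1];
    dst->w[2] = sign;
    dst->w[3] = sign;
}
```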

Simple, right? Well, the code below is the best s0256 = s0256(s0128) conversion I have come up with so far. Ignore the first 4 lines (they are just argument error checks) and the last 2 lines (returning from the function with no error (rax == 0)). The 5 lines in the middle are the algorithm in question. I try to avoid [conditional] jump instructions.

.text
.align 64
big_m63:
.quad  -63, -63                       # two shift counts for vpshaq instruction

big_s0256_eq_s0128:    # (s0256* arg0, const s0128* arg1); # s0256 = s0256(s0128)
  orq        %rdi, %rdi               # is arg0 a valid address ???
  jz         error_argument_invalid   # nope
  orq        %rsi, %rsi               # is arg1 a valid address ???
  jz         error_argument_invalid   # nope

  vmovapd    (%rsi), %xmm0            # ymm0 = arg1.ls64 : arg1.ms64 : 0 : 0
  vmovhlps   %xmm0, %xmm0, %xmm1      # ymm1 = arg1.ms64 : arg1.ms64 : 0 : 0
  vpshaq     big_m63, %xmm1, %xmm1    # ymm1 = arg1.sign : arg1.sign : 0 : 0
  vperm2f128 $32, %ymm1, %ymm0, %ymm0 # ymm0 = arg1.ls64 : arg1.ms64 : sign : sign
  vmovapd    %ymm0, (%rdi)            # arg0 = arg1 (sign-extended to 256-bits)

  xorq       %rax, %rax               # rax = 0 == no error
  ret                                 # return from function

The routine is also suboptimal in that every instruction requires the result of the previous instruction, which prevents any parallel execution of the instructions.

Is there a better instruction for a right shift with sign extension? I cannot find a version of vpshaq that accepts an immediate byte to specify the shift count, though I don't know why (many SIMD instructions take an immediate 8-bit operand for various purposes). Also, Intel does not support vpshaq (it is an AMD XOP instruction). Ouch!

But look! StephenCanon has a brilliant solution to this problem below! Amazing! His solution has one more instruction than the above, but the vpxor instruction can be placed right after the first vmovapd instruction, so it should effectively take no more cycles than the 5-instruction version above. Bravo!

For completeness and easy comparison, here is the code with the latest StephenCanon enhancement:

.text
.align 64
big_s0256_eq_s0128:    # (s0256* arg0, const s0128* arg1); # s0256 = s0256(s0128)
  orq        %rdi, %rdi               # is arg0 a valid address ???
  jz         error_argument_invalid   # nope
  orq        %rsi, %rsi               # is arg1 a valid address ???
  jz         error_argument_invalid   # nope

  vmovapd    (%rsi), %xmm0            # ymm0 = arg1.ls64 : arg1.ms64 : 0 : 0
  vpxor      %xmm2, %xmm2, %xmm2      # ymm2 = 0 : 0 : 0 : 0
  vmovhlps   %xmm0, %xmm0, %xmm1      # ymm1 = arg1.ms64 : arg1.ms64 : 0 : 0
  vpcmpgtq   %xmm1, %xmm2, %xmm1      # ymm1 = arg1.sign : arg1.sign : 0 : 0
  vperm2f128 $32, %ymm1, %ymm0, %ymm0 # ymm0 = arg1.ls64 : arg1.ms64 : sign : sign
  vmovapd    %ymm0, (%rdi)            # arg0 = arg1 (sign-extended to 256-bits)

  xorq       %rax, %rax               # rax = 0 == no error
  ret                                 # return from function

I am not certain, but not needing to read those two 64-bit shift counts from memory may also speed the code up slightly.


1 Answer


You're over-complicating things. Once you have the sign in rax, just do two 64b stores from there instead of trying to assemble the result in ymm0. One less instruction and a much shorter dependency chain.

As the destination type gets larger, of course, it makes sense to use wider stores (AVX). With AVX2 you can use vbroadcastq to do the splat more efficiently, but it looks like you're targeting baseline AVX?

I should also note that once you get to ~512b integers, for most algorithms the cost of super-linear operations like multiplication so completely dominates the running time that squeezing every last cycle out of operations like sign extension rapidly starts to lose value. It's a good exercise, but ultimately not the most productive use of your time once your implementations are "good enough".


After further thought, I have the following suggestion:

vmovhlps  %xmm0, %xmm0, %xmm1 // could use a permute instead to stay in integer domain.
vpxor     %xmm2, %xmm2, %xmm2
vpcmpgtq  %xmm1, %xmm2, %xmm2 // generate sign-extension without shift

This has the virtues of (a) not requiring a constant load and (b) working on both Intel and AMD. The xor to generate zero looks like an extra instruction, but in practice this zeroing idiom doesn’t even require an execute slot on recent processors.
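The compare-against-zero trick is easy to verify in scalar C; a small sketch of what vpcmpgtq with a zeroed register computes per 64-bit lane (the function name is made up for illustration):

```c
#include <stdint.h>

/* Per-lane effect of vpcmpgtq against a zeroed register:
   (0 > x) produces all-ones when x is negative, zero otherwise. */
int64_t sign_mask(int64_t x) {
    return (int64_t)0 > x ? -1 : 0;
}
```

This yields the same result as an arithmetic right shift by 63 but needs no shift-count constant, which is exactly why it removes the memory load in the routine above.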


FWIW, if targeting AVX2, I might write it like this:

vmovdqa (%rsi),        %xmm0 // { x0, x1, 0,  0  }
vpermq   $0x5f, %ymm0, %ymm1 // { 0,  0,  x1, x1 }
vpxor    %ymm2, %ymm2, %ymm2 // { 0,  0,  0,  0  }
vpcmpgtq %ymm1, %ymm2, %ymm2 // { 0,  0,  s,  s  } s = sign extension
vpor     %ymm2, %ymm0, %ymm0 // { x0, x1, s,  s  }
vmovdqa  %ymm0,       (%rdi)

Unfortunately, I don’t think that vpermq is available on AMD.
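For readers tracing the lane comments above, the AVX2 sequence can be emulated lane by lane in plain C (a hypothetical data-flow sketch, not code from the answer):

```c
#include <stdint.h>

/* Lane-level trace of the AVX2 sequence: x0/x1 are the two 64-bit words
   of the s0128 input, out[] is the four-word s0256 result. */
void trace_avx2_sext(uint64_t x0, uint64_t x1, uint64_t out[4]) {
    uint64_t ymm0[4] = { x0, x1, 0, 0 };       /* vmovdqa: 128-bit load zeroes the upper lanes */
    uint64_t ymm1[4] = { ymm0[3], ymm0[3],     /* vpermq $0x5f selects lanes 3,3,1,1 */
                         ymm0[1], ymm0[1] };
    uint64_t ymm2[4];
    for (int i = 0; i < 4; i++)                /* vpcmpgtq vs. zero: (0 > lane) */
        ymm2[i] = ((int64_t)0 > (int64_t)ymm1[i]) ? ~0ULL : 0;
    for (int i = 0; i < 4; i++)                /* vpor: upper lanes of ymm0 are zero,
                                                  so OR simply inserts the sign mask */
        out[i] = ymm0[i] | ymm2[i];
}
```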

Answered 2014-01-12T16:44:03.857