performance - Arithmetic shift using Intel intrinsics

Question

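I have a set of bits, for example: 1000 0000 0000 0000. This is 16 bits, so it is a short. I want to use an arithmetic shift so that the MSB is propagated into the remaining bits:

1111 1111 1111 1111

And if I start with 0000 0000 0000 0000:

After the arithmetic shift I would still have: 0000 0000 0000 0000

How can I do this without resorting to assembly? I looked through the Intel intrinsics guide, and it looks like I need to use the AVX extensions to do this, but those operate on data types larger than my short.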

score 3 · Accepted Answer

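As mattnewport says in his answer, plain C code can do this job efficiently, although it may rely on "implementation-defined" behavior. This answer shows how to avoid the implementation-defined behavior while keeping the generated code just as efficient.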

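Because the question is about shifting a 16-bit operand, the worry over whether the implementation sign-extends or zero-fills can be avoided by first sign-extending the operand to 32 bits. The 32-bit value can then be shifted right as an unsigned value and finally truncated back to 16 bits.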
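The three steps can be traced one at a time. The following is a minimal sketch, not the answer's own code: `widen16` and `sar16_bits` are hypothetical helper names, and the result is returned as an unsigned 16-bit pattern so that the sketch itself involves no signed narrowing conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Step 1: widen with sign extension. Converting a negative value to
   uint32_t is defined as reduction modulo 2^32, so the upper 16 bits
   become copies of the sign bit. */
static uint32_t widen16(int16_t val)
{
    return (uint32_t)val;
}

/* Steps 2 and 3: logical right shift, then keep only the low 16 bits.
   The sign copies in the upper half are shifted into the lower half. */
static uint16_t sar16_bits(int16_t val, int count)
{
    assert(count >= 0 && count <= 16);   /* only 16 sign copies are available */
    uint32_t shifted = widen16(val) >> count;
    return (uint16_t)(shifted & 0xFFFFu);
}
```

For the bit pattern 0x8000 and a count of 4, the widened value is 0xFFFF8000, the shifted value is 0x0FFFF800, and the low 16 bits are 0xF800.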

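Mattnewport's code actually sign-extends the 16-bit operand to an int (32 or 64 bits, depending on the compiler model) before shifting. That is because the language specification (C99 6.5.7, Bitwise shift operators) requires it as a first step: the integer promotions are performed on each operand. Likewise, the result of mattnewport's code is an int, because the type of the result is that of the promoted left operand. The variation that avoids implementation-defined behavior therefore generates the same number of instructions as mattnewport's original code.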
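The promotion can be observed directly. This is a small sketch that assumes a C11 compiler (for `_Generic`) and a platform where int is wider than 16 bits; the macro `IS_INT` and both function names are made up for illustration.

```c
#include <stdint.h>

/* Evaluates to 1 when the expression has type int, 0 otherwise. */
#define IS_INT(expr) _Generic((expr), int: 1, default: 0)

/* The shift expression has the type of the promoted left operand. */
static int shift_result_is_int(void)
{
    int16_t val = 1;
    int count = 1;
    return IS_INT(val >> count);
}

/* The operand on its own is still a 16-bit type, not int. */
static int operand_is_int(void)
{
    int16_t val = 1;
    return IS_INT(val);
}
```

On such a platform `shift_result_is_int()` yields 1 and `operand_is_int()` yields 0, which is why the signed version already pays for a widening step before the shift.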

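To avoid the implementation-defined behavior, the implicit promotion to a signed int is replaced with an explicit promotion to an unsigned int. This removes the implementation-defined right shift of a negative value while keeping the same code efficiency. Note that C99 6.3.1.3 still leaves the final narrowing conversion back to a signed type implementation-defined when the value does not fit, although common compilers simply wrap.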

The idea can be extended to cover 32-bit operands, and it works efficiently when native 64-bit integer support is available. Here is an example:

// use standard 'll' for long long print format
#define __USE_MINGW_ANSI_STDIO 1
#include <stdio.h>
#include <stdint.h>

// Code provided by mattnewport
int16_t aShiftRight16x (int16_t val, int count)
    {
    return val >> count;
    }

// This variation avoids implementation defined behavior
int16_t aShiftRight16y (int16_t val, int count)
    {
    uint32_t uintVal = val;
    uint32_t uintResult = uintVal >> count;
    return (int16_t) uintResult;
    }

// A 32-bit arithmetic right shift without implementation defined behavior
int32_t aShiftRight32 (int32_t val, int count)
    {
    uint64_t uint64Val = val;
    uint64_t uint64Result = uint64Val >> count;
    return (int32_t) uint64Result;
    }

int main (void)
    {
    int16_t val16 = 0x8000;
    int32_t val32 = 0x80000000;
    int count;

    for (count = 0; count <= 15; count++)
        printf ("%04hX %04hX %08X\n", aShiftRight16x (val16, count),
                                      aShiftRight16y (val16, count),
                                      aShiftRight32  (val32, count));
    return 0;
    }

Here is the gcc 4.8.1 x64 code generation:

  0000000000000030 <aShiftRight16x>:
    30: 0f bf c1                movsx  eax,cx
    33: 89 d1                   mov    ecx,edx
    35: d3 f8                   sar    eax,cl
    37: c3                      ret    

  0000000000000040 <aShiftRight16y>:
    40: 0f bf c1                movsx  eax,cx
    43: 89 d1                   mov    ecx,edx
    45: d3 e8                   shr    eax,cl
    47: c3                      ret    

  0000000000000050 <aShiftRight32>:
    50: 48 63 c1                movsxd rax,ecx
    53: 89 d1                   mov    ecx,edx
    55: 48 d3 e8                shr    rax,cl
    58: c3                      ret

Here is the MS Visual Studio x64 code generation:

  aShiftRight16x:
    00: 0F BF C1           movsx       eax,cx
    03: 8B CA              mov         ecx,edx
    05: D3 F8              sar         eax,cl
    07: C3                 ret
  aShiftRight16y:
    10: 0F BF C1           movsx       eax,cx
    13: 8B CA              mov         ecx,edx
    15: D3 E8              shr         eax,cl
    17: C3                 ret
  aShiftRight32:
    20: 48 63 C1           movsxd      rax,ecx
    23: 8B CA              mov         ecx,edx
    25: 48 D3 E8           shr         rax,cl
    28: C3                 ret

Program output:

8000 8000 80000000
C000 C000 C0000000
E000 E000 E0000000
F000 F000 F0000000
F800 F800 F8000000
FC00 FC00 FC000000
FE00 FE00 FE000000
FF00 FF00 FF000000
FF80 FF80 FF800000
FFC0 FFC0 FFC00000
FFE0 FFE0 FFE00000
FFF0 FFF0 FFF00000
FFF8 FFF8 FFF80000
FFFC FFFC FFFC0000
FFFE FFFE FFFE0000
FFFF FFFF FFFF0000
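The table above covers a single input value. As a rough sanity check, the two 16-bit variants can be compared over every input and every shift count from 0 to 15. This is only a sketch: the function names are hypothetical, and it assumes a compiler (such as gcc, clang, or MSVC) where `>>` on a negative value is arithmetic and narrowing to `int16_t` wraps.

```c
#include <stdint.h>

/* Signed form: relies on the compiler's treatment of negative operands. */
static int16_t shift_signed(int16_t val, int count)
{
    return (int16_t)(val >> count);
}

/* Unsigned form: sign-extend to 32 bits, shift logically, truncate. */
static int16_t shift_unsigned(int16_t val, int count)
{
    uint32_t widened = (uint32_t)val;
    return (int16_t)(widened >> count);
}

/* Counts the (value, count) pairs on which the two forms disagree. */
static long count_mismatches(void)
{
    long mismatches = 0;
    for (long v = INT16_MIN; v <= INT16_MAX; v++)
        for (int count = 0; count <= 15; count++)
            if (shift_signed((int16_t)v, count) != shift_unsigned((int16_t)v, count))
                mismatches++;
    return mismatches;
}
```

Under those assumptions `count_mismatches()` should return 0, since the 16 sign copies in the upper half of the widened value are enough to fill the lower half for any count up to 16.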

score 2 · Accepted Answer

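I'm not sure why you are looking for intrinsics for this. Why not just use the ordinary C++ right shift? The behavior is implementation-defined, but AFAIK on Intel platforms it will always sign-extend.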

int16_t val = 1 << 15; // 1000 0000 0000 0000 
int16_t shiftVal = val >> 15; // 1111 1111 1111 1111
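If relying on the implementation-defined behavior is a concern, the assumption can be checked at run time. A minimal sketch, with hypothetical function names, that tests whether the compiler in use sign-extends and then fills a 16-bit value with its MSB:

```c
#include <stdint.h>

/* Returns 1 if this compiler sign-extends when right-shifting a
   negative int, which is what the plain shift above depends on. */
static int right_shift_is_arithmetic(void)
{
    return (-1 >> 1) == -1;
}

/* Fills all 16 bits with the MSB: -1 (all ones) for negative input,
   0 otherwise, assuming the check above holds. */
static int16_t fill_with_msb(int16_t val)
{
    return (int16_t)(val >> 15);
}
```

The shift result here is either 0 or -1, both of which fit in `int16_t`, so the narrowing conversion in `fill_with_msb` does not change the value.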
