c - _mm_crc32_u64 is poorly defined

Question

Why in the world was _mm_crc32_u64(...) defined like this?

unsigned __int64 _mm_crc32_u64( unsigned __int64 crc, unsigned __int64 v );

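The "crc32" instruction always accumulates a 32-bit CRC, never a 64-bit CRC (it is, after all, CRC32, not CRC64). If the machine instruction CRC32 happens to have a 64-bit destination operand, the upper 32 bits are ignored and filled with 0s on completion, so the destination is never truly 64 bits wide. I understand why Intel allowed a 64-bit destination operand on the instruction (for uniformity), but if I want to process data quickly, I want a source operand as large as possible (i.e. 64 bits if I have that much data left, smaller for the tail end) and always a 32-bit destination operand. But the intrinsics don't allow a 64-bit source with a 32-bit destination. Note the other intrinsics: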

unsigned int _mm_crc32_u8 ( unsigned int crc, unsigned char v );

The type of "crc" is not an 8-bit type, and neither is the return type; they are 32 bits. Why is there no

unsigned int _mm_crc32_u64 ( unsigned int crc, unsigned __int64 v );

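? The Intel instruction supports this, and it is the intrinsic that makes the most sense.

Does anyone have portable code (Visual Studio and GCC) that implements the latter? Thanks. My guess is something like this: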

#define CRC32(D32,S) __asm__("crc32 %0, %1" : "+xrm" (D32) : ">xrm" (S))

for GCC, and

#define CRC32(D32,S) __asm { crc32 D32, S }

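for Visual Studio. Unfortunately, I have little understanding of how constraints work, and little experience with the syntax and semantics of assembly-level programming.

Small edit: note the macros I've defined: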

#define GET_INT64(P) *(reinterpret_cast<const uint64* &>(P))++
#define GET_INT32(P) *(reinterpret_cast<const uint32* &>(P))++
#define GET_INT16(P) *(reinterpret_cast<const uint16* &>(P))++
#define GET_INT8(P)  *(reinterpret_cast<const uint8 * &>(P))++


#define DO1_HW(CR,P) CR =  _mm_crc32_u8 (CR, GET_INT8 (P))
#define DO2_HW(CR,P) CR =  _mm_crc32_u16(CR, GET_INT16(P))
#define DO4_HW(CR,P) CR =  _mm_crc32_u32(CR, GET_INT32(P))
#define DO8_HW(CR,P) CR = (_mm_crc32_u64((uint64)CR, GET_INT64(P))) & 0xFFFFFFFF;

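Notice how different the last macro statement is. The lack of uniformity certainly suggests that the intrinsic was not defined sensibly. Although the explicit (uint64) cast in the last macro is not necessary, the conversion is implicit and does happen anyway. Disassembling the generated code shows code for both the 32->64 and the 64->32 conversion, both of which are unnecessary.

In other words, it is _mm_crc32_u64, not _mm_crc64_u64, but it has been implemented as if it were the latter.

If I could get the definition of CRC32 above right, then I would want to change my macros to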

#define DO1_HW(CR,P) CR = CRC32(CR, GET_INT8 (P))
#define DO2_HW(CR,P) CR = CRC32(CR, GET_INT16(P))
#define DO4_HW(CR,P) CR = CRC32(CR, GET_INT32(P))
#define DO8_HW(CR,P) CR = CRC32(CR, GET_INT64(P))

score 11 · Accepted Answer

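The 4 intrinsics provided really do allow every possible use of the Intel-defined CRC32 instruction. The instruction's output is always 32 bits because the instruction is hard-coded to use a specific 32-bit CRC polynomial. However, the instruction lets your code feed it input data 8, 16, 32, or 64 bits at a time. Processing 64 bits at a time should maximize throughput. Processing 32 bits at a time is the best you can do if you are restricted to a 32-bit build. Processing 8 or 16 bits at a time can simplify the code logic when the input byte count is odd or not a multiple of 4/8.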

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <nmmintrin.h>  /* SSE4.2 intrinsics; MSVC and GCC/Clang (build with -msse4.2) */

int main (int argc, char *argv [])
    {
    size_t index;
    uint8_t *data8;
    uint16_t *data16;
    uint32_t *data32;
    uint64_t *data64;
    uint32_t total1, total2, total3;
    uint64_t total4;
    uint64_t input [] = {0x1122334455667788, 0x1111222233334444};

    /* Every accumulator starts from the same seed so the results are comparable. */
    total1 = total2 = total3 = total4 = 0;
    data8  = (void *) input;
    data16 = (void *) input;
    data32 = (void *) input;
    data64 = (void *) input;

    /* Feed the same 16 bytes to the instruction in 8-, 16-, 32- and 64-bit chunks. */
    for (index = 0; index < sizeof input / sizeof *data8; index++)
        total1 = _mm_crc32_u8 (total1, *data8++);

    for (index = 0; index < sizeof input / sizeof *data16; index++)
        total2 = _mm_crc32_u16 (total2, *data16++);

    for (index = 0; index < sizeof input / sizeof *data32; index++)
        total3 = _mm_crc32_u32 (total3, *data32++);

    for (index = 0; index < sizeof input / sizeof *data64; index++)
        total4 = _mm_crc32_u64 (total4, *data64++);

    printf ("CRC32 result using 8-bit chunks: %08X\n", total1);
    printf ("CRC32 result using 16-bit chunks: %08X\n", total2);
    printf ("CRC32 result using 32-bit chunks: %08X\n", total3);
    /* total4 is 64 bits wide, but only its low 32 bits are ever nonzero. */
    printf ("CRC32 result using 64-bit chunks: %08X\n", (uint32_t) total4);
    return 0;
    }
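
To make the point about chunk width concrete without needing SSE4.2 hardware, here is a minimal software sketch of what one crc32 step computes. It assumes the instruction's polynomial is CRC-32C (Castagnoli), used here in its bit-reflected form 0x82F63B78, and that a multi-byte source operand is consumed low byte first. The names model_crc32_u8, model_crc32_u64 and model_crc32c are hypothetical and are not part of any Intel or compiler header. The 0xFFFFFFFF seed and final inversion in model_crc32c are the usual CRC-32C convention, not something the instruction does itself (the program above seeds with 0).

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-reflected CRC-32C (Castagnoli) polynomial; assumed to match the crc32 instruction. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/* Hypothetical model of one step with an 8-bit source operand. */
static uint32_t model_crc32_u8(uint32_t crc, uint8_t v)
{
    int bit;
    crc ^= v;
    for (bit = 0; bit < 8; bit++)
        crc = (crc >> 1) ^ ((crc & 1u) ? CRC32C_POLY_REFLECTED : 0u);
    return crc;
}

/* Hypothetical model of one step with a 64-bit source operand:
   eight byte steps, low byte first. The accumulator stays 32 bits wide. */
static uint32_t model_crc32_u64(uint32_t crc, uint64_t v)
{
    int i;
    for (i = 0; i < 8; i++)
        crc = model_crc32_u8(crc, (uint8_t)(v >> (8 * i)));
    return crc;
}

/* CRC-32C of a buffer: 64-bit chunks while possible, then single bytes for the tail. */
static uint32_t model_crc32c(const unsigned char *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;          /* conventional seed, not imposed by the instruction */
    while (len >= 8) {
        uint64_t v = 0;
        int i;
        for (i = 0; i < 8; i++)          /* assemble the chunk in little-endian order */
            v |= (uint64_t)p[i] << (8 * i);
        crc = model_crc32_u64(crc, v);
        p += 8;
        len -= 8;
    }
    while (len--)
        crc = model_crc32_u8(crc, *p++);
    return crc ^ 0xFFFFFFFFu;            /* conventional final inversion */
}
```

Calling model_crc32c on the nine ASCII bytes "123456789" takes one 64-bit step and one 8-bit step, and yields the standard CRC-32C check value 0xE3069283, the same result as nine 8-bit steps.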

score 2 · Accepted Answer

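Does anyone have portable code (Visual Studio and GCC) that implements the latter? Thanks.

My friend and I wrote a C++ SSE intrinsics wrapper that includes the preferable form of the crc32 instruction with a 64-bit source.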

http://code.google.com/p/sse-intrinsics/

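See the i_crc32() instruction. (Sadly, Intel's SSE intrinsic specification has even more flaws in other instructions; see this page for more examples of flawed intrinsic design.)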
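
For readers who only want the 32-bit-accumulator form the question asks for, a thin wrapper over the standard intrinsics may be enough. The sketch below is not the linked library's i_crc32(); the names crc32_acc_u8 and crc32_acc_u64 and the CRC32_ACC_USES_HW macro are hypothetical. It assumes <nmmintrin.h> is available on both MSVC and GCC/Clang for x86-64 (with -msse4.2 on the latter) and that the target CPU really supports SSE4.2. On other targets it falls back to a bitwise loop, assuming the CRC-32C polynomial, so that the sketch still builds.

```c
#include <stdint.h>

#if (defined(__SSE4_2__) && defined(__x86_64__)) || defined(_M_X64)
#include <nmmintrin.h>            /* _mm_crc32_u8, _mm_crc32_u64 */
#define CRC32_ACC_USES_HW 1
#else
#define CRC32_ACC_USES_HW 0       /* no SSE4.2 at compile time: use the software fallback */
#endif

/* Hypothetical helper: 32-bit accumulator, 8-bit source. */
static uint32_t crc32_acc_u8(uint32_t crc, uint8_t v)
{
#if CRC32_ACC_USES_HW
    return _mm_crc32_u8(crc, v);
#else
    int bit;
    crc ^= v;
    for (bit = 0; bit < 8; bit++)
        crc = (crc >> 1) ^ ((crc & 1u) ? 0x82F63B78u : 0u);
    return crc;
#endif
}

/* Hypothetical helper: 32-bit accumulator, 64-bit source, 32-bit result. */
static uint32_t crc32_acc_u64(uint32_t crc, uint64_t v)
{
#if CRC32_ACC_USES_HW
    /* The intrinsic returns 64 bits whose upper half is zero, so narrowing loses nothing. */
    return (uint32_t)_mm_crc32_u64(crc, v);
#else
    int i;
    for (i = 0; i < 8; i++)       /* low byte first, matching little-endian memory order */
        crc = crc32_acc_u8(crc, (uint8_t)(v >> (8 * i)));
    return crc;
#endif
}
```

With helpers like these, the question's DO8_HW macro could take the same shape as the other three, for example CR = crc32_acc_u64(CR, GET_INT64(P)). Whether the compiler then drops the widening and narrowing moves depends on the compiler and optimization level, so it is worth checking the disassembly.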
