assembly - 使用 AVX-512 或 AVX-2 对大数据计数 1 位（人口计数）

Question

我有一大块内存，比如 256 KiB 或更长。我想计算整个块中 1 的位数，或者换句话说：将所有字节的“人口计数”值相加。

我知道 AVX-512 有一条VPOPCNTDQ 指令，它计算 512 位向量中每个连续 64 位中 1 的位数，而 IIANM 应该可以在每个周期发出其中一个（如果适当的 SIMD 向量寄存器是可用） - 但我没有任何编写 SIMD 代码的经验（我更像是一个 GPU 人）。此外，我不是 100% 确定编译器对 AVX-512 目标的支持。

在大多数 CPU 上，仍然不（完全）支持 AVX-512；但 AVX-2 已广泛使用。我无法找到类似于 VPOPCNTDQ 的小于 512 位的矢量化指令，所以即使理论上我也不确定如何使用支持 AVX-2 的 CPU 快速计算位数；也许存在这样的东西，而我只是以某种方式错过了它？

无论如何，我很欣赏一个简短的 C/C++ 函数 - 使用一些内部包装库或内联汇编 - 对于两个指令集中的每一个。签名是

uint64_t count_bits(void* ptr, size_t size);

笔记：

与如何在 Sandy Bridge 上的一系列 int 中快速将位计数到单独的 bin 中？，但不是骗子。
如果这很重要，我们可以假设输入是对齐的。
忘记多个内核或套接字，我想要单个（单个线程上的）内核的代码。

score 6 · Accepted Answer

AVX-2

@HadiBreis 的评论链接到 Wojciech Muła 撰写的关于 SSSE3 快速人口计数的文章；文章链接到这个 GitHub 存储库；并且存储库具有以下 AVX-2 实现。它基于矢量化查找指令，并使用 16 值查找表来计算半字节的位数。

#   include <immintrin.h>
#   include <x86intrin.h>

std::uint64_t popcnt_AVX2_lookup(const uint8_t* data, const size_t n) {

    size_t i = 0;

    const __m256i lookup = _mm256_setr_epi8(
        /* 0 */ 0, /* 1 */ 1, /* 2 */ 1, /* 3 */ 2,
        /* 4 */ 1, /* 5 */ 2, /* 6 */ 2, /* 7 */ 3,
        /* 8 */ 1, /* 9 */ 2, /* a */ 2, /* b */ 3,
        /* c */ 2, /* d */ 3, /* e */ 3, /* f */ 4,

        /* 0 */ 0, /* 1 */ 1, /* 2 */ 1, /* 3 */ 2,
        /* 4 */ 1, /* 5 */ 2, /* 6 */ 2, /* 7 */ 3,
        /* 8 */ 1, /* 9 */ 2, /* a */ 2, /* b */ 3,
        /* c */ 2, /* d */ 3, /* e */ 3, /* f */ 4
    );

    const __m256i low_mask = _mm256_set1_epi8(0x0f);

    __m256i acc = _mm256_setzero_si256();

#define ITER { \
        const __m256i vec = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i)); \
        const __m256i lo  = _mm256_and_si256(vec, low_mask); \
        const __m256i hi  = _mm256_and_si256(_mm256_srli_epi16(vec, 4), low_mask); \
        const __m256i popcnt1 = _mm256_shuffle_epi8(lookup, lo); \
        const __m256i popcnt2 = _mm256_shuffle_epi8(lookup, hi); \
        local = _mm256_add_epi8(local, popcnt1); \
        local = _mm256_add_epi8(local, popcnt2); \
        i += 32; \
    }

    while (i + 8*32 <= n) {
        __m256i local = _mm256_setzero_si256();
        ITER ITER ITER ITER
        ITER ITER ITER ITER
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(local, _mm256_setzero_si256()));
    }

    __m256i local = _mm256_setzero_si256();

    while (i + 32 <= n) {
        ITER;
    }

    acc = _mm256_add_epi64(acc, _mm256_sad_epu8(local, _mm256_setzero_si256()));

#undef ITER

    uint64_t result = 0;

    result += static_cast<uint64_t>(_mm256_extract_epi64(acc, 0));
    result += static_cast<uint64_t>(_mm256_extract_epi64(acc, 1));
    result += static_cast<uint64_t>(_mm256_extract_epi64(acc, 2));
    result += static_cast<uint64_t>(_mm256_extract_epi64(acc, 3));

    for (/**/; i < n; i++) {
        result += lookup8bit[data[i]];
    }

    return result;
}

AVX-512

同一存储库还具有基于 VPOPCNT 的 AVX-512 实现。在列出它的代码之前，这里是简化且更易读的伪代码：

对于每个 64 字节的连续序列：

将序列加载到 64x8 = 512 位的 SIMD 寄存器中

在该寄存器上执行 8 个 64 位的并行填充计数

将 8 个人口计数结果并行添加到包含 8 个和的“累加器”寄存器中

将累加器中的 8 个值相加

如果尾部小于 64 字节，请以更简单的方式计算其中的位

返回主和加上尾和

现在是真正的交易：

#   include <immintrin.h>
#   include <x86intrin.h>

uint64_t avx512_vpopcnt(const uint8_t* data, const size_t size) {
    
    const size_t chunks = size / 64;

    uint8_t* ptr = const_cast<uint8_t*>(data);
    const uint8_t* end = ptr + size;

    // count using AVX512 registers
    __m512i accumulator = _mm512_setzero_si512();
    for (size_t i=0; i < chunks; i++, ptr += 64) {
        
        // Note: a short chain of dependencies, likely unrolling will be needed.
        const __m512i v = _mm512_loadu_si512((const __m512i*)ptr);
        const __m512i p = _mm512_popcnt_epi64(v);

        accumulator = _mm512_add_epi64(accumulator, p);
    }

    // horizontal sum of a register
    uint64_t tmp[8] __attribute__((aligned(64)));
    _mm512_store_si512((__m512i*)tmp, accumulator);

    uint64_t total = 0;
    for (size_t i=0; i < 8; i++) {
        total += tmp[i];
    }

    // popcount the tail
    while (ptr + 8 < end) {
        total += _mm_popcnt_u64(*reinterpret_cast<const uint64_t*>(ptr));
        ptr += 8;
    }

    while (ptr < end) {
        total += lookup8bit[*ptr++];
    }

    return total;
}

这lookup8bit是一个用于字节而不是位的 popcnt 查找表，并在此处定义。编辑：正如评论者所指出的，最后使用 8 位查找表不是一个好主意，可以改进。

score 3 · Accepted Answer

Wojciech Muła 的大数组 popcnt 函数看起来是最优的，除了标量清理循环。（有关主循环的详细信息，请参阅@einpoklum 的答案）。

最后只使用几次的 256 条目 LUT 可能会缓存未命中，即使缓存很热，也不是超过 1 个字节的最佳选择。我相信所有的 AVX2 CPU 都有硬件popcnt，我们可以轻松地隔离最后最多 8 个尚未计算的字节，以便为我们设置单个popcnt.

与 SIMD 算法一样，在缓冲区的最后一个字节处进行全角加载通常效果很好。但与向量寄存器不同的是，全整数寄存器的可变计数移位很便宜（尤其是 BMI2）。Popcnt 不关心位在哪里，所以我们可以只使用移位而不需要构造 AND 掩码或其他任何东西。

// untested
// ptr points at the first byte that hasn't been counted yet
uint64_t final_bytes = reinterpret_cast<const uint64_t*>(end)[-1] >> (8*(end-ptr));
total += _mm_popcnt_u64( final_bytes );
// Careful, this could read outside a small buffer.

或者更好的是，使用更复杂的逻辑来避免页面交叉。例如，这可以避免页面开始处的 6 字节缓冲区的页面交叉。

assembly - 使用 AVX-512 或 AVX-2 对大数据计数 1 位（人口计数）

2 回答 2

AVX-2

AVX-512

Related

Reference