c++11 - Move elements to the left of a SIMD register based on a boolean mask

Question

This question is related to: Optimal uint8_t bitmap into a 8 x 32bit SIMD "bool" vector

I would like to create an optimal function with this signature:

__m256i PackLeft(__m256i inputVector, __m256i boolVector);

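The desired behaviour, on an input of 64-bit ints, is as follows:

inputVector = {42, 17, 13, 3}

boolVector = {true, false, true, false}

It masks out every value whose entry in boolVector is false and then repacks the remaining values to the left. For the input above, the return value should be: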

{42, 13, X, X}

... where X is "I don't care".
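
To pin down the expected result, here is a minimal scalar sketch of the behaviour described above. It is not a SIMD implementation: `PackLeftReference` is a hypothetical name, the lanes are modelled as a plain `std::array`, and it assumes a lane of boolVector counts as true whenever it is non-zero. The "don't care" lanes are simply left at zero here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of PackLeft (not the SIMD version asked for).
// Assumption: a boolVector lane is "true" when it is non-zero.
std::array<std::int64_t, 4> PackLeftReference(
    const std::array<std::int64_t, 4>& inputVector,
    const std::array<std::int64_t, 4>& boolVector) {
    std::array<std::int64_t, 4> result{};  // "don't care" lanes stay 0 here
    std::size_t next = 0;                  // next free slot on the left
    for (std::size_t lane = 0; lane < 4; ++lane) {
        if (boolVector[lane] != 0) {
            result[next++] = inputVector[lane];  // keep this lane, packed left
        }
    }
    return result;
}
```

A vectorised PackLeft could be checked against this model lane by lane, comparing only the first popcount(mask) lanes.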

An obvious way to do this is to use _mm_movemask_epi8 to get an integer bitmask out of the bool vector, look up the shuffle mask in a table, and then do a shuffle with that mask.
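
For illustration, the table-driven idea can be modelled in portable C++ as below. This is only a sketch of what the table would contain, not real intrinsics code: `BuildPackLeftTable`, `MoveMask` and `PackLeftTable` are hypothetical names, and it assumes a true lane has its sign bit set (for example, all bits set). With AVX2, the movemask step would presumably be something like `_mm256_movemask_pd` on the cast vector, and the shuffle a cross-lane permute such as `_mm256_permutevar8x32_epi32`; both choices are assumptions, not part of the question.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Lanes = std::array<std::int64_t, 4>;
using Perm = std::array<std::uint8_t, 4>;

// Build the 16-entry table: entry m lists the lane indices of the set bits
// of m, lowest first. Unused trailing slots are "don't care" and stay 0.
std::array<Perm, 16> BuildPackLeftTable() {
    std::array<Perm, 16> table{};
    for (unsigned m = 0; m < 16; ++m) {
        std::size_t next = 0;
        for (std::uint8_t lane = 0; lane < 4; ++lane) {
            if (m & (1u << lane)) table[m][next++] = lane;
        }
    }
    return table;
}

// Scalar stand-in for the movemask step: one bit per lane, taken from the
// sign bit (assumption: "true" lanes have their sign bit set).
unsigned MoveMask(const Lanes& boolVector) {
    unsigned bits = 0;
    for (std::size_t lane = 0; lane < 4; ++lane) {
        if (boolVector[lane] < 0) bits |= 1u << lane;
    }
    return bits;
}

// Look up the permutation for this mask and apply it (the "shuffle").
Lanes PackLeftTable(const Lanes& inputVector, const Lanes& boolVector) {
    static const std::array<Perm, 16> table = BuildPackLeftTable();
    const unsigned bits = MoveMask(boolVector);
    assert(bits < 16);
    const Perm& perm = table[bits];
    Lanes result{};
    for (std::size_t slot = 0; slot < 4; ++slot) {
        result[slot] = inputVector[perm[slot]];
    }
    return result;
}
```

With four 64-bit lanes the table has only 16 entries, so the cost is one extra load per call.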

However, I would like to avoid using a lookup table if possible. Is there a faster solution?

score 0 · Accepted Answer

Andreas Fredriksson covers this nicely in his 2015 GDC talk: https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf

Starting on slide 104, he covers how to do this using only SSSE3, and then using only SSE2.

score -1 · Accepted Answer

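Just saw this question - perhaps you have already solved it, but I am still writing up the logic for other programmers who may need to handle this situation.

The solution (in Intel ASM format) is given below. It consists of three steps:

Step 0: Convert the 8-bit mask into a 64-bit mask, with each set bit of the original mask represented as 8 set bits in the expanded mask.

Step 1: Use this expanded mask to extract the relevant bits from the source data.

Step 2: Since you need the data to stay packed, we shift the output by the appropriate number of bits.

The code is as follows: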

; Step 0 : convert the 8 bit mask into a 64 bit mask
    xor     r8,r8
    movzx   rax,byte ptr mask_pattern
    mov     r9,rax  ; save a copy of the mask - avoids a memory read in Step 2
    mov     rcx,8   ; size of mask in bit count
outer_loop :
    shr     al,1    ; get the least significant bit of the mask into CY
    setnc   dl      ; set DL to 0 if CY=1, else 1
    dec     dl      ; if mask lsb was 1, DL becomes 0FFh, else it becomes 00h
    shrd    r8,rdx,8
    loop    outer_loop
; We get the mask duplicated in R8, except it now represents bytewise mask
; Step 1 : extract the selected bytes, compressed towards the lowest-order bits
    mov     rax,qword ptr data_pattern
    pext    rax,rax,r8
; Step 2 : shift left so that the packed bytes end up at the high-order end of RAX
    popcnt  r9,r9   ; get the count of bits set in the mask
    mov     rcx,8
    sub     cl,r9b  ; compute 8-(count of bits set to 1 in the mask)
    shl     cl,3    ; convert the count of bytes to a count of bits
    shl     rax,cl
;The required data is in RAX
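
For readers who prefer C++, the three steps can be sketched portably as follows. This is a model of the assembly above, not a drop-in replacement: all function names are hypothetical, and `SoftwarePext` is a slow bit-by-bit stand-in for the BMI2 `pext` instruction (on hardware with BMI2, the `_pext_u64` intrinsic from `<immintrin.h>` would take its place). Like the assembly, it packs 8 bytes inside one 64-bit word rather than lanes of a `__m256i`.

```cpp
#include <cstdint>

// Step 0: expand each bit of the 8-bit mask into a full byte (0xFF or 0x00).
std::uint64_t ExpandMask(std::uint8_t mask) {
    std::uint64_t expanded = 0;
    for (int bit = 0; bit < 8; ++bit) {
        if (mask & (1u << bit)) expanded |= std::uint64_t{0xFF} << (8 * bit);
    }
    return expanded;
}

// Step 1 helper: portable stand-in for pext. Gathers the bits of `value`
// selected by `selector` into the low-order bits of the result.
std::uint64_t SoftwarePext(std::uint64_t value, std::uint64_t selector) {
    std::uint64_t result = 0;
    int out = 0;  // next output bit position
    for (int bit = 0; bit < 64; ++bit) {
        if (selector & (std::uint64_t{1} << bit)) {
            if (value & (std::uint64_t{1} << bit)) result |= std::uint64_t{1} << out;
            ++out;
        }
    }
    return result;
}

// Number of set bits in the 8-bit mask (the popcnt step).
int PopCount8(std::uint8_t mask) {
    unsigned m = mask;
    int count = 0;
    while (m != 0) {
        m &= m - 1;  // clear the lowest set bit
        ++count;
    }
    return count;
}

// Steps 0-2 combined: keep the selected bytes and shift them to the
// high-order end of the word, as the assembly's final shl does.
std::uint64_t PackBytes(std::uint64_t data, std::uint8_t mask) {
    const std::uint64_t packed = SoftwarePext(data, ExpandMask(mask));
    const int kept = PopCount8(mask);
    if (kept == 0) return 0;  // avoid a 64-bit shift, which is undefined in C++
    return packed << (8 * (8 - kept));
}
```

For example, with data 0x8877665544332211 and mask 0x05, bytes 0 and 2 (0x11 and 0x33) are kept, giving 0x3311 after the extract and 0x3311000000000000 after the shift.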

I hope this helps.
