sum - 使用 AVX 一次性完成 4 个水平双精度求和

Question

问题可以描述如下。

输入

__m256d a, b, c, d

输出

__m256d s = {a[0]+a[1]+a[2]+a[3], b[0]+b[1]+b[2]+b[3], 
             c[0]+c[1]+c[2]+c[3], d[0]+d[1]+d[2]+d[3]}

到目前为止我所做的工作

这似乎很容易：两个 VHADD 之间有一些改组，但实际上结合 AVX 的所有排列不能生成实现该目标所需的排列。让我解释：

VHADD x, a, b => x = {a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3]}
VHADD y, c, d => y = {c[0]+c[1], d[0]+d[1], c[2]+c[3], d[2]+d[3]}

我是否能够以相同的方式排列 x 和 y 以获得

x1 = {a[0]+a[1], a[2]+a[3], c[0]+c[1], c[2]+c[3]}
y1 = {b[0]+b[1], b[2]+b[3], d[0]+d[1], d[2]+d[3]}

然后

VHADD s, x1, y1 => s1 = {a[0]+a[1]+a[2]+a[3], b[0]+b[1]+b[2]+b[3], 
                         c[0]+c[1]+c[2]+c[3], d[0]+d[1]+d[2]+d[3]}

这就是我想要的结果。

因此我只需要找到如何执行

x,y => {x[0], x[2], y[0], y[2]}, {x[1], x[3], y[1], y[3]}

不幸的是，我得出的结论是，使用 VSHUFPD、VBLENDPD、VPERMILPD、VPERM2F128、VUNPCKHPD、VUNPCKLPD 的任何组合都证明是不可能的。问题的关键在于，在 __m256d 的实例 u 中交换 u[1] 和 u[2] 是不可能的。

问题

这真的是死胡同吗？还是我错过了排列指令？

score 6 · Accepted Answer

VHADD指令意味着要遵循常规VADD。下面的代码应该给你你想要的：

// {a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3]}
__m256d sumab = _mm256_hadd_pd(a, b);
// {c[0]+c[1], d[0]+d[1], c[2]+c[3], d[2]+d[3]}
__m256d sumcd = _mm256_hadd_pd(c, d);

// {a[0]+a[1], b[0]+b[1], c[2]+c[3], d[2]+d[3]}
__m256d blend = _mm256_blend_pd(sumab, sumcd, 0b1100);
// {a[2]+a[3], b[2]+b[3], c[0]+c[1], d[0]+d[1]}
__m256d perm = _mm256_permute2f128_pd(sumab, sumcd, 0x21);

__m256d sum =  _mm256_add_pd(perm, blend);

这给出了 5 条指令的结果。我希望我得到了正确的常数。

您提出的排列当然可以完成，但需要多条指令。对不起，我没有回答你问题的那一部分。

编辑：我无法抗拒，这是完整的排列。（再一次，我尽了最大努力使常量正确。）你可以看到交换是可能的u[1]，u[2]只是需要一些工作。在第一代中跨越 128 位的障碍是很困难的。AVX。我还想说这比它VADD更可取，VHADD因为VADD它具有两倍的吞吐量，即使它执行相同数量的添加。

// {x[0],x[1],x[2],x[3]}
__m256d x;

// {x[1],x[0],x[3],x[2]}
__m256d xswap = _mm256_permute_pd(x, 0b0101);

// {x[3],x[2],x[1],x[0]}
__m256d xflip128 = _mm256_permute2f128_pd(xswap, xswap, 0x01);

// {x[0],x[2],x[1],x[3]} -- not imposssible to swap x[1] and x[2]
__m256d xblend = _mm256_blend_pd(x, xflip128, 0b0110);

// repeat the same for y
// {y[0],y[2],y[1],y[3]}
__m256d yblend;

// {x[0],x[2],y[0],y[2]}
__m256d x02y02 = _mm256_permute2f128_pd(xblend, yblend, 0x20);

// {x[1],x[3],y[1],y[3]}
__m256d x13y13 = _mm256_permute2f128_pd(xblend, yblend, 0x31);

score 0 · Accepted Answer

我不知道有任何指令可以让您进行这种排列。AVX 指令的操作通常使得寄存器的高 128 位和低 128 位有些独立；将两半的值混合的能力并不多。我能想到的最佳实现将基于这个问题的答案：

__m128d horizontal_add_pd(__m256d x1, __m256d x2)
{
    // calculate 4 two-element horizontal sums:
    // lower 64 bits contain x1[0] + x1[1]
    // next 64 bits contain x2[0] + x1[1]
    // next 64 bits contain x1[2] + x1[3]
    // next 64 bits contain x2[2] + x2[3]
    __m256d sum = _mm256_hadd_pd(x1, x2);
    // extract upper 128 bits of result
    __m128d sum_high = _mm256_extractf128_pd(sum1, 1);
    // add upper 128 bits of sum to its lower 128 bits
    __m128d result = _mm_add_pd(sum_high, (__m128d) sum);
    // lower 64 bits of result contain the sum of x1[0], x1[1], x1[2], x1[3]
    // upper 64 bits of result contain the sum of x2[0], x2[1], x2[2], x2[3]
    return result;
}

__m256d a, b, c, d;
__m128d res1 = horizontal_add_pd(a, b);
__m128d res2 = horizontal_add_pd(c, d);
// At this point:
//     res1 contains a's horizontal sum in bits 0-63
//     res1 contains b's horizontal sum in bits 64-127
//     res2 contains c's horizontal sum in bits 0-63
//     res2 contains d's horizontal sum in bits 64-127
// cast res1 to a __m256d, then insert res2 into the upper 128 bits of the result
__m256d sum = _mm256_insertf128_pd(_mm256_castpd128_pd256(res1), res2, 1);
// At this point:
//     sum contains a's horizontal sum in bits 0-63
//     sum contains b's horizontal sum in bits 64-127
//     sum contains c's horizontal sum in bits 128-191
//     sum contains d's horizontal sum in bits 192-255

这应该是你想要的。以上应该在 7 条总指令中是可行的（强制转换不应该真正做任何事情；它只是给编译器的一个注释，以改变它处理中的值的方式res1），假设horizontal_add_pd()你的编译器可以内联短函数并且你有足够的可用寄存器。

sum - 使用 AVX 一次性完成 4 个水平双精度求和

2 回答 2

Related

Reference