arm - What is the most efficient way to handle overflow in a saturating integer multiplication using ARM Neon intrinsics?

Question

I have the following multiplication between two 16-bit vectors:

int16x8_t dx;
int16x8_t dy;
int16x8_t dxdy = vmulq_s16(dx, dy);

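If dx and dy are both large enough, the result will overflow.

I would like to clamp the product between the MIN_INT16 and MAX_INT16 values;

I haven't found a way to do that without first converting the values to int32. This is what I do now: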

int32x4_t dx_low4 = vmovl_s16(vget_low_s16(dx)); // get lower 4 elements and widen
int32x4_t dx_high4 = vmovl_high_s16(dx); // widen higher 4 elements
int32x4_t dy_low4 = vmovl_s16(vget_low_s16(dy)); // get lower 4 elements and widen
int32x4_t dy_high4 = vmovl_high_s16(dy); // widen higher 4 elements

int32x4_t dxdy_low = vmulq_s32(dx_low4, dy_low4);
int32x4_t dxdy_high = vmulq_s32(dx_high4, dy_high4);
// combine and handle saturation:
int16x8_t dxdy = vcombine_s16(vqmovn_s32(dxdy_low), vqmovn_s32(dxdy_high));

Is there a more efficient way to achieve this?

score 4 · Accepted Answer

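Here's another version. It is almost the same as your code but uses fewer instructions, because NEON has a widening multiply. I'm not sure whether it is faster or slower (apparently there are no searchable NEON instruction timings anywhere on the internet). Untested, but the code looks good: 4 instructions.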

inline int16x8_t saturatingMultiply( int16x8_t dx, int16x8_t dy )
{
    // Multiply + widen lower 4 lanes; vget_low_s16 is free, compiles into no instructions
    const int32x4_t low32 = vmull_s16( vget_low_s16( dx ), vget_low_s16( dy ) );
    // Multiply + widen higher 4 lanes
    const int32x4_t high32 = vmull_high_s16( dx, dy );
    // Saturate + narrow lower 4 lanes
    const int16x4_t low16 = vqmovn_s32( low32 );
    // Saturate + narrow remaining 4 lanes, moving into the higher lanes of the result
    return vqmovn_high_s32( low16, high32 );
}

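If you compile this for ARMv7 rather than ARM64, it will need some changes: work around the missing intrinsics (vmull_high_s16 and vqmovn_high_s32) with vget_high_s16 and vcombine_s16.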
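For what it's worth, the ARMv7 workaround might look roughly like the sketch below. The names saturatingMultiplyV7, saturatingMultiplyScalar and saturatingMultiply8 are made up for illustration and are not part of any library. The scalar function is only there as a reference to check lanes against, and the NEON path is guarded by __ARM_NEON so the file still builds on a machine without NEON. I have not measured whether it is any faster or slower than the ARM64 version.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical ARMv7-friendly variant: uses only intrinsics that exist on
   both ARMv7 and ARM64 (no vmull_high_s16, no vqmovn_high_s32). */
static inline int16x8_t saturatingMultiplyV7(int16x8_t dx, int16x8_t dy)
{
    /* Multiply + widen the lower 4 lanes */
    const int32x4_t low32 = vmull_s16(vget_low_s16(dx), vget_low_s16(dy));
    /* Multiply + widen the higher 4 lanes */
    const int32x4_t high32 = vmull_s16(vget_high_s16(dx), vget_high_s16(dy));
    /* Saturate + narrow each half, then join them into one 8-lane vector */
    return vcombine_s16(vqmovn_s32(low32), vqmovn_s32(high32));
}
#endif

/* Scalar reference for a single lane: widen to 32 bits, multiply, clamp.
   The 32-bit product of two int16 values can never overflow. */
static inline int16_t saturatingMultiplyScalar(int16_t x, int16_t y)
{
    const int32_t wide = (int32_t)x * (int32_t)y;
    if (wide > INT16_MAX) return INT16_MAX;
    if (wide < INT16_MIN) return INT16_MIN;
    return (int16_t)wide;
}

/* Multiplies 8 lanes with whichever implementation is available. */
static void saturatingMultiply8(const int16_t dx[8], const int16_t dy[8],
                                int16_t out[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    vst1q_s16(out, saturatingMultiplyV7(vld1q_s16(dx), vld1q_s16(dy)));
#else
    for (int i = 0; i < 8; ++i)
        out[i] = saturatingMultiplyScalar(dx[i], dy[i]);
#endif
}
```

Running saturatingMultiply8 over a few edge cases (for example -32768 times -32768, or 300 times 300) and comparing each lane against saturatingMultiplyScalar is a cheap way to test the vector path.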
