optimization - ARM NEON assembly and floating-point rounding

Question

I am optimizing code for an ARM processor using NEON, but I have a problem: my algorithm contains the following floating-point computation:

round(x*b - y*a)

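The result can be positive or negative.

Currently, I use 2 VMUL and 1 VSUB to do the computation in parallel (4 values per operation, using Q registers and 32-bit floats).

Is there a way to handle the rounding? I know that if the results all had the same sign, I could simply add or subtract 0.5.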

score 2 · Accepted Answer

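I know nothing about assembly, but using the NEON intrinsics in C (I mention their assembly equivalents to help you browse the documentation, even though I can't use them myself), an algorithm for the round function could be: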

// Prepare 3 vectors filled with all 0.5, all -0.5, and all 0
// Corresponding assembly instruction is VDUP
float32x4_t plus  = vdupq_n_f32(0.5f);
float32x4_t minus = vdupq_n_f32(-0.5f);
float32x4_t zero  = vdupq_n_f32(0.0f);

// Assuming the result of x*b - y*a is stored in the following vector:
float32x4_t xb_ya;

// Compare vector with 0
// Corresponding assembly instruction is VCGT
uint32x4_t more_than_zero = vcgtq_f32(xb_ya, zero);
// Each lane of the resulting vector is set to all 1-bits where the comparison
// is true, all 0-bits otherwise.

// Use bit select to choose whether to add or subtract 0.5
// Corresponding assembly instruction is VBSL; it behaves much like
// `more_than_zero ? plus : minus`.
float32x4_t to_add = vbslq_f32(more_than_zero, plus, minus);

// Add this vector to the vector to round
// Corresponding assembly instruction is VADD,
// but I guess you knew this one :D
float32x4_t rounded = vaddq_f32(xb_ya, to_add);

// Then cast to integers!
// Corresponding assembly instruction is VCVT, which truncates toward zero
int32x4_t result = vcvtq_s32_f32(rounded);

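I guess you can translate this into assembly (I can't, anyway).

Note that I have no idea whether this is actually more efficient than standard, non-SIMD code!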
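To make the steps above easier to try out, here is a minimal self-contained sketch. The function names `round_scalar` and `round_array` are hypothetical, and the requirement that the array length be a multiple of 4 is an assumption made to keep the loop short. The NEON path is only compiled when the compiler defines `__ARM_NEON`; otherwise a scalar fallback with the same add-or-subtract-0.5 logic is used, so the two can be compared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Scalar reference: round half away from zero by adding +0.5 or -0.5
 * depending on the sign, then truncating toward zero. */
int32_t round_scalar(float v)
{
    float offset = (v > 0.0f) ? 0.5f : -0.5f;
    return (int32_t)(v + offset); /* float-to-int conversion truncates */
}

/* Rounds n floats into 32-bit integers, four lanes at a time.
 * Assumption: n is a multiple of 4. */
void round_array(const float *in, int32_t *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        float32x4_t v   = vld1q_f32(in + i);
        uint32x4_t  pos = vcgtq_f32(v, vdupq_n_f32(0.0f));       /* VCGT */
        float32x4_t off = vbslq_f32(pos, vdupq_n_f32(0.5f),
                                         vdupq_n_f32(-0.5f));    /* VBSL */
        vst1q_s32(out + i, vcvtq_s32_f32(vaddq_f32(v, off)));    /* VADD, VCVT */
#else
        for (size_t k = 0; k < 4; ++k)
            out[i + k] = round_scalar(in[i + k]);
#endif
    }
}
```

One caveat of this technique: the addition is itself rounded to float precision, so an input just below 0.5 in magnitude (such as the largest float smaller than 0.5) can end up rounded away from zero instead of to 0.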

score 2 · Accepted Answer

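First of all, NEON has a long latency, especially after floating-point multiplications. Because of this, you won't gain much from using two vmul and one vsub compared to VFP programming.

Your code should therefore look like this: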

vmul.f32 result, x, b
vmls.f32 result, y, a

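These multiply-accumulate/subtract instructions are issued back-to-back with the preceding multiply instruction, without any added latency. (9 cycles saved in this case)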
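For readers working with intrinsics rather than assembly, the same multiply followed by multiply-subtract pattern can be sketched as below. The function name `mul_sub4` is hypothetical, and treating x, b, y and a as arrays of four floats each is an assumption, since the question does not say whether a and b are vectors or scalars. `vmlsq_f32(r, y, a)` computes `r - y*a` per lane; whether the compiler emits the exact instruction pair shown above depends on the target and compiler. A scalar fallback is used when NEON is not available.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Computes out[i] = x[i]*b[i] - y[i]*a[i] for four lanes. */
void mul_sub4(const float *x, const float *b,
              const float *y, const float *a, float *out)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* result = x * b */
    float32x4_t r = vmulq_f32(vld1q_f32(x), vld1q_f32(b));
    /* result -= y * a (multiply-subtract) */
    r = vmlsq_f32(r, vld1q_f32(y), vld1q_f32(a));
    vst1q_f32(out, r);
#else
    for (size_t i = 0; i < 4; ++i)
        out[i] = x[i] * b[i] - y[i] * a[i];
#endif
}
```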

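Unfortunately, however, I don't understand your actual question. Why would anyone want to round float values? Apparently you intend to extract the integer part, and there are several ways to do that; I cannot tell you more because your questions are always too vague.

I have been following your questions on this forum for a while, and I simply cannot shake the feeling that you are missing something very basic.

I suggest you start by reading ARM's assembly reference guide PDF.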
