c - Converting code to Neon assembly

Question

I am translating the code below into Neon assembly. Any help would be appreciated.

void sum(int length, int *a, int *b, int *c, int *d, char *result)
{
   int i;

   for (i = 0; i < length; i++)
      {
          int sum = (a[i] + b[i] + c[i] + d[i])/4;
          if (sum > threshold)
             result[i] = 1;
          else
             result[i] = 0;
      }
}

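The actual code is an image binarization algorithm. The code above is only meant to demonstrate the idea without making a simple thing more complicated.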

score 2 · Accepted Answer

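Here is a fairly straightforward implementation. Note that the division and the threshold test are folded into a single test against threshold * 4 (to eliminate the division):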

#include <arm_neon.h>
#include <stdint.h>

void sum(const int n, const int32_t *a, const int32_t *b, const int32_t *c, const int32_t *d, int32_t *result)
{
   const int32_t threshold4 = threshold * 4;
   const int32x4_t vthreshold4 = { threshold4, threshold4, threshold4, threshold4 };
   const uint32x4_t vk1 = { 1, 1, 1, 1 };
   int i;

   for (i = 0; i < n; i += 4)
   {
      int32x4_t va = vld1q_s32(&a[i]);    // load values from a, b, c, d
      int32x4_t vb = vld1q_s32(&b[i]);
      int32x4_t vc = vld1q_s32(&c[i]);
      int32x4_t vd = vld1q_s32(&d[i]);

      int32x4_t vsum = vaddq_s32(va, vb); // sum values from a, b, c, d
      vsum = vaddq_s32(vsum, vc);
      vsum = vaddq_s32(vsum, vd);

      uint32x4_t vcmp = vcgtq_s32(vsum, vthreshold4);
                                          // compare with threshold * 4
      int32x4_t vresult = vreinterpretq_s32_u32(vandq_u32(vcmp, vk1));
                                          // convert result to 0/1
      vst1q_s32(&result[i], vresult);     // store result
   }
}
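
A side note on the threshold * 4 comparison: the code in the question uses truncating integer division, so the scaled test is not bit-identical to it. It can differ when the sum is not a multiple of 4. The scalar sketch below compares the two forms and a variant that should match the original exactly. The helper names are made up for illustration, and it assumes a non-negative threshold and no overflow.

```c
#include <stdint.h>

/* Test as written in the question: truncating division, then compare. */
static int above_by_division(int32_t sum, int32_t threshold)
{
    return sum / 4 > threshold;
}

/* Division-free test as used in the NEON loop: compare with threshold * 4. */
static int above_by_scaling(int32_t sum, int32_t threshold)
{
    return sum > threshold * 4;
}

/* Division-free test that matches the original for threshold >= 0:
   sum / 4 > t  <=>  sum >= 4 * (t + 1)  <=>  sum > 4 * t + 3 */
static int above_exact(int32_t sum, int32_t threshold)
{
    return sum > threshold * 4 + 3;
}
```

With threshold 100 and a sum of 401, the original test gives 0 (401 / 4 is 100), the scaled test gives 1, and the + 3 variant gives 0 again. For binarization the difference is probably harmless. If exact agreement matters, using threshold * 4 + 3 for the vector constant should be enough.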

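Notes:

Completely untested code - may need further work
result has been changed to int32_t * - it is not hard to pack this down to uint8_t, but it adds a lot of complexity to this initial example, so I thought I would keep it simple for now
a, b, c, d and result all need to be 16-byte aligned
n must be a multiple of 4
The sums of a, b, c and d need to fit in a 32-bit signed integer
threshold * 4 needs to fit in a 32-bit signed integer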
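
For the uint8_t packing and the "multiple of 4" restriction mentioned in the notes, one possible approach is sketched below. It is an untested-on-hardware sketch, not part of the original answer: the function name sum_u8, the helper sum_gt4, and passing threshold as a parameter (instead of the global used above) are all assumptions made for illustration. It handles 8 elements per iteration, narrows the 32-bit compare masks to bytes with vmovn, and finishes any leftover elements with a scalar loop. On non-ARM targets only the scalar loop is compiled.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>

/* Hypothetical helper: loads 4 values from each input, sums them and
   returns an all-ones / all-zeros mask per lane (sum > threshold * 4). */
static inline uint32x4_t sum_gt4(const int32_t *a, const int32_t *b,
                                 const int32_t *c, const int32_t *d,
                                 int32x4_t vthreshold4)
{
    int32x4_t vsum = vaddq_s32(vld1q_s32(a), vld1q_s32(b));
    vsum = vaddq_s32(vsum, vld1q_s32(c));
    vsum = vaddq_s32(vsum, vld1q_s32(d));
    return vcgtq_s32(vsum, vthreshold4);
}
#endif

/* Hypothetical variant of sum() that writes one byte (0 or 1) per element. */
void sum_u8(size_t n, const int32_t *a, const int32_t *b,
            const int32_t *c, const int32_t *d,
            int32_t threshold, uint8_t *result)
{
    const int32_t threshold4 = threshold * 4;
    size_t i = 0;

#if defined(__ARM_NEON)
    const int32x4_t vthreshold4 = vdupq_n_s32(threshold4);
    const uint8x8_t vk1 = vdup_n_u8(1);

    for (; i + 8 <= n; i += 8)
    {
        /* two groups of 4 lanes, each lane 0xFFFFFFFF or 0 */
        uint32x4_t lo = sum_gt4(&a[i], &b[i], &c[i], &d[i], vthreshold4);
        uint32x4_t hi = sum_gt4(&a[i + 4], &b[i + 4], &c[i + 4], &d[i + 4],
                                vthreshold4);

        /* narrow 32 -> 16 -> 8 bits; the masks become 0xFF or 0 */
        uint16x8_t m16 = vcombine_u16(vmovn_u32(lo), vmovn_u32(hi));
        uint8x8_t m8 = vmovn_u16(m16);

        vst1_u8(&result[i], vand_u8(m8, vk1)); /* convert to 0/1 and store */
    }
#endif

    /* scalar tail (and the whole loop when NEON is not available) */
    for (; i < n; i++)
        result[i] = (uint8_t)(a[i] + b[i] + c[i] + d[i] > threshold4);
}
```

The overflow conditions from the notes still apply. The sketch uses vld1q_s32 and vst1_u8 without alignment hints, so it only relies on the natural alignment of the element types.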
