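Is there an intrinsic that adds up all the elements (lanes) of a vector? I am using NEON to multiply 8 numbers, and I need to sum the results. Here is some illustrative code showing what I am currently doing (it can probably be optimized):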
int16_t p[8], q[8], r[8];
int32_t sum;
int16x8_t pneon, qneon, result;
p[0] = some_number;
p[1] = some_other_number;
//etc etc
pneon = vld1q_s16(p);
q[0] = some_other_other_number;
q[1] = some_other_other_other_number;
//etc etc
qneon = vld1q_s16(q);
result = vmulq_s16(pneon, qneon);
vst1q_s16(r,result);
sum = ((int32_t) r[0] + (int32_t) r[1] + ... //etc );
Is there a better way?
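For reference, here is a rough sketch of the kind of thing I am hoping for: the sum stays in NEON registers instead of going through `vst1q_s16` and a scalar loop. The helper name `dot8_s16` is made up for illustration. Note one deliberate difference from the code above: it uses the widening multiply `vmull_s16` / `vmlal_s16`, so each product is kept as 32 bits rather than truncated to 16 bits as with `vmulq_s16`. The 32-bit total can still overflow for extreme inputs. The scalar branch is only there so the snippet also builds on non-ARM machines.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* dot8_s16 is a hypothetical helper name, not a NEON intrinsic.
 * It multiplies 8 int16 pairs and returns the 32-bit sum of the products. */
int32_t dot8_s16(const int16_t p[8], const int16_t q[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int16x8_t a = vld1q_s16(p);
    int16x8_t b = vld1q_s16(q);

    /* Widening multiply of the low 4 lanes: 4 x int32 products. */
    int32x4_t acc = vmull_s16(vget_low_s16(a), vget_low_s16(b));
    /* Widening multiply-accumulate of the high 4 lanes. */
    acc = vmlal_s16(acc, vget_high_s16(a), vget_high_s16(b));

    /* Horizontal add: 4 lanes -> 2 lanes -> 1 lane. */
    int32x2_t half = vadd_s32(vget_low_s32(acc), vget_high_s32(acc));
    half = vpadd_s32(half, half);
    return vget_lane_s32(half, 0);
#else
    /* Portable fallback with the same result for non-overflowing inputs. */
    int32_t sum = 0;
    for (int i = 0; i < 8; ++i) {
        sum += (int32_t)p[i] * (int32_t)q[i];
    }
    return sum;
#endif
}
```

With the arrays from the code above this would be called as `sum = dot8_s16(p, q);`. On AArch64 the last three NEON lines could presumably be replaced by a single `vaddvq_s32(acc)`, but that intrinsic is not available on 32-bit ARMv7.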