我在搞乱 SSE,试图编写一个函数,将单精度浮点数组的所有值相加。我希望它适用于数组的所有长度,而不仅仅是 4 的倍数,就像网络上几乎所有示例中所假设的那样。我想出了这样的事情:
float sse_sum(const float *x, const size_t n)
{
const size_t
steps = n / 4,
rem = n % 4,
limit = steps * 4;
__m128
v, // vector of current values of x
sum = _mm_setzero_ps(0.0f); // sum accumulator
// perform the main part of the addition
size_t i;
for (i = 0; i < limit; i+=4)
{
v = _mm_load_ps(&x[i]);
sum = _mm_add_ps(sum, v);
}
// add the last 1 - 3 odd items if necessary, based on the remainder value
switch(rem)
{
case 0:
// nothing to do if no elements left
break;
case 1:
// put 1 remaining value into v, initialize remaining 3 to 0.0
v = _mm_load_ss(&x[i]);
sum = _mm_add_ps(sum, v);
break;
case 2:
// set all 4 to zero
v = _mm_setzero_ps();
// load remaining 2 values into lower part of v
v = _mm_loadl_pi(v, (const __m64 *)(&x[i]));
sum = _mm_add_ps(sum, v);
break;
case 3:
// put the last one of the remaining 3 values into v, initialize rest to 0.0
v = _mm_load_ss(&x[i+2]);
// copy lower part containing one 0.0 and the value into the higher part
v = _mm_movelh_ps(v,v);
// load remaining 2 of the 3 values into lower part, overwriting
// old contents
v = _mm_loadl_pi(v, (const __m64*)(&x[i]));
sum = _mm_add_ps(sum, v);
break;
}
// add up partial results
sum = _mm_hadd_ps(sum, sum);
sum = _mm_hadd_ps(sum, sum);
__declspec(align(16)) float ret;
/// and store the final value in a float variable
_mm_store_ss(&ret, sum);
return ret;
}
然后我开始怀疑这不是矫枉过正。我的意思是,我陷入了 SIMD 模式,只需要用 SSE 处理尾部。这很有趣,但是将尾部相加并使用常规浮点运算计算结果不是同样好(而且更简单)吗?我在 SSE 做这件事有什么收获吗?