我还没有找到任何关于这个主题的明确基准,所以我做了一个。我会把它贴在这里,以防有人像我一样寻找这个。
我有一个问题。SSE 不应该比循环中的四个 fpu RSQRT 快 4 倍吗?它更快,但只有 1.5 倍。迁移到 SSE 寄存器是否会产生如此大的影响,因为我不进行大量计算,而只进行 rsqrt?还是因为 SSE rsqrt 更精确,我如何找到 sse rsqrt 的迭代次数?两个结果:
4 align16 float[4] RSQRT: 87011us 2236.07 - 2236.07 - 2236.07 - 2236.07
4 SSE align16 float[4] RSQRT: 60008us 2236.07 - 2236.07 - 2236.07 - 2236.07
编辑
/GS- /Gy /fp:fast /arch:SSE2 /Ox /Oy- /GL /Oi
在 AMD Athlon II X2 270 上使用 MSVC 11 编译
测试代码:
#include <iostream>
#include <chrono>
#include <th/thutility.h>
int main(void)
{
float i;
//long i;
float res;
__declspec(align(16)) float var[4] = {0};
auto t1 = std::chrono::high_resolution_clock::now();
for(i = 0; i < 5000000; i+=1)
res = sqrt(i);
auto t2 = std::chrono::high_resolution_clock::now();
std::cout << "1 float SQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << res << std::endl;
t1 = std::chrono::high_resolution_clock::now();
for(i = 0; i < 5000000; i+=1)
{
thutility::math::rsqrt(i, res);
res *= i;
}
t2 = std::chrono::high_resolution_clock::now();
std::cout << "1 float RSQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << res << std::endl;
t1 = std::chrono::high_resolution_clock::now();
for(i = 0; i < 5000000; i+=1)
{
thutility::math::rsqrt(i, var[0]);
var[0] *= i;
}
t2 = std::chrono::high_resolution_clock::now();
std::cout << "1 align16 float[4] RSQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << var[0] << std::endl;
t1 = std::chrono::high_resolution_clock::now();
for(i = 0; i < 5000000; i+=1)
{
thutility::math::rsqrt(i, var[0]);
var[0] *= i;
thutility::math::rsqrt(i, var[1]);
var[1] *= i + 1;
thutility::math::rsqrt(i, var[2]);
var[2] *= i + 2;
}
t2 = std::chrono::high_resolution_clock::now();
std::cout << "3 align16 float[4] RSQRT: "
<< std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us "
<< var[0] << " - " << var[1] << " - " << var[2] << std::endl;
t1 = std::chrono::high_resolution_clock::now();
for(i = 0; i < 5000000; i+=1)
{
thutility::math::rsqrt(i, var[0]);
var[0] *= i;
thutility::math::rsqrt(i, var[1]);
var[1] *= i + 1;
thutility::math::rsqrt(i, var[2]);
var[2] *= i + 2;
thutility::math::rsqrt(i, var[3]);
var[3] *= i + 3;
}
t2 = std::chrono::high_resolution_clock::now();
std::cout << "4 align16 float[4] RSQRT: "
<< std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us "
<< var[0] << " - " << var[1] << " - " << var[2] << " - " << var[3] << std::endl;
t1 = std::chrono::high_resolution_clock::now();
for(i = 0; i < 5000000; i+=1)
{
var[0] = i;
__m128& cache = reinterpret_cast<__m128&>(var);
__m128 mmsqrt = _mm_rsqrt_ss(cache);
cache = _mm_mul_ss(cache, mmsqrt);
}
t2 = std::chrono::high_resolution_clock::now();
std::cout << "1 SSE align16 float[4] RSQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count()
<< "us " << var[0] << std::endl;
t1 = std::chrono::high_resolution_clock::now();
for(i = 0; i < 5000000; i+=1)
{
var[0] = i;
var[1] = i + 1;
var[2] = i + 2;
var[3] = i + 3;
__m128& cache = reinterpret_cast<__m128&>(var);
__m128 mmsqrt = _mm_rsqrt_ps(cache);
cache = _mm_mul_ps(cache, mmsqrt);
}
t2 = std::chrono::high_resolution_clock::now();
std::cout << "4 SSE align16 float[4] RSQRT: "
<< std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << var[0] << " - "
<< var[1] << " - " << var[2] << " - " << var[3] << std::endl;
system("PAUSE");
}
使用浮点类型的结果:
1 float SQRT: 24996us 2236.07
1 float RSQRT: 28003us 2236.07
1 align16 float[4] RSQRT: 32004us 2236.07
3 align16 float[4] RSQRT: 51013us 2236.07 - 2236.07 - 5e+006
4 align16 float[4] RSQRT: 87011us 2236.07 - 2236.07 - 2236.07 - 2236.07
1 SSE align16 float[4] RSQRT: 46999us 2236.07
4 SSE align16 float[4] RSQRT: 60008us 2236.07 - 2236.07 - 2236.07 - 2236.07
我的结论是,除非我们对不少于 4 个变量进行计算,否则不值得为 SSE2 烦恼。(也许这仅适用于 rsqrt 这里但它是一个昂贵的计算(它还包括多个乘法)所以它可能也适用于其他计算)
同样 sqrt(x) 比两次迭代的 x*rsqrt(x) 更快,而一次迭代的 x*rsqrt(x) 对于距离计算来说太不准确了。
所以我在一些板上看到的关于 x*rsqrt(x) 比 sqrt(x) 快的陈述是错误的。因此,除非您直接需要 1/x^(1/2),否则使用 rsqrt 代替 sqrt 是不合逻辑的,也不值得损失精度。
尝试不使用 SSE2 标志(如果它在正常 rsqrt 循环上应用 SSE,它会给出相同的结果)。
我的 RSQRT 是 quake rsqrt 的修改(相同)版本。
namespace thutility
{
namespace math
{
void rsqrt(const float& number, float& res)
{
const float threehalfs = 1.5F;
const float x2 = number * 0.5F;
res = number;
uint32_t& i = *reinterpret_cast<uint32_t *>(&res); // evil floating point bit level hacking
i = 0x5f3759df - ( i >> 1 ); // what the fuck?
res = res * ( threehalfs - ( x2 * res * res ) ); // 1st iteration
res = res * ( threehalfs - ( x2 * res * res ) ); // 2nd iteration, this can be removed
}
}
}