I'm trying to write a geometric mean, sqrt(a * b), using AVX intrinsics, but it runs slower than molasses!
#include <immintrin.h>

__m128i g_data[8]; // assumed declaration; not shown in the original snippet

int main()
{
    int count = 0;
    for (int i = 0; i < 100000000; ++i)
    {
        // Two vectors, each holding eight copies of a 16-bit value in [0, 15].
        __m128i v8n_a = _mm_set1_epi16((++count) % 16),
                v8n_b = _mm_set1_epi16((++count) % 16);
        __m128i v8n_0 = _mm_set1_epi16(0);
        __m256i temp1, temp2;
        // Zero-extend 16 -> 32 bits by interleaving with zero, then convert to float.
        __m256 v8f_a = _mm256_cvtepi32_ps(temp1 = _mm256_insertf128_si256(_mm256_castsi128_si256(_mm_unpacklo_epi16(v8n_a, v8n_0)), _mm_unpackhi_epi16(v8n_a, v8n_0), 1)),
               v8f_b = _mm256_cvtepi32_ps(temp2 = _mm256_insertf128_si256(_mm256_castsi128_si256(_mm_unpacklo_epi16(v8n_b, v8n_0)), _mm_unpackhi_epi16(v8n_b, v8n_0), 1));
        // Geometric mean per lane, converted back to 32-bit integers.
        __m256i v8n_meanInt32 = _mm256_cvtps_epi32(_mm256_sqrt_ps(_mm256_mul_ps(v8f_a, v8f_b)));
        // Split the 256-bit result into two 128-bit halves and store them.
        __m128i v4n_meanLo = _mm256_castsi256_si128(v8n_meanInt32),
                v4n_meanHi = _mm256_extractf128_si256(v8n_meanInt32, 1);
        g_data[i % 8] = v4n_meanLo;
        g_data[(i + 1) % 8] = v4n_meanHi;
    }
    return 0;
}
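For cross-checking what each lane of the loop above should produce, a scalar reference can help. This is only a minimal sketch and not part of my program: `geomean_ref` is a hypothetical helper name, and it assumes the default round-to-nearest-even rounding mode, which is what `_mm256_cvtps_epi32` uses unless MXCSR has been changed.

```cpp
#include <cmath>
#include <cstdint>

// Scalar reference for one lane: round(sqrt(a * b)) computed in single precision.
// std::lrintf honours the current rounding mode (round-to-nearest-even by default).
std::int32_t geomean_ref(std::uint16_t a, std::uint16_t b)
{
    const float product = static_cast<float>(a) * static_cast<float>(b);
    return static_cast<std::int32_t>(std::lrintf(std::sqrt(product)));
}
```

In iteration i of the loop, every lane uses a = (2i + 1) % 16 and b = (2i + 2) % 16, so comparing any lane of the stored result against `geomean_ref(a, b)` should be enough to confirm the vector code computes the right values.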
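The key to the mystery is that I'm using Intel ICC 11, and it is only slow when compiled with icc -O3 sqrt.cpp. If I compile with icc -O3 -xavx sqrt.cpp, it runs 10x faster.
But it isn't obvious whether emulation is happening, because I used performance counters and the number of instructions executed is roughly 4G for both versions: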
Performance counter stats for 'a.out':
16867.119538 task-clock # 0.999 CPUs utilized
37 context-switches # 0.000 M/sec
8 CPU-migrations # 0.000 M/sec
281 page-faults # 0.000 M/sec
35,463,758,996 cycles # 2.103 GHz
23,690,669,417 stalled-cycles-frontend # 66.80% frontend cycles idle
20,846,452,415 stalled-cycles-backend # 58.78% backend cycles idle
4,023,012,964 instructions # 0.11 insns per cycle
# 5.89 stalled cycles per insn
304,385,109 branches # 18.046 M/sec
42,636 branch-misses # 0.01% of all branches
16.891160582 seconds time elapsed
------------------------------------ with -xavx ------------------------------------
Performance counter stats for 'a.out':
1288.423505 task-clock # 0.996 CPUs utilized
3 context-switches # 0.000 M/sec
2 CPU-migrations # 0.000 M/sec
279 page-faults # 0.000 M/sec
2,708,906,702 cycles # 2.102 GHz
1,608,134,568 stalled-cycles-frontend # 59.36% frontend cycles idle
798,177,722 stalled-cycles-backend # 29.46% backend cycles idle
3,803,270,546 instructions # 1.40 insns per cycle
# 0.42 stalled cycles per insn
300,601,809 branches # 233.310 M/sec
15,167 branch-misses # 0.01% of all branches
1.293986790 seconds time elapsed
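To make the comparison between the two runs explicit, here is a small sketch that only redoes the arithmetic on the counter values printed above. The struct and function names are made up for illustration; the numbers are copied from the two perf outputs.

```cpp
#include <cstdint>

// Counter values copied from the two perf runs above.
struct PerfRun {
    std::uint64_t cycles;
    std::uint64_t instructions;
    double seconds;
};

const PerfRun kWithoutXavx = {35463758996ULL, 4023012964ULL, 16.891160582};
const PerfRun kWithXavx    = {2708906702ULL, 3803270546ULL, 1.293986790};

// Instructions retired per cycle.
double ipc(const PerfRun& r)
{
    return static_cast<double>(r.instructions) / static_cast<double>(r.cycles);
}

// How many times more cycles the slow run needs than the fast run.
double cycle_ratio(const PerfRun& slow, const PerfRun& fast)
{
    return static_cast<double>(slow.cycles) / static_cast<double>(fast.cycles);
}

// How many times more instructions the slow run executes than the fast run.
double instruction_ratio(const PerfRun& slow, const PerfRun& fast)
{
    return static_cast<double>(slow.instructions) / static_cast<double>(fast.instructions);
}
```

The instruction counts differ by only about 6%, while the cycle counts differ by a factor of about 13, so the slowdown comes from the cost per instruction and not from extra instructions being executed.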
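Is there some kind of processor-internal emulation going on? I know that with denormal numbers, adds end up being about 64 times slower than normal.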
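On the denormal point: as far as I can tell they should not occur with these inputs, since every value is a small integer. The sketch below checks that for all (a, b) pairs in [0, 15]; `any_subnormal_in_loop` is a hypothetical name, and the check only covers the values themselves, not what the hardware does internally.

```cpp
#include <cmath>

// Returns true if any input, product, or square root that the loop can see
// (a and b in [0, 15]) is a subnormal single-precision value.
bool any_subnormal_in_loop()
{
    for (int a = 0; a < 16; ++a) {
        for (int b = 0; b < 16; ++b) {
            const float fa = static_cast<float>(a);
            const float fb = static_cast<float>(b);
            const float product = fa * fb;
            const float root = std::sqrt(product);
            const float values[] = {fa, fb, product, root};
            for (float v : values) {
                if (std::fpclassify(v) == FP_SUBNORMAL) {
                    return true;
                }
            }
        }
    }
    return false;
}
```

Every value here is either zero or a normal float between 1 and 225, so the function returns false and denormals do not look like the explanation for the slow run.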