
The Intel Xeon Phi provides the "IMCI" instruction set,
which I used to compute "c = a*b", like this:

float* x = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT);
float* y = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT);
float z[N];
_Cilk_for(size_t i = 0; i < N; i+=16)
{
    __m512 x_1Vec = _mm512_load_ps(x+i);
    __m512 y_1Vec = _mm512_load_ps(y+i);

    __m512 ans = _mm512_mul_ps(x_1Vec, y_1Vec);
    _mm512_store_pd(z+i,ans);

}

I tested its performance: with N = 1048576,
it takes 0.083317 seconds. I want to compare this against auto-vectorization,
so the other version of the code is as follows:

_Cilk_for(size_t i = 0; i < N; i++)
    z[i] = x[i] * y[i];

This version takes 0.025475 seconds (but sometimes 0.002285 or less, and I don't know why).
If I change _Cilk_for to #pragma omp parallel for, the performance becomes poor.

So if this is how it comes out, why do we need intrinsics at all?
Have I made a mistake somewhere?
Can anyone give me some good suggestions for optimizing the code?


2 Answers


The measurements don't mean much, because of various mistakes.

  • The code is storing 16 floats as 8 doubles. The _mm512_store_pd should be _mm512_store_ps.
  • The code is using _mm512_store_... on a location with address z+i that is not guaranteed to be 64-byte aligned, which may cause a segmentation fault. Use __declspec(align(64)) on z to fix this (the corrected sketch after this list allocates z with _mm_malloc instead, which achieves the same thing).
  • The arrays x and y are never initialized. That risks introducing random denormal values, which might impact performance. (I'm not sure whether this is an issue on Intel Xeon Phi.)
  • There's no evidence that z is used, so the optimizer might remove the calculation entirely. I don't think that's happening here, but it's a risk with trivial benchmarks like this. Also, allocating a large array on the stack risks stack overflow.
  • A single run of each example is probably a poor benchmark, because the time is probably dominated by the fork/join overheads of the _Cilk_for. Assuming 120 Cilk workers (the default for 60 4-way threaded cores), there are only about 1048576/120/16 ≈ 546 iterations per worker. With a clock rate over 1 GHz, that won't take long. In fact, the work in the loop is so small that most likely some workers never get a chance to steal work, which might account for why the _Cilk_for outruns OpenMP: in OpenMP, all the threads must take part in a fork/join for a parallel region to finish.
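
Folding those fixes together, a minimal corrected sketch of the intrinsics version might look like the following (my reconstruction, not the questioner's code; z is allocated with _mm_malloc to get both 64-byte alignment and heap storage, and a checksum over z keeps the optimizer from discarding the work; timing and repetition are omitted for brevity):

#include <immintrin.h>
#include <stdio.h>

#define N 1048576
#define ALIGNMENT 64

int main(void)
{
    float* x = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT);
    float* y = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT);
    float* z = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT); // aligned, and not on the stack

    for (size_t i = 0; i < N; ++i) {  // initialize inputs so there are no denormals
        x[i] = 1.0f + i;
        y[i] = 2.0f;
    }

    _Cilk_for(size_t i = 0; i < N; i += 16)
    {
        __m512 x_1Vec = _mm512_load_ps(x+i);
        __m512 y_1Vec = _mm512_load_ps(y+i);
        __m512 ans = _mm512_mul_ps(x_1Vec, y_1Vec);
        _mm512_store_ps(z+i, ans);    // store_ps, not store_pd
    }

    float checksum = 0.0f;            // use z so the loop cannot be optimized away
    for (size_t i = 0; i < N; ++i)
        checksum += z[i];
    printf("%f\n", checksum);

    _mm_free(x); _mm_free(y); _mm_free(z);
    return 0;
}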

If the test were written to correct all the mistakes, it would essentially be computing z[:] = x[:]*y[:] on a large array. Because of the wide vector units on Intel(R) Xeon Phi(TM), this becomes a test of memory/cache bandwidth, not ALU speed, since the ALU is quite capable of outrunning memory bandwidth.
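
For reference, that array-notation statement spelled out with explicit extents, which Cilk Plus requires when z, x, and y are pointers rather than fixed-size arrays:

// Cilk Plus array notation: one statement for the whole elementwise product.
z[0:N] = x[0:N] * y[0:N];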

Intrinsics are useful for things that can't be expressed as parallel/simd loops, typically stuff needing fancy permutations. For example, I've used intrinsics to do a 16-element prefix-sum operation on MIC (only 6 instructions if I remember correctly).
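
As an illustration of the kind of permutation-heavy work meant here, below is my own sketch of a 16-element inclusive prefix sum, not the answerer's code. It is a Hillis-Steele scan built on _mm512_alignr_epi32, which exists both in IMCI and in AVX-512; this version takes 8 vector instructions rather than the 6 mentioned, since the original presumably used IMCI-specific tricks.

#include <immintrin.h>

// 16-lane inclusive prefix sum as a Hillis-Steele scan: four rounds of
// "add a copy of the vector shifted up by 1, 2, 4, 8 lanes".
// _mm512_alignr_epi32(a, zero, 16-n) shifts a's lanes up by n,
// filling the vacated low lanes with zeros.
static __m512 prefix_sum16(__m512 v)
{
    const __m512i zero = _mm512_setzero_si512();
    __m512i t;

    t = _mm512_castps_si512(v);
    v = _mm512_add_ps(v, _mm512_castsi512_ps(_mm512_alignr_epi32(t, zero, 15))); // shift by 1
    t = _mm512_castps_si512(v);
    v = _mm512_add_ps(v, _mm512_castsi512_ps(_mm512_alignr_epi32(t, zero, 14))); // shift by 2
    t = _mm512_castps_si512(v);
    v = _mm512_add_ps(v, _mm512_castsi512_ps(_mm512_alignr_epi32(t, zero, 12))); // shift by 4
    t = _mm512_castps_si512(v);
    v = _mm512_add_ps(v, _mm512_castsi512_ps(_mm512_alignr_epi32(t, zero, 8)));  // shift by 8
    return v; // lane i now holds the sum of the original lanes 0..i
}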

Answered 2014-05-23T19:57:43.540

My answer below applies equally to Intel Xeon and Intel Xeon Phi.

  1. Intrinsics-based solutions are the most "powerful", just "like" assembly coding is.
    • On the downside, however, intrinsics-based solutions are usually not portable, are not a "productivity"-oriented approach, and are often inapplicable to established "legacy" software codebases.
    • They also typically require the programmer to be a low-level or even microarchitecture expert.
  2. However, there are alternatives to intrinsics/assembly coding. They are:
    • A) Auto-vectorization (when the compiler recognizes certain patterns and generates vector code automatically)
    • B) "Explicit" or user-guided vectorization (when the programmer gives the compiler some guidance about what to vectorize and under which conditions, etc.; explicit vectorization usually implies the use of keywords or pragmas; see the sketch after this list)
    • C) Using VEC classes or other kinds of intrinsics-wrapping libraries, or even very specialized compilers. (In fact, in terms of productivity and incremental updates to legacy code, 2.C is often as bad as intrinsics coding.)
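
To make 2.B concrete, here is a minimal sketch of explicit vectorization using the OpenMP 4.0 simd pragma, one of the frameworks mentioned below (my example, not from the original answer):

// The pragma asserts that the loop is safe to vectorize, rather than
// relying on the compiler's own dependence analysis (2.A).
void mul(float *restrict z, const float *restrict x,
         const float *restrict y, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        z[i] = x[i] * y[i];
}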

In your second code snippet you appear to be using "explicit" vectorization, which is currently achievable with the Cilk Plus and OpenMP 4.0 "frameworks", supported by all recent versions of the Intel compiler and also by GCC 4.9. (I say you appear to be using explicit vectorization because _Cilk_for was originally invented for multithreading; however, the latest versions of the Intel compiler may automatically parallelize and vectorize the loop when cilk_for is used.)

Answered 2014-05-22T14:23:05.020