
The Intel Xeon Phi provides the "IMCI" instruction set,
which I used to compute "c = a*b", like this:

float* x = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT);
float* y = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT);
float z[N];
_Cilk_for(size_t i = 0; i < N; i+=16)
{
    __m512 x_1Vec = _mm512_load_ps(x+i);
    __m512 y_1Vec = _mm512_load_ps(y+i);

    __m512 ans = _mm512_mul_ps(x_1Vec, y_1Vec);
    _mm512_store_pd(z+i,ans);

}

I tested its performance: with N = 1048576,
it takes 0.083317 seconds. I want to compare this against auto-vectorization,
so the other version of the code is as follows:

_Cilk_for(size_t i = 0; i < N; i++)
    z[i] = x[i] * y[i];

This version takes 0.025475 seconds (but sometimes 0.002285 or less, and I don't know why).
If I change _Cilk_for to #pragma omp parallel for, the performance becomes poor.

So if this is how it comes out, why do we need intrinsics at all?
Have I made a mistake somewhere?
Can anyone give me some good suggestions for optimizing the code?


2 Answers


The measurements don't mean much, because of various mistakes.

  • The code is storing 16 floats as 8 doubles. The _mm512_store_pd should be _mm512_store_ps.
  • The code is using _mm512_store_... on a location with address z+i that is not guaranteed to be 64-byte aligned, which may cause a segmentation fault. Use __declspec(align(64)) on z to fix this (the corrected sketch after this list allocates z with _mm_malloc instead, which achieves the same thing).
  • The arrays x and y are never initialized. That risks introducing random denormal values, which might impact performance. (I'm not sure whether this is an issue on Intel Xeon Phi.)
  • There's no evidence that z is used, so the optimizer might remove the calculation entirely. I don't think that's happening here, but it's a risk with trivial benchmarks like this. Also, allocating a large array on the stack risks stack overflow.
  • A single run of each example is probably a poor benchmark, because the time is probably dominated by the fork/join overheads of the _Cilk_for. Assuming 120 Cilk workers (the default for 60 4-way threaded cores), there are only about 1048576/120/16 ≈ 546 iterations per worker. With a clock rate over 1 GHz, that won't take long. In fact, the work in the loop is so small that most likely some workers never get a chance to steal work, which might account for why the _Cilk_for outruns OpenMP: in OpenMP, all the threads must take part in a fork/join for a parallel region to finish.
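
Folding those fixes together, a minimal corrected sketch of the intrinsics version might look like the following (my reconstruction, not the questioner's code; z is allocated with _mm_malloc to get both 64-byte alignment and heap storage, and a checksum over z keeps the optimizer from discarding the work; timing and repetition are omitted for brevity):

#include <immintrin.h>
#include <stdio.h>

#define N 1048576
#define ALIGNMENT 64

int main(void)
{
    float* x = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT);
    float* y = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT);
    float* z = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT); // aligned, and not on the stack

    for (size_t i = 0; i < N; ++i) {  // initialize inputs so there are no denormals
        x[i] = 1.0f + i;
        y[i] = 2.0f;
    }

    _Cilk_for(size_t i = 0; i < N; i += 16)
    {
        __m512 x_1Vec = _mm512_load_ps(x+i);
        __m512 y_1Vec = _mm512_load_ps(y+i);
        __m512 ans = _mm512_mul_ps(x_1Vec, y_1Vec);
        _mm512_store_ps(z+i, ans);    // store_ps, not store_pd
    }

    float checksum = 0.0f;            // use z so the loop cannot be optimized away
    for (size_t i = 0; i < N; ++i)
        checksum += z[i];
    printf("%f\n", checksum);

    _mm_free(x); _mm_free(y); _mm_free(z);
    return 0;
}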

If the test were written to correct all the mistakes, it would essentially be computing z[:] = x[:]*y[:] on a large array. Because of the wide vector units on Intel(R) Xeon Phi(TM), this becomes a test of memory/cache bandwidth, not ALU speed, since the ALU is quite capable of outrunning memory bandwidth.
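
For reference, that array-notation statement spelled out with explicit extents, which Cilk Plus requires when z, x, and y are pointers rather than fixed-size arrays:

// Cilk Plus array notation: one statement for the whole elementwise product.
z[0:N] = x[0:N] * y[0:N];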

Intrinsics are useful for things that can't be expressed as parallel/simd loops, typically stuff needing fancy permutations. For example, I've used intrinsics to do a 16-element prefix-sum operation on MIC (only 6 instructions if I remember correctly).
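
As an illustration of the kind of permutation-heavy work meant here, below is my own sketch of a 16-element inclusive prefix sum, not the answerer's code. It is a Hillis-Steele scan built on _mm512_alignr_epi32, which exists both in IMCI and in AVX-512; this version takes 8 vector instructions rather than the 6 mentioned, since the original presumably used IMCI-specific tricks.

#include <immintrin.h>

// 16-lane inclusive prefix sum as a Hillis-Steele scan: four rounds of
// "add a copy of the vector shifted up by 1, 2, 4, 8 lanes".
// _mm512_alignr_epi32(a, zero, 16-n) shifts a's lanes up by n,
// filling the vacated low lanes with zeros.
static __m512 prefix_sum16(__m512 v)
{
    const __m512i zero = _mm512_setzero_si512();
    __m512i t;

    t = _mm512_castps_si512(v);
    v = _mm512_add_ps(v, _mm512_castsi512_ps(_mm512_alignr_epi32(t, zero, 15))); // shift by 1
    t = _mm512_castps_si512(v);
    v = _mm512_add_ps(v, _mm512_castsi512_ps(_mm512_alignr_epi32(t, zero, 14))); // shift by 2
    t = _mm512_castps_si512(v);
    v = _mm512_add_ps(v, _mm512_castsi512_ps(_mm512_alignr_epi32(t, zero, 12))); // shift by 4
    t = _mm512_castps_si512(v);
    v = _mm512_add_ps(v, _mm512_castsi512_ps(_mm512_alignr_epi32(t, zero, 8)));  // shift by 8
    return v; // lane i now holds the sum of the original lanes 0..i
}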

Answered 2014-05-23T19:57:43.540

My answer below applies equally to Intel Xeon and Intel Xeon Phi.

  1. Intrinsics-based solutions are the most "powerful", just "like" assembly coding is.
    • On the downside, however, intrinsics-based solutions are usually not portable, are not a "productivity"-oriented approach, and are often inapplicable to established "legacy" software codebases.
    • They also typically require the programmer to be a low-level or even microarchitecture expert.
  2. However, there are alternatives to intrinsics/assembly coding. They are:
    • A) Auto-vectorization (when the compiler recognizes certain patterns and generates vector code automatically)
    • B) "Explicit" or user-guided vectorization (when the programmer gives the compiler some guidance about what to vectorize and under which conditions, etc.; explicit vectorization usually implies the use of keywords or pragmas; see the sketch after this list)
    • C) Using VEC classes or other kinds of intrinsics-wrapping libraries, or even very specialized compilers. (In fact, in terms of productivity and incremental updates to legacy code, 2.C is often as bad as intrinsics coding.)
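
To make 2.B concrete, here is a minimal sketch of explicit vectorization using the OpenMP 4.0 simd pragma, one of the frameworks mentioned below (my example, not from the original answer):

// The pragma asserts that the loop is safe to vectorize, rather than
// relying on the compiler's own dependence analysis (2.A).
void mul(float *restrict z, const float *restrict x,
         const float *restrict y, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        z[i] = x[i] * y[i];
}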

In your second code snippet you appear to be using "explicit" vectorization, which is currently achievable with the Cilk Plus and OpenMP 4.0 "frameworks", supported by all recent versions of the Intel compiler and also by GCC 4.9. (I say you appear to be using explicit vectorization because _Cilk_for was originally invented for multithreading; however, the latest versions of the Intel compiler may automatically parallelize and vectorize the loop when cilk_for is used.)

Answered 2014-05-22T14:23:05.020