gcc - Hinting vectorization to GCC

Question

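I would like GCC to vectorize the code below. -fopt-info tells me that GCC currently does not. I think the problem lies in how W is accessed or in the way k is decremented. Note that height and width are constants and that index_type is currently set to unsigned long.

I removed some comments (which is why the line numbers below skip).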

114  for (index_type k=height-1;k+1>0;k--) {
116    for (index_type i=0;i<width;i++) {
117      Yp[k*width + i] = 0.0;                                                            
119      for (index_type j=0;j<width;j++) {                                                                            
121        Yp[k*width + i] += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
122      }
123      Yp[k*width + i] *= DF(Ap[k*width + i]);
124    }
125  }

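I am compiling with gcc -O3 -ffast-math -fopt-info -std=c11 ./neural.c -o neural -lm

Is there a good way to make this vectorize? Can you point me to further information?

Is my indexing method a bad idea (i.e. the k*width*width + ...)? I need to allocate dynamically, and I thought that keeping things close together in memory would make optimization easier.

EDIT: This might be useful

The -fopt-info-missed output for these lines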

./neural.c:114:3: note: not vectorized: multiple nested loops.
./neural.c:114:3: note: bad loop form.
./neural.c:116:5: note: not vectorized: control flow in loop.
./neural.c:116:5: note: bad loop form.
./neural.c:119:7: note: step unknown.
./neural.c:119:7: note: reduction used in loop.
./neural.c:119:7: note: Unknown def-use cycle pattern.
./neural.c:119:7: note: not vectorized: complicated access pattern.
./neural.c:119:7: note: bad data access.
./neural.c:110:21: note: not vectorized: not enough data-refs in basic block.
./neural.c:110:58: note: not vectorized: not enough data-refs in basic block.
./neural.c:110:62: note: not vectorized: not enough data-refs in basic block.
./neural.c:117:18: note: not vectorized: not enough data-refs in basic block.
./neural.c:114:37: note: not vectorized: not enough data-refs in basic block.

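EDIT:

A minimal example is here

I am trying it with BLAS. In the minimal example it runs faster, but on the whole code it is slower... not sure why

EDIT:

The compiler was optimizing code away. Fixed. BLAS is now faster. The fix applied to the whole code, not to the minimal example.

EDIT:

Same code as in the link from the previous edit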

#include <math.h>
#include <cblas.h>
#include <stdlib.h>
#include <stdio.h>

typedef float value_type;
typedef unsigned long index_type;

static value_type F(value_type v) {
  return 1.0/(1.0 + exp(-v));
}

static value_type DF(value_type v) {
  const value_type Ev = exp(-v);
  return Ev/((1.0 + Ev)*(1.0 + Ev));
}

#ifndef WITH_BLAS

static void get_Yp(const value_type * __restrict__ Ap, const value_type * __restrict__ W,
           value_type * __restrict__ Yp, const value_type * __restrict__ Dp,
           const index_type height, const index_type width) {
  for (index_type i=0;i<width;i++) {
    Yp[height*width + i] = 2*DF(Ap[height*width + i])*(Dp[i] - F(Ap[height*width + i]));
  }

  for (index_type k=height-1;k+1>0;k--) {
    for (index_type i=0;i<width;i++) {
      Yp[k*width + i] = 0.0;
      for (index_type j=0;j<width;j++) {
        Yp[k*width + i] += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
      }
      Yp[k*width + i] *= DF(Ap[k*width + i]);
    }
  }
}

#else

static void get_Yp(const value_type * __restrict__ Ap, const value_type * __restrict__ W,
           value_type * __restrict__ Yp, const value_type * __restrict__ Dp,
           const index_type height, const index_type width) {
  for (index_type i=0;i<width;i++) {
    Yp[height*width + i] = 2*DF(Ap[height*width + i])*(Dp[i] - F(Ap[height*width + i]));
  }

  for (index_type k=height-1;k+1>0;k--) {
    cblas_sgemv(CblasRowMajor, CblasTrans, width, width, 1,
        W+k*width*width, width, Yp+(k+1)*width, 1, 0, Yp+k*width, 1);
    for (index_type i=0;i<width;i++)
      Yp[k*width + i] *= DF(Ap[k*width + i]);
  }
}

#endif

int main() {
  const index_type height=10, width=10000;

  value_type *Ap=malloc((height+1)*width*sizeof(value_type)),
    *W=malloc(height*width*width*sizeof(value_type)),
    *Yp=malloc((height+1)*width*sizeof(value_type)),
    *Dp=malloc(width*sizeof(value_type));

  get_Yp(Ap, W, Yp, Dp, height, width);
  printf("Done %f\n", Yp[3]);

  return 0;
}

score 1 · Accepted Answer

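The j-loop is a well-vectorizable SIMD reduction loop with a constant stride of "width" elements. You can vectorize it with modern compilers. This code is vectorizable with the Intel compiler and should be vectorizable by GCC under some conditions.
First of all, a reduction is a special case of a true loop-carried dependency that happens to be vectorizable. So you cannot safely vectorize it unless the "reduction" pattern is either (a) recognized automatically by the compiler (not so easy, and strictly speaking not so valid or expected a behavior) or (b) communicated explicitly to the compiler by the developer, using OpenMP or a similar standard.

To "communicate" to the compiler that there is a reduction, you need to put #pragma omp simd reduction (+ : variable_name) before the j-loop.

This is only supported starting from OpenMP 4.0, so you have to use a GCC version that supports OpenMP 4.x. Quote from https://gcc.gnu.org/wiki/openmp: "GCC 4.9 supports OpenMP 4.0 for C/C++, GCC 4.9.1 also for Fortran". Note that GCC honors the pragma only when you compile with -fopenmp or -fopenmp-simd; the command line in the question has neither.

I would also use a temporary local variable to accumulate the reduction (OpenMP 4.0 requires the reduction variable to be used that way):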

 value_type tmpSUM = 0;
 #pragma omp simd reduction (+: tmpSUM)
 for (index_type j=0;j<width;j++) {
   tmpSUM += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
 }
 Yp[k*width + i] = tmpSUM;
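
For reference, here is a self-contained sketch of the same idea that can be compiled on its own. The helper name column_dot and its parameter list are hypothetical (they are not part of the question's code), W is assumed to point at the width*width block for one value of k, and the index type is a signed ptrdiff_t. Something like gcc -O3 -ffast-math -fopenmp-simd -fopt-info-vec should then report whether the loop was vectorized.

```c
#include <stddef.h>

/* Hypothetical helper (not from the question): computes
   sum over j of W[j*width + i] * y_next[j], i.e. the j-loop for one output i. */
static float column_dot(const float *restrict W, const float *restrict y_next,
                        ptrdiff_t width, ptrdiff_t i)
{
    float tmpSUM = 0.0f;  /* local accumulator used as the reduction variable */

    /* Honored only with -fopenmp or -fopenmp-simd; without either flag
       the pragma is ignored and the compiler decides on its own. */
    #pragma omp simd reduction(+ : tmpSUM)
    for (ptrdiff_t j = 0; j < width; j++) {
        tmpSUM += W[j * width + i] * y_next[j];  /* stride of width elements in W */
    }
    return tmpSUM;
}
```

In the question's loop nest this would be called as column_dot(W + k*width*width, Yp + (k+1)*width, width, i), with the result multiplied by DF(Ap[k*width + i]) before it is stored.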

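I would also suggest using a signed int instead of an unsigned one, because unsigned induction variables are pretty bad for all modern vectorizers, at the very least by introducing extra overhead. I would not be surprised if the use of unsigned was one of the main things that "confused" GCC.
Now, you may be dissatisfied with my reply, because it describes how things should work (and how they do work in ICC/ICPC). It does not take into account the specific nuances of GCC (which, for reductions, seems to do the opposite), as can be seen in the GCC optimization report.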

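So, if you are still limited to GCC, I would suggest:

1. Make sure your GCC is recent enough (4.9 or later).
2. Use a signed induction variable and still try omp simd reduction on a temporary local tmpSUM (it should enable the more advanced vectorization techniques anyway).
3. If the above does not help, have a look at the "weird" things described here (very similar to your case): What do gcc's auto-vectorization messages mean? Or consider using another compiler or compiler version.

Last comment: is the access pattern in your code, and constant stride more generally, really so bad? Answer: on some platforms and with some compilers, constant stride will not kill your performance. Ideally, though, you want a more cache-friendly algorithm. See "Not sure how to explain some of the performance results of my parallelized matrix multiplication code". One more option is to consider MKL or BLAS (as suggested in the other reply) if your code is really memory-bound and you do not have time to deal with memory optimizations yourself.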
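
One common way to get a cache-friendlier access pattern here is loop interchange. To be clear, this is a different technique from the omp simd reduction described above, not something that answer proposes. The sketch below assumes a hypothetical helper layer_backward that handles one value of k, with W pointing at that layer's width*width block. With i as the inner loop, W and the output are both read and written with unit stride, and the inner loop is a plain element-wise update rather than a reduction.

```c
#include <stddef.h>

/* Hypothetical helper (not from the question): for every i computes
   out[i] = sum over j of W[j*width + i] * y_next[j],
   with the i and j loops interchanged relative to the question's code. */
static void layer_backward(const float *restrict W, const float *restrict y_next,
                           float *restrict out, ptrdiff_t width)
{
    for (ptrdiff_t i = 0; i < width; i++)
        out[i] = 0.0f;

    for (ptrdiff_t j = 0; j < width; j++) {
        const float yj = y_next[j];           /* invariant in the inner loop */
        for (ptrdiff_t i = 0; i < width; i++)
            out[i] += W[j * width + i] * yj;  /* unit stride over W and out */
    }
}
```

Each out[i] still accumulates its terms in increasing j order, so the result matches the original loop nest; whether it is actually faster depends on width and on the cache sizes of the machine, so it is worth measuring.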

score 1 · Accepted Answer

I tested the following code with GCC 5.3.1, Clang 3.7.1, ICC 13.0.1, and MSVC 2015

void foo(float *x) {
    unsigned i;
    for (i=0; i<1024; i++) x[0] += x[1024 + i];
}

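I used -Ofast for GCC, Clang, and ICC, and /O2 /fp:fast for MSVC. Looking at the assembly shows that only ICC managed to vectorize the loop.

However, with the same compile options, all of the compilers vectorized the following code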

void foo2(float *x) {
    float sum = 0.0f;
    unsigned i;
    for (i=0; i<1024; i++) sum += x[1024 + i];
    x[0] = sum;
}

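I am not sure why only ICC vectorized foo. It seems clear to me that there is no dependency.

GCC does not unroll the loop (and -funroll-loops does not help). MSVC unrolls it two times, Clang four times, and ICC eight times. Intel processors since at least Core2 have a latency of at least 3 cycles for addition, so unrolling four or more times is going to work better than unrolling twice or not at all.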
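
To illustrate why the unroll factor matters for a reduction: with a single accumulator every addition has to wait for the previous one to finish, so the loop is bound by the add latency. Splitting the sum over several independent accumulators lets the additions overlap. The following is a hand-written scalar sketch of what the compilers do when they unroll; the function name sum4 is hypothetical. It changes the order of the floating-point additions, which is exactly the reassociation that -Ofast or /fp:fast permits the compiler to perform on its own.

```c
#include <stddef.h>

/* Hypothetical helper: sums n floats using four independent accumulators
   so that consecutive additions do not depend on each other. */
static float sum4(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four independent dependency chains. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }

    float sum = (s0 + s1) + (s2 + s3);

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        sum += x[i];
    return sum;
}
```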

In any case, using

value_type sum = 0.0;
for (index_type j=0;j<width;j++) {
    sum += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
}
Yp[k*width + i] = sum*DF(Ap[k*width + i]);

vectorizes your loop.

score 0 · Accepted Answer

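In my experience, getting GCC to vectorize properly is troublesome, especially if you hope to fully utilize modern hardware (e.g. AVX2). I deal with vectorization a lot in my code. My solution: do not attempt to use GCC for vectorization at all. Instead, formulate all the operations you would like to have vectorized in terms of BLAS (Basic Linear Algebra Subprograms) calls, and link against the OpenBLAS library, which implements BLAS for modern hardware. OpenBLAS also offers parallelization on shared-memory nodes (with either pthreads or OpenMP), which, depending on your application, can be very important for further increasing execution speed: this way you can combine vectorization with (shared-memory) parallelization.

Additionally, I suggest that you align your memory so that you can fully utilize AVX and/or AVX2 etc. That is, do not allocate memory with malloc or new; use memalign or aligned_alloc (depending on what your system supports). For example, if you intend to use AVX2, you should align your allocations so that the address is a multiple of 64 bytes (8 doubles of 8 bytes each). A 256-bit AVX2 register strictly needs only 32-byte alignment; 64 bytes additionally matches the usual x86 cache line size.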
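
A minimal sketch of such an allocation, assuming a C11 standard library that provides aligned_alloc (MSVC does not). The helper name alloc_floats_aligned64 is hypothetical. C11 requires the size passed to aligned_alloc to be a multiple of the alignment, so the byte count is rounded up first.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocates room for n floats at a 64-byte aligned
   address. Returns NULL on failure; release the memory with free().
   This sketch does not check n * sizeof(float) for overflow. */
static float *alloc_floats_aligned64(size_t n)
{
    const size_t alignment = 64;
    size_t bytes = n * sizeof(float);

    /* Round up to the next multiple of the alignment (at least one block). */
    bytes = (bytes + alignment - 1) / alignment * alignment;
    if (bytes == 0)
        bytes = alignment;

    return aligned_alloc(alignment, bytes);
}
```

In the question's main, the four malloc calls could be replaced by calls to such a helper. Alignment alone does not make the compiler vectorize; it only removes one obstacle to efficient vector loads and stores.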
