c - GCC: vectorization difference between two similar loops

Question

When compiling with gcc -O3, why is the following loop not (automatically) vectorized:

#define SIZE (65536)

int a[SIZE], b[SIZE], c[SIZE];

int foo () {
  int i, j;

  for (i=0; i<SIZE; i++){
    for (j=i; j<SIZE; j++) {
      a[i] = b[i] > c[j] ? b[i] : c[j];
    }
  }
  return a[0];
}

while the following one is?

#define SIZE (65536)

int a[SIZE], b[SIZE], c[SIZE];

int foov () {
  int i, j;

  for (i=0; i<SIZE; i++){
    for (j=i; j<SIZE; j++) {
      a[i] += b[i] > c[j] ? b[i] : c[j];
    }
  }
  return a[0];
}

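The only difference is whether the result of the expression in the inner loop is assigned to a[i] or added to a[i].

For reference, -ftree-vectorizer-verbose=6 gives the following output for the first (non-vectorized) loop.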

v.c:8: note: not vectorized: inner-loop count not invariant.
v.c:9: note: Unknown alignment for access: c
v.c:9: note: Alignment of access forced using peeling.
v.c:9: note: not vectorized: live stmt not supported: D.2700_5 = c[j_20];

v.c:5: note: vectorized 0 loops in function.

The same output for the vectorized loop is:

v.c:8: note: not vectorized: inner-loop count not invariant.
v.c:9: note: Unknown alignment for access: c
v.c:9: note: Alignment of access forced using peeling.
v.c:9: note: vect_model_load_cost: aligned.
v.c:9: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
v.c:9: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 1 .
v.c:9: note: vect_model_reduction_cost: inside_cost = 1, outside_cost = 6 .
v.c:9: note: cost model: prologue peel iters set to vf/2.
v.c:9: note: cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown .
v.c:9: note: Cost model analysis:
  Vector inside of loop cost: 3
  Vector outside of loop cost: 27
  Scalar iteration cost: 3
  Scalar outside cost: 7
  prologue iterations: 2
  epilogue iterations: 2
  Calculated minimum iters for profitability: 8

v.c:9: note:   Profitability threshold = 7

v.c:9: note: Profitability threshold is 7 loop iterations.
v.c:9: note: LOOP VECTORIZED.
v.c:5: note: vectorized 1 loops in function.

score 31 · Accepted Answer

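In the first case: the code overwrites the same memory location a[i] in every iteration. Because the loop iterations are not independent, this inherently serializes the loop.
(In reality, only the final iteration is actually needed, so the entire inner loop could be taken out.)

In the second case: GCC recognizes the loop as a reduction operation, for which it has special-case handling to vectorize.

Compiler vectorization is often implemented as some sort of "pattern matching". That is, the compiler analyzes the code to see whether it fits a certain pattern that it is able to vectorize. If it does, it gets vectorized. If it doesn't, it doesn't.

This seems to be a corner case where the first loop doesn't fit any of the pre-coded patterns that GCC can handle. The second case, however, fits the "vectorizable reduction" pattern.

Here is the relevant part of GCC's source code that emits the "not vectorized: live stmt not supported: " message: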

http://svn.open64.net/svnroot/open64/trunk/osprey-gcc-4.2.0/gcc/tree-vect-analyze.c

if (STMT_VINFO_LIVE_P (stmt_info))
{
    ok = vectorizable_reduction (stmt, NULL, NULL);

    if (ok)
        need_to_vectorize = true;
    else
        ok = vectorizable_live_operation (stmt, NULL, NULL);

    if (!ok)
    {
        if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
        {
            fprintf (vect_dump, 
                "not vectorized: live stmt not supported: ");
            print_generic_expr (vect_dump, stmt, TDF_SLIM);
        }
        return false;
    }
}

From this line:

vectorizable_reduction (stmt, NULL, NULL);

it is clear that GCC is checking whether the statement matches a "vectorizable reduction" pattern.
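As a rough illustration of why a reduction like the `+=` loop can be vectorized, the sketch below computes the inner-loop sum for one fixed i in two ways: sequentially, and with four independent partial sums standing in for SIMD lanes. The names reduce_scalar and reduce_lanes, the lane count of 4, and the small size N are assumptions made for this example. It is a hand-written model of the idea, not what GCC actually emits.

```c
#define N 67      /* hypothetical small size, deliberately not a multiple of 4 */
#define LANES 4   /* assumed lane count, e.g. four 32-bit ints per SSE register */

/* Scalar reference: sum over j in [i, N) of max(bi, c[j]). */
int reduce_scalar(int bi, const int *c, int i) {
  int sum = 0;
  for (int j = i; j < N; j++)
    sum += bi > c[j] ? bi : c[j];
  return sum;
}

/* Same sum, accumulated in LANES independent partial sums. */
int reduce_lanes(int bi, const int *c, int i) {
  int part[LANES] = {0, 0, 0, 0};
  int j = i;

  /* Main body: LANES elements per step, each lane keeps its own running sum. */
  for (; j + LANES <= N; j += LANES)
    for (int l = 0; l < LANES; l++)
      part[l] += bi > c[j + l] ? bi : c[j + l];

  /* Combine the lanes once, after the main body. */
  int sum = part[0] + part[1] + part[2] + part[3];

  /* Scalar epilogue for the leftover elements. */
  for (; j < N; j++)
    sum += bi > c[j] ? bi : c[j];
  return sum;
}
```

Reordering the additions this way is valid because integer addition is associative and commutative (as long as nothing overflows). The plain assignment in the first loop has no such accumulator to split across lanes.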

score 4 · Accepted Answer

The GCC vectorizer is probably not smart enough to vectorize the first loop. The addition case is easier to vectorize because a + 0 == a. Consider SIZE==4:

  0 1 2 3 i
0 X
1 X X
2 X X X
3 X X X X
j

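X marks the combinations of i and j for which a is assigned to or incremented. In the addition case, we can compute the results of b[i] > c[j] ? b[i] : c[j] for, say, j==1 and i==0..4 and put them into a vector D. Then we only need to zero D[2..3] and add the resulting vector to a[0..3]. The assignment case is a little trickier: we must not only zero D[2..3] but also zero A[0..1], and only then combine the results. I guess this is where the vectorizer fails.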
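The addition case described above can be sketched as a loop interchange: for each j, compute a full row D over i, zero the entries with i > j, and add the row to a. The function names and the temporary array D are illustrative assumptions. This only models the reasoning in this answer and makes no claim about GCC's internal transformation.

```c
#define N 4  /* same size as the diagram above */

/* Original accumulation: a[i] += max(b[i], c[j]) for every j >= i. */
void accumulate_original(int *a, const int *b, const int *c) {
  for (int i = 0; i < N; i++)
    for (int j = i; j < N; j++)
      a[i] += b[i] > c[j] ? b[i] : c[j];
}

/* Interchanged form: one masked row D per j, added to a as a whole. */
void accumulate_masked(int *a, const int *b, const int *c) {
  int D[N];
  for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i++)
      D[i] = b[i] > c[j] ? b[i] : c[j];  /* full row, independent of a */
    for (int i = j + 1; i < N; i++)
      D[i] = 0;                           /* zero the entries with i > j */
    for (int i = 0; i < N; i++)
      a[i] += D[i];                       /* adding 0 leaves a[i] unchanged */
  }
}
```

Each of the three inner loops works on whole rows with no cross-iteration dependence, which is what makes the masked form straightforward to express with vector operations.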

score 4 · Accepted Answer

The first loop is equivalent to

#define SIZE (65536)

int a[SIZE], b[SIZE], c[SIZE];

int foo () {
  int i, j;

  for (i=0; i<SIZE; i++){
    a[i] = b[i] > c[SIZE - 1] ? b[i] : c[SIZE - 1];
  }
  return a[0];
}

The problem with the original expression is that it really doesn't make much sense, so it is not too surprising that gcc can't vectorize it.
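The claimed equivalence can be checked with a small sketch that runs both forms on pseudo-random input and compares the resulting arrays. The helper names, the reduced size N, and the use of rand() are assumptions made for this example.

```c
#include <stdlib.h>

#define N 64  /* hypothetical small size so the O(N^2) loop stays quick */

static int b[N], c[N];

/* Original form: the inner loop overwrites a[i] for every j. */
static void fill_original(int *a) {
  for (int i = 0; i < N; i++)
    for (int j = i; j < N; j++)
      a[i] = b[i] > c[j] ? b[i] : c[j];
}

/* Simplified form: only the last iteration, j == N - 1, survives. */
static void fill_simplified(int *a) {
  for (int i = 0; i < N; i++)
    a[i] = b[i] > c[N - 1] ? b[i] : c[N - 1];
}

/* Returns 1 if both forms produce the same array for the given seed. */
int forms_agree(unsigned seed) {
  int a1[N], a2[N];
  srand(seed);
  for (int i = 0; i < N; i++) {
    b[i] = rand() % 1000;
    c[i] = rand() % 1000;
  }
  fill_original(a1);
  fill_simplified(a2);
  for (int i = 0; i < N; i++)
    if (a1[i] != a2[i])
      return 0;
  return 1;
}
```

The inner loop always runs at least once (j starts at i, and i < N), so its last write is always the one with j == N - 1. That is why the two forms agree.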

score 1 · Accepted Answer

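The first one simply overwrites a[] many times (the intermediate values are temporary). The second one uses the previous value of a[] each time (not temporary).

Until a patched version is available, you can use a "volatile" variable to get it vectorized.

Use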

int * c=malloc(sizeof(int));

to make it aligned;

v.c:9: note: Unknown alignment for access: c

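shows that "c" is in a different storage area than b and a.

I assume the "vectorized" code uses "movaps"-like instructions (from the SSE/AVX instruction list).

Here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html#using

the 6th and 7th examples show similar difficulties.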
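As a side note on the alignment remark, one standard way to request a specific alignment for global arrays is the C11 alignas specifier. The sketch below assumes a 16-byte requirement (the alignment expected by movaps-style loads), and is_aligned16 is a hypothetical helper. It does not verify whether this changes the "Unknown alignment" note for any particular GCC version.

```c
#include <stdalign.h>
#include <stdint.h>

#define SIZE (65536)

/* C11 alignas: ask for 16-byte alignment on each array (assumed requirement). */
alignas(16) int a[SIZE];
alignas(16) int b[SIZE];
alignas(16) int c[SIZE];

/* Returns 1 if p lies on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p) {
  return ((uintptr_t)p % 16) == 0;
}
```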
