c++ - Not vectorized: not suitable for gather D.32476_34 = *D.32475_33;

Question

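I want my code to be auto-vectorized by the compiler, but I can't seem to get it right. In particular, the message I get with -ftree-vectorizer-verbose=6 is: 125: not vectorized: not suitable for gather D.32476_34 = *D.32475_33;

My question is what exactly this message means and what the numbers stand for.

Below is a simple test example I created that produces the same message, so I assume the problems are related.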

static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices, int indices_num)
{   
  for (int i = 0; i < indices_num; ++i)
  {
    int idx = indices[i] * 4;

    float r = pixels[idx + 0];
    float g = pixels[idx + 1];
    float b = pixels[idx + 2];
    float a = pixels[idx + 3] / 255.0f;

    pixels[idx + 0] = r;
    pixels[idx + 1] = g;
    pixels[idx + 2] = b;
    pixels[idx + 3] = a * 255.0f;
  }

  return;
}

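Also, while creating my example I ran into a whole bunch of other messages, and I'm not sure what they mean or why particular constructs cause vectorization problems. Are there any guides, books, tutorials, blogs, etc. that would explain these things to me?

In case it matters, I'm using MinGW 4.7 32-bit with QtCreator 2.7.0.

Edit: Conclusion:

Based on my tests and the suggestions in this post, the message is most likely related to accessing data indirectly through an auxiliary index array, which results in a gather/scatter addressing scheme that GCC currently cannot (or does not want to) vectorize. However, I was able to generate vectorized code with clang++ 3.2-1.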

score 2 · Accepted Answer

Conceptually, the vectorized version of the code would look like this (using OpenCL syntax):

for (int i = 0; i < indices_num; ++i)
{
  int idx = indices[i] * 4;
  float4 factor = (float4)(1.0f, 1.0f, 1.0f, 255.0f);

  uchar4 x1 = vload4(0, pixels + idx); // Line A
  float4 x2 = convert_float4(x1);
  float4 x3 = x2 / factor;
  float4 x4 = x3 * factor;
  uchar4 x5 = convert_uchar4(x4);
  vstore4(x5, 0, pixels + idx); // Line B
}

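But hold on; in Line A you are trying to load four chars (aka uint8) from an index-dependent address, and to store them back in Line B. This kind of gather/scatter access is not a common feature on x86; the only instruction sets I know of that support it are AVX2 (Intel Haswell and later; gather only, and not for byte-sized elements) and the Xeon Phi's. Unless you are compiling for one of those, that may explain why your compiler turns down this vectorization opportunity.

The compiler could of course load the 4 uint8s individually, build a vector from them, do the required vector operations, and then store the 4 values back by hand; but my guess is that, without gather and scatter, loading and storing the values individually is considered too expensive compared to the actual work saved by vectorizing.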
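As a rough illustration of that manual approach, here is a minimal C++ sketch that treats each pixel as one 4-wide unit: copy its four bytes out, convert them to floats, apply the factors, and copy the bytes back. The names process_pixel and process_indexed are hypothetical and not from the original post, and nothing here guarantees that a given compiler will actually emit vector instructions for it.

```cpp
#include <cstring>

// Hypothetical helper (not from the original post): handles one RGBA pixel
// as a 4-wide unit, without relying on gather/scatter instructions.
static void process_pixel(unsigned char* pixels, int idx)
{
    const float factor[4] = {1.0f, 1.0f, 1.0f, 255.0f};

    unsigned char in[4];
    std::memcpy(in, pixels + idx, 4);      // counterpart of "Line A": 4 adjacent bytes

    float v[4];
    for (int c = 0; c < 4; ++c)            // fixed trip count over contiguous data
        v[c] = static_cast<float>(in[c]) / factor[c] * factor[c];

    unsigned char out[4];
    for (int c = 0; c < 4; ++c)
        out[c] = static_cast<unsigned char>(v[c]);

    std::memcpy(pixels + idx, out, 4);     // counterpart of "Line B"
}

// Hypothetical driver: the outer loop stays scalar because of indices[i].
static void process_indexed(unsigned char* pixels, const int* indices, int indices_num)
{
    for (int i = 0; i < indices_num; ++i)
        process_pixel(pixels, indices[i] * 4);
}
```

The indirection through indices[i] remains in the outer loop, so at best only the four channels of a single pixel are handled together, which is little work compared to the cost of the loads and stores.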

score 1 · Accepted Answer

Try this code; it uses vectors to multiply (and divide) the variables you want vectorized:

static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices, int indices_num)
{   
  float dividerV[4]={1.0f,1.0f,1.0f,255.0f};
  float multiplierV[4]={1.0f,1.0f,1.0f,255.0f}; //choose anything that suits
  //Can use same vector to both multiply and divide if you want. But having different vectors can give some more pipelining (also needs more memory accesses, so pick carefully)

  for (int i = 0; i < indices_num; ++i)
  {
    int idx = indices[i] * 4;

    float r = pixels[idx + 0]/dividerV[0];
    float g = pixels[idx + 1]/dividerV[1];
    float b = pixels[idx + 2]/dividerV[2];
    float a = pixels[idx + 3]/dividerV[3];

    pixels[idx + 0] = r*multiplierV[0];
    pixels[idx + 1] = g*multiplierV[1];
    pixels[idx + 2] = b*multiplierV[2];
    pixels[idx + 3] = a*multiplierV[3];
  }

  return;
}

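Maybe this is easier to vectorize.

Against unknown loop bounds, try giving a direct constant instead of indices_num. This compiler is not a just-in-time compiler (maybe there are some, but I haven't heard of any other than Java's), so giving a constant known at compile time may work.

Here: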

static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices)
{   
  float dividerV[4]={1.0f,1.0f,1.0f,255.0f};
  float multiplierV[4]={1.0f,1.0f,1.0f,255.0f}; //choose anything that suits
  //Can use same vector to both multiply and divide if you want. But having different vectors can give some more pipelining (also needs more memory accesses, so pick carefully)

  for (int i = 0; i < 1000; ++i)
  {
    int idx = indices[i] * 4;

    float r = pixels[idx + 0]/dividerV[0];
    float g = pixels[idx + 1]/dividerV[1];
    float b = pixels[idx + 2]/dividerV[2];
    float a = pixels[idx + 3]/dividerV[3];

    pixels[idx + 0] = r*multiplierV[0];
    pixels[idx + 1] = g*multiplierV[1];
    pixels[idx + 2] = b*multiplierV[2];
    pixels[idx + 3] = a*multiplierV[3];
  }

  return;
}

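Sometimes arrays are not properly aligned for vector instructions. For example, the CPU may only speed up reads/writes on 32-byte (or 16-byte) aligned arrays. Unaligned reads/writes are slower (or not vectorizable).

Here: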

static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices)
{   
     float dividerV[4]={1.0f,1.0f,1.0f,255.0f};
     float multiplierV[4]={1.0f,1.0f,1.0f,255.0f}; //choose anything that suits

     if(reinterpret_cast<size_t>(pixels)%32!=0)
     { 
      printf("array is not aligned! need to shift array or need to do serial calc. until aligned offset reached!");
      //do non-vectorized calc. When aligned offset reached, goto vectorizing code.
     }
     else
     {
       printf("array is aligned! Starting fast access.");
       for (int i = 0; i < 1000; ++i)
       {
           int idx = indices[i] * 4;

           float r = pixels[idx + 0]/dividerV[0];
           float g = pixels[idx + 1]/dividerV[1];
           float b = pixels[idx + 2]/dividerV[2];
           float a = pixels[idx + 3]/dividerV[3];

           pixels[idx + 0] = r*multiplierV[0];
           pixels[idx + 1] = g*multiplierV[1];
           pixels[idx + 2] = b*multiplierV[2];
           pixels[idx + 3] = a*multiplierV[3];
       }

       return;
   }
}
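The alignment test in the code above can be pulled out into small standalone helpers. The following is a minimal sketch; is_aligned and bytes_until_aligned are hypothetical names introduced here for illustration, and the second one computes how many leading bytes scalar code would have to handle before an aligned address is reached.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p is a multiple of `alignment` bytes.
// `alignment` is assumed to be non-zero.
static bool is_aligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical helper: number of leading bytes to process with scalar code
// before p reaches the next `alignment`-byte boundary (0 if already aligned).
static std::size_t bytes_until_aligned(const void* p, std::size_t alignment)
{
    const std::size_t misalign =
        static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(p) % alignment);
    return misalign == 0 ? 0 : alignment - misalign;
}
```

Note that in the indexed loop each pixel sits at pixels + indices[i] * 4, so an aligned base pointer only implies 4-byte alignment of the individual pixels, not 16- or 32-byte alignment.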

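Maybe someone could open up memcpy or some array-copy asm file, inject some multiplication code into it, and compile it as memcpy_with_multiplication(,,,)?

My last suggestion: wrap r, g, b, a in an array so that they sit at contiguous addresses. Here: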

static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices)
{   
  float dividerV[4]={1.0f,1.0f,1.0f,255.0f};
  float multiplierV[4]={1.0f,1.0f,1.0f,255.0f}; //choose anything that suits
  //Can use same vector to both multiply and divide if you want. But having different vectors can give some more pipelining (also needs more memory accesses, so pick carefully)

  for (int i = 0; i < 1000; ++i)
  {
    int idx = indices[i] * 4;
    float rgba[4];

    rgba[0] = pixels[idx + 0]/dividerV[0];
    rgba[1] = pixels[idx + 1]/dividerV[1];
    rgba[2] = pixels[idx + 2]/dividerV[2];
    rgba[3] = pixels[idx + 3]/dividerV[3];

    pixels[idx + 0] = rgba[0]*multiplierV[0];
    pixels[idx + 1] = rgba[1]*multiplierV[1];
    pixels[idx + 2] = rgba[2]*multiplierV[2];
    pixels[idx + 3] = rgba[3]*multiplierV[3];
  }

  return;
}

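"indices[i]" is not an explicit index expression. That may be bad. Try some other way of showing this to the compiler. What happens when you just put i instead of indices[i]? Does it compile the same? indices[i] cannot be known at compile time, or is too complex for the compiler.

Simpler (and wrong), but more vectorizable: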

static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices)
{   
  float dividerV[4]={1.0f,1.0f,1.0f,255.0f};
  float multiplierV[4]={1.0f,1.0f,1.0f,255.0f}; //choose anything that suits

  //you need a sorted version of the indices[] (or pixels[]) array to achieve something like this.
  for (int i = 0; i < 4000; i+=4) 
  {
    float rgba[4];

    rgba[0] = pixels[i + 0]/dividerV[0];
    rgba[1] = pixels[i + 1]/dividerV[1];
    rgba[2] = pixels[i + 2]/dividerV[2];
    rgba[3] = pixels[i + 3]/dividerV[3];

    pixels[i + 0] = rgba[0]*multiplierV[0];
    pixels[i + 1] = rgba[1]*multiplierV[1];
    pixels[i + 2] = rgba[2]*multiplierV[2];
    pixels[i + 3] = rgba[3]*multiplierV[3];
  }

  return;
}
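One way to get a contiguous loop like the one above without changing which pixels are touched is to stage the data: copy the indexed pixels into a contiguous scratch buffer with scalar code, run the arithmetic over that buffer, then copy the results back. The sketch below is only an illustration of that idea, not something from the original answers; process_via_staging is a hypothetical name, and it assumes every entry of indices is distinct and in range (with repeated indices the result could differ from the original loop, which would process such a pixel more than once).

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical sketch: assumes all indices are distinct, non-negative and in range.
static void process_via_staging(unsigned char* pixels, const int* indices, int indices_num)
{
    if (indices_num <= 0) return;
    const std::size_t n = static_cast<std::size_t>(indices_num);
    std::vector<unsigned char> staged(n * 4);

    // Scalar "gather": copy each indexed pixel into the contiguous buffer.
    for (std::size_t i = 0; i < n; ++i)
        std::memcpy(&staged[i * 4], pixels + static_cast<std::size_t>(indices[i]) * 4, 4);

    // Stride-1 loop over contiguous memory; a candidate for auto-vectorization.
    for (std::size_t p = 0; p < n * 4; p += 4)
    {
        float r = staged[p + 0];
        float g = staged[p + 1];
        float b = staged[p + 2];
        float a = staged[p + 3] / 255.0f;

        staged[p + 0] = static_cast<unsigned char>(r);
        staged[p + 1] = static_cast<unsigned char>(g);
        staged[p + 2] = static_cast<unsigned char>(b);
        staged[p + 3] = static_cast<unsigned char>(a * 255.0f);
    }

    // Scalar "scatter": write the processed pixels back to their original slots.
    for (std::size_t i = 0; i < n; ++i)
        std::memcpy(pixels + static_cast<std::size_t>(indices[i]) * 4, &staged[i * 4], 4);
}
```

Whether the two extra copies pay off depends on how much arithmetic is done per pixel; for a loop as light as this one they may well cost more than they save, so it is worth measuring.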
