c - Why does gcc auto-vectorization not work on convolution matrices larger than 3x3?

Question

I have implemented the following program for a convolution matrix:

#include <stdio.h>
#include <time.h>

#define NUM_LOOP 1000
#define N 128   //input or output dimension 1
#define M N     //input or output dimension 2
#define P 5     //convolution matrix dimension 1; set it to 3 for a 3x3 convolution matrix
#define Q P     //convolution matrix dimension 2
#define Csize (P*Q)
#define Cdiv  1     //divisor for the filter
#define Coffset 0   //offset

//functions
void unusual(); //unusual implementation of convolution
void naive();
//data
unsigned short int input[N][M] __attribute__(( aligned(32))); // input data
unsigned short int output[N][M] __attribute__(( aligned(32))); // output data
unsigned short int kernel[P][Q] __attribute__(( aligned(32)));//convolution coefficients

int main(){
    struct timespec tStart, tEnd;//used to record the processing time
    double tTotal , tBest=10000;//the minimum total time is assigned to the best time

    int w=0;
    do{// this loop repeats the body to record the best time
        clock_gettime(CLOCK_MONOTONIC,&tStart);

        //function to be executed here :

        unusual();

        clock_gettime(CLOCK_MONOTONIC,&tEnd);
        tTotal = (tEnd.tv_sec - tStart.tv_sec);
        tTotal += (tEnd.tv_nsec - tStart.tv_nsec) / 1000000000.0;

        if(tTotal<tBest)
            tBest=tTotal;
    } while(w++ < NUM_LOOP);

    printf(" The best time: %lf sec in %d repetition for %dX%d matrix\n",tBest,w, N, M);

    return 0;
}

//unusual sequential convolution
void unusual(){
    int i, j,k,temp;

    for (i=P/2; i< N-P/2; i++){
        for(j=Q/2; j< M-Q/2; j++){
            temp=0;
            for(k=0; k< Csize; k++){
                temp += (kernel[k/Q][k%Q]) * (input[i - (P/2) + (k/Q)][j - (Q/2) + (k%Q)]);

            }
            output[i][j]=((temp/(Cdiv))+Coffset);
        }
    }
}
//The naive implementation
inline void naive(){
    int i, j,k,l,temp;
    for (i=P/2; i< N-P/2; i++){
        for(j=Q/2; j< M-Q/2; j++){
            temp=0;

            for(k = 0; k <  P; k++){ 
                for(l = 0; l <  Q; l++){
                    temp += (kernel[k][l]) * (input[i - (P/2)+k][j - (Q/2)+l]);
                }
            }
            output[i][j]=((temp/(Cdiv))+Coffset);
        }
    }
}

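The problem is that when I use -O3 for auto-vectorization, it only works for a 3x3 convolution matrix. I have looked at the assembly output: auto-vectorization only changes the code for the 3x3 kernel, where it improves performance reasonably (20x faster; note that the scalar version of the unusual function is slower than the naive function). For the 5x5 convolution matrix there is no improvement at all.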

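UPDATE: I added the naive implementation to the question. For clarity I also changed the picture size to NxM, renamed the conv matrix to kernel, Cdim1xCdim2 to PxQ, and the seqConv function to unusual. The question is not how to improve the implementation of the unusual function. The question is: given that all the elements sit in the same places in memory and that gcc uses heuristics and so on, why does gcc fail to improve this unusual implementation?

NOTE: the problem is not with the naive implementation. gcc -O3 speeds up the naive implementation by about 7x for 3x3 and 5x5 kernels, and by about 1.5x for 7x7 and 9x9. To improve the convolution further I used intrinsics, and that version is more than 40x faster than the naive implementation, which in turn is about 2x faster than the unusual convolution. So my hand-vectorized version is about 80x faster than my unusual one. Hand-tuned optimization is not the issue. The issue is the auto-vectorizer, and the reason it fails.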

GCC command: gcc -Wall -march=native -O3 -o "%e" "%f"

Platform: Linux Mint, Skylake, gcc 6.2

Thanks in advance

score 3 · Accepted Answer

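It seems that nobody is interested in answering this question, so I will share my findings and update my answer in the future.

First update: in my experience, gcc -fopt-info-vec reports vectorization only for Csize <= 16. This is because the vectorization factor is 16, and it is one of the reasons gcc does not vectorize the unusual implementation for other kernel sizes. The vectorization factor is the number of elements that fit in one vector. In this case a short integer is a 16-bit element.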
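To make the number 16 concrete, here is a minimal sketch of the arithmetic. The helper name vectorization_factor is hypothetical, and the 256-bit register width is my assumption (AVX2, which -march=native would normally select on Skylake); gcc does not expose such a function.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: how many elements of elem_bytes bytes fit in
   one vector register that is vector_bits bits wide. */
static size_t vectorization_factor(size_t vector_bits, size_t elem_bytes)
{
    return vector_bits / (elem_bytes * CHAR_BIT);
}
```

With 2-byte shorts this gives 16 for a 256-bit register and 8 for a 128-bit SSE register, which matches the Csize <= 16 limit reported above.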

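From Wikipedia:

In the first step, the compiler looks for obstacles that can prevent vectorization. A major obstacle for vectorization is a true data dependency shorter than the vector length. Other obstacles include function calls and short iteration counts.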
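As an illustration of the "true data dependency" obstacle in that quote, here is a small sketch with two loops. The function names are mine and are not taken from the question's code.

```c
#include <stddef.h>

/* Independent iterations: out[i] depends only on the inputs, so a
   vectorizer is free to compute several i at once. */
void add_arrays(unsigned short *restrict out,
                const unsigned short *restrict a,
                const unsigned short *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (unsigned short)(a[i] + b[i]);
}

/* True (read-after-write) dependency of distance 1: iteration i reads
   the value that iteration i-1 has just written. The distance is
   shorter than any vector length, so the loop cannot simply be
   executed several iterations at a time. */
void prefix_sum(unsigned short *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = (unsigned short)(a[i] + a[i - 1]);
}
```

The convolution loops in the question are of the first kind, with an additional sum reduction into temp. That is why the quoted obstacle that most plausibly applies here is the short iteration count of the kernel loop.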

score 3 · Accepted Answer

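The main obstacle for the auto-vectorizer is a non-constant loop bound. In your implementation, if you wrote int Csize = P*Q; the loop would not be vectorized, so keep this in mind when you want to help the auto-vectorizer. It is not the problem here, because you declared Csize with #define Csize, but watch out for it in your own work. Next, your unusual implementation is a loop transformation of the naive implementation, which is an optimization technique used by compilers. It seems you have ruined the naive implementation by applying it. Your finding says vectorization is limited by the factor of 16, so I unrolled your unusual function, and the auto-vectorizer reports that it has been vectorized.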

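for(k=0; k< P*Q; k+=2){ // P*Q is odd, so in the last iteration the k+1 term indexes past the kernel; a remainder step is needed
                temp += (kernel[k/Q][k%Q]) * (input[i - (P/2) + (k/Q)][j - (Q/2) + (k%Q)]);
                temp += (kernel[(k+1)/Q][(k+1)%Q]) * (input[i - (P/2) + ((k+1)/Q)][j - (Q/2) + ((k+1)%Q)]);
}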

It also works for a 7x7 kernel:

for(k=0; k< P*Q; k+=4){//IACA_START (P*Q = 49 is not a multiple of 4, so the last iteration also needs a remainder step)
                temp += (kernel[k/Q][k%Q]) * (input[i - (P/2) + (k/Q)][j - (Q/2) + (k%Q)]);
                temp += (kernel[(k+1)/Q][(k+1)%Q]) * (input[i - (P/2) + ((k+1)/Q)][j - (Q/2) + ((k+1)%Q)]);
                temp += (kernel[(k+2)/Q][(k+2)%Q]) * (input[i - (P/2) + ((k+2)/Q)][j - (Q/2) + ((k+2)%Q)]);
                temp += (kernel[(k+3)/Q][(k+3)%Q]) * (input[i - (P/2) + ((k+3)/Q)][j - (Q/2) + ((k+3)%Q)]);
}

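You do not need to unroll it yourself: you can force the compiler to unroll, or change the loop structure, with #pragma directives or attributes. This works because of the SLP (superword-level parallelism) approach that compilers use for auto-vectorization, which, interestingly, is based on unrolling!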
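A minimal sketch of the compiler-driven variant, assuming GCC 8 or later, where #pragma GCC unroll is available (the gcc 6.2 from the question does not recognize it and only warns about an unknown pragma). The function name unusual_unrolled is mine, Cdiv and Coffset are left out for brevity, and I have not checked whether a given GCC version then vectorizes the loop, so verify with -fopt-info-vec.

```c
#define N 128
#define M N
#define P 5
#define Q P

unsigned short input[N][M];
unsigned short output[N][M];
unsigned short kernel[P][Q];

/* Same flattened kernel loop as the "unusual" version, but the
   unrolling is requested from the compiler instead of written by hand.
   The compiler generates the remainder iterations itself, so there is
   no out-of-bounds access when P*Q is not a multiple of the factor. */
void unusual_unrolled(void)
{
    for (int i = P / 2; i < N - P / 2; i++) {
        for (int j = Q / 2; j < M - Q / 2; j++) {
            int temp = 0;
            #pragma GCC unroll 4
            for (int k = 0; k < P * Q; k++)
                temp += kernel[k / Q][k % Q] *
                        input[i - P / 2 + k / Q][j - Q / 2 + k % Q];
            output[i][j] = (unsigned short)temp;
        }
    }
}
```

The pragma has to sit directly in front of the loop it applies to. The unroll factor 4 is only an example value.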

score 2 · Accepted Answer

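My guess is that it fails to optimize because of memory alignment issues. You have specified the convolution as 2-byte shorts. Most SSE functions like to work with 128-bit vectors, AVX and AVX2 work with 256-bit vectors, and AVX-512 with 512-bit vectors.

On my machine, I declared conv like this: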

uint16_t conv[Cdim1][8] = {0}; //You need to pad extra fields with zeroes

Then I replaced the inner loops like this:

for(ki = 0; ki < Cdim1; ++ki) 
    for(kj = 0; kj < 8; ++kj)
        temp += (conv[ki][kj]) * (input[i - (Cdim1/2) + ki][j - (Cdim2/2) + kj]);

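Compiling with gcc so.c -Wall -Wextra -Ofast -mtune=native gave me vector optimizations!

Bad things:

Don't use 8. Try to find the minimum required padding and make it a macro, so that it also works with convolution matrices of dimension >= 8.
Pad the input with some zeroes so that the undefined behaviour at the end goes away.
Note that this does not actually increase your performance. In fact, it runs slower!
Note that you can squeeze out a couple of cycles if you modify this further so that the loops run in the order for(ki) for(i) for(j) for(kj). This is probably due to lower register pressure, since each row of conv can stay in a register for longer. It may also be a quirk of my CPU.
You might also want to consider using __attribute__ ((aligned (8))) when declaring variables. In this case it did not change anything, but it is something to keep in mind when optimizing. Naturally, this only works with GCC, and you will need other hacks for MSVC.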
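Putting the points above together, here is a sketch of the padded variant with the for(ki) for(i) for(j) for(kj) loop order. All names (QPAD, in_pad, kern_pad, out_img, acc, conv_padded) are mine, the padding of 8 is just the value from my experiment above, and I have not measured this exact version, so treat it as an illustration rather than a tuned implementation.

```c
#include <string.h>

#define N 128
#define M N
#define P 5
#define Q P
#define QPAD 8                        /* assumed padded kernel row length */

unsigned short in_pad[N][M + QPAD];   /* columns M and up stay zero */
unsigned short out_img[N][M];
unsigned short kern_pad[P][QPAD];     /* columns Q..QPAD-1 stay zero */
static int acc[N][M];                 /* partial sums, one kernel row at a time */

void conv_padded(void)
{
    memset(acc, 0, sizeof acc);

    /* Outermost loop over kernel rows, so one padded kernel row is
       reused for the whole image before moving on to the next. */
    for (int ki = 0; ki < P; ki++)
        for (int i = P / 2; i < N - P / 2; i++)
            for (int j = Q / 2; j < M - Q / 2; j++) {
                int temp = 0;
                /* Fixed trip count of QPAD; the zero padding in
                   kern_pad makes the extra products vanish, and the
                   padding in in_pad keeps the reads in bounds. */
                for (int kj = 0; kj < QPAD; kj++)
                    temp += kern_pad[ki][kj] *
                            in_pad[i - P / 2 + ki][j - Q / 2 + kj];
                acc[i][j] += temp;
            }

    for (int i = P / 2; i < N - P / 2; i++)
        for (int j = Q / 2; j < M - Q / 2; j++)
            out_img[i][j] = (unsigned short)acc[i][j];
}
```

The extra acc buffer is the price of moving ki outward: each pixel's sum is built up over P passes instead of staying in a register.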
