c - 如何帮助 gcc 向量化 C 代码

Question

我有以下 C 代码。第一部分只是将复数矩阵从标准输入读入称为的矩阵M。有趣的部分是第二部分。

#include <stdio.h>
#include <complex.h>
#include <stdlib.h>
#include <assert.h>
#include <math.h>

int main() {
    int n, m, c, d;
    float re, im;

    scanf("%d %d", &n, &m);
    assert(n==m);
    complex float M[n][n];

    for(c=0; c<n; c++) {
      for(d=0; d<n; d++) {
    scanf("%f%fi", &re, &im);
    M[c][d] = re + im * I;
      }
    }

    for(c=0; c<n; c++) {
      for(d=0; d<n; d++) {
        printf("%.2f%+.2fi ", creal(M[c][d]), cimag(M[c][d]));
      }
      printf("\n");
    }
/*
Example:input   
2 3
1+2i 2+3i 74-4i
3+4i 4+5i -7-8i
*/
    /* Part 2. M is now an n by n matrix of complex numbers */
    int s=1, i, j;
    int *f = malloc(n * sizeof *f);
    complex float *delta = malloc(n * sizeof *delta);
    complex float *v = malloc(n * sizeof *v);
    complex float p = 1, prod;

    for (i = 0; i < n; i++) {
      v[i] = 0;
      for (j = 0; j <n; j++) {
        v[i] += M[j][i];
      }
      p *= v[i];
      f[i] = i;
      delta[i] = 1;
    }
    j = 0;
    while (j < n-1) {
      prod = 1.;
      for (i = 0; i < n; i++) {
        v[i] -= 2.*delta[j]*M[j][i];
        prod *= v[i];
      }
      delta[j] = -delta[j];
      s = -s;            
      p += s*prod;
      f[0] = 0;
      f[j] = f[j+1];
      f[j+1] = j+1;
      j = f[0];
    }
    free(delta);
    free(f);
    free(v);
    printf("%f + i%f\n", creal(p/pow(2.,(n-1))), cimag(p/pow(2.,(n-1))));
    return 0;
}

我用gcc -fopt-info-vec-all -O3 -ffast-math -march=bdver2 permanent-in-c.c -lm. 这向我解释了为什么几乎没有循环被矢量化。

性能最重要的部分是第 47--50 行，它们是：

for (i = 0; i < n; i++) {
    v[i] -= 2.*delta[j]*M[j][i];
    prod *= v[i];
}

gcc 告诉我：

permanent-in-c.c:47:7: note: reduction used in loop.
permanent-in-c.c:47:7: note: Unknown def-use cycle pattern.
permanent-in-c.c:47:7: note: reduction used in loop.
permanent-in-c.c:47:7: note: Unknown def-use cycle pattern.
permanent-in-c.c:47:7: note: Unsupported pattern.
permanent-in-c.c:47:7: note: not vectorized: unsupported use in stmt.
permanent-in-c.c:47:7: note: unexpected pattern.
[...]
permanent-in-c.c:48:26: note: SLP: step doesn't divide the vector-size.
permanent-in-c.c:48:26: note: Unknown alignment for access: IMAGPART_EXPR <*M.4_40[j_202]{lb: 0 sz: pretmp_291 * 4}[i_200]>
permanent-in-c.c:48:26: note: SLP: step doesn't divide the vector-size.
permanent-in-c.c:48:26: note: Unknown alignment for access: REALPART_EXPR <*M.4_40[j_202]{lb: 0 sz: pretmp_291 * 4}[i_200]>
[...]
permanent-in-c.c:48:26: note: Build SLP failed: unrolling required in basic block SLP
permanent-in-c.c:48:26: note: Failed to SLP the basic block.
permanent-in-c.c:48:26: note: not vectorized: failed to find SLP opportunities in basic block.

如何解决阻止这部分被矢量化的问题？

奇怪的是这部分是矢量化的，但我不知道为什么：

for (j = 0; j <n; j++) {
    v[i] += M[j][i];

gcc -fopt-info-vec-all -O3 -ffast-math -march=bdver2 Permanent-in-cc -lm 的完整输出位于https://bpaste.net/show/18ebc3d66a53。

score 6 · Accepted Answer

首先，让我们详细检查代码。我们有

complex float  M[rows][cols];
complex float  v[cols];
float          delta[rows];
complex float  p = 1.0f;
float          s = 1.0f;

虽然delta[]是complex floatOP 代码中的类型，但它只包含-1.0for +1.0f。（此外，如果是-2.0f或+2.0f相反，计算可以被简化。）出于这个原因，我将其定义为真实而不复杂。

同样，OP 定义s为，int但仅有效地使用它（在计算中）。这就是为什么我将它明确定义为.-1.0f+1.0ffloat

我省略了f数组，因为有一种完全避免它的简单方法。

代码中有趣部分的第一个循环，

    for (i = 0; i < n; i++) {
      v[i] = 0;
      for (j = 0; j <n; j++) {
        v[i] += M[j][i];
      }
      p *= v[i];
      delta[i] = 1;
    }

执行几件事。delta[]它将数组中的所有元素初始化为1；它可以（并且可能应该）分成一个单独的循环。

由于外循环在中增加i，p将是中元素的乘积v；它也可以分成一个单独的循环。

因为内循环将列中的所有元素求和i为v[i]，所以外循环和内循环只是将每一行（作为向量）求和为向量v。

因此，我们可以用伪代码将上面的内容重写为

Copy first row of matrix M to vector v
For r = 1 .. rows-1:
    Add complex values in row r of matrix M to vector v

p = product of complex elements in vector v

delta = 1.0f, 1.0f, 1.0f, .., 1.0f, 1.0f

我们接下来看第二个嵌套循环：

    j = 0;
    while (j < n-1) {
      prod = 1.;
      for (i = 0; i < n; i++) {
        v[i] -= 2.*delta[j]*M[j][i];
        prod *= v[i];
      }
      delta[j] = -delta[j];
      s = -s;            
      p += s*prod;
      f[0] = 0;
      f[j] = f[j+1];
      f[j+1] = j+1;
      j = f[0];
    }

除非您在循环进行时检查的值，否则很难看到j，但外循环主体中的最后 4 行在( 0,1,0,2,0,1,0j , 3,0,1,0,2,0,1,0,4,...)。此循环中的迭代次数为 2 ^rows-1 -1。这部分序列是对称的，并且实现了高度为 rows-1 的二叉树：

               4
       3               3
   2       2       2       2     (Read horizontally)
 1   1   1   1   1   1   1   1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

事实证明，如果我们循环遍历i= 1 .. 2 ^rows-1，那么r是中的零低位的数量i。GCC 提供了一个__builtin_ctz()内置函数，可以精确计算这一点。（请注意，这会__builtin_ctz(0)产生一个未定义的值；所以不要这样做，即使它恰好在您的计算机上产生了一个特定的值。）

内部循环从向量中减去j矩阵行上的复数值，按缩放。它还计算向量中的复数项的乘积（减法后）到变量中。2*delta[j]v[]v[]prod

在内循环之后，delta[j]被否定，比例因子也是如此s。变量的值prod，由缩放s，被添加到p。

在循环之后，最终（复杂）结果p除以 2 ^rows-1。最好使用ldexp()C99 函数（分别在实部和复数部分上）来完成。

因此，我们可以用伪代码将第二个嵌套循环重写为

s = 1.0f

For k = 1 .. rows-1, inclusive:
    r = __builtin_ctz(k), i.e. number of least
                          significant bits that
                          are zero in k

    Subtract the complex values on row r of matrix M,
        scaled by delta[r], from vector v[]

    prod = the product of (complex) elements in vector v[]

    Negate scale factor s (changing its sign)

    Add prod times s to result p

以我的经验，最好将实部和虚部分成单独的向量和矩阵。考虑以下定义：

typedef struct {
    float  *real;
    float  *imag;
    size_t  floats_per_row;  /* Often 'stride' */
    size_t  rows;
    size_t  cols;
} complex_float_matrix;

/* Set an array of floats to a specific value */
void float_set(float *, float, size_t);

/* Copy an array of floats */
void float_copy(float *, const float *, size_t);

/* malloc() vector-aligned memory; size in floats */
float *float_malloc(size_t);

/* Elementwise addition of floats */
void float_add(float *, const float *, size_t);

/* Elementwise addition of floats, scaled by a real scale factor */
void float_add_scaled(float *, const float *, float, size_t);

/* Complex float product, separate real and imag arrays */
complex float complex_float_product(const float *, const float *, size_t);

只要float_malloc()产生足够对齐的指针（并且编译器被告知，通过例如 GCC 函数属性__attribute__ ((__assume_aligned__ (BYTES_IN_VECTOR)));），上述所有内容都可以轻松向量化，并且floats_per_row矩阵中的成员是向量中浮点数的倍数。

（我不知道 GCC 是否可以自动向量化上述函数，但我知道它们可以使用 GCC 向量扩展“手动”向量化。）

有了上面，代码的整个有趣部分，在伪 C 中，变成了

complex float permanent(const complex_float_matrix *m)
{
    float         *v_real, *v_imag;
    float         *scale;   /* OP used 'delta' */
    complex float  result;  /* OP used 'p'     */
    complex float  product; /* OP used 'prod'  */
    float          coeff = 1.0f; /* OP used 's' */
    size_t         i = 1 << (m->rows - 1);
    size_t         r;

    if (!m || m->cols < 1 || m->rows < 1 || !i) {
        /* TODO: No input matrix, or too many rows in input matrix */
    }

    v_real = float_malloc(m->cols);
    v_imag = float_malloc(m->cols);
    scale  = float_malloc(m->rows - 1);
    if (!v_real || !v_imag || !scale) {
        free(scale);
        free(v_imag);
        free(v_real);
        /* TODO: Out of memory error */
    }

    float_set(scale, 2.0f, m->rows - 1);

    /* Sum matrix rows to v. */

    float_copy(v_real, m->real, m->cols);
    for (r = 1; r < m->rows; r++)
        float_add(v_real, m->real + r * m->floats_per_row, m->cols);

    float_copy(v_imag, m->imag, m->cols);
    for (r = 1; r < m->rows; r++)
        float_add(v_imag, m->imag + r * m->floats_per_row, m->cols);

    result = complex_float_product(v_real, v_imag, m->cols);

    while (--i) {
        r = __builtin_ctz(i);

        scale[r] = -scale[r];

        float_add_scaled(v_real, m->real + r * m->floats_per_row, m->cols);
        float_add_scaled(v_imag, m->imag + r * m->floats_per_row, m->cols);

        product = complex_float_product(v_real, v_imag, m->cols);

        coeff = -coeff;

        result += coeff * product;
    }

    free(scale);
    free(v_imag);
    free(v_real);

    return result;
}

在这一点上，我会在没有矢量化的情况下亲自实现上述内容，然后对其进行广泛的测试，直到我确定它可以正常工作。

然后，我会检查 GCC 程序集输出 ( -S)，看看它是否可以充分矢量化各个操作（我之前列出的函数）。

使用 GCC 矢量扩展对函数进行手动矢量化非常简单。声明一个浮点向量很简单：

typedef  float  vec2f __attribute__((vector_size (8), aligned (8)));  /* 64 bits; MMX, 3DNow! */
typedef  float  vec4f __attribute__((vector_size (16), aligned (16))); /* 128 bits; SSE */
typedef  float  vec8f __attribute__((vector_size (32), aligned (32))); /* 256 bits; AVX, AVX2 */
typedef  float  vec16f __attribute__((vector_size (64), aligned (64))); /* 512 bits; AVX512F */

每个向量中的各个分量可以使用数组表示法 (v[0]和v[1]for vec2f v;) 来寻址。GCC 可以按元素对整个向量进行基本操作；我们这里只需要加法和乘法。应该避免水平操作（在同一向量中的元素之间应用的操作），而是重新排序元素。

即使在没有这种向量化的架构上，GCC 也会为上述向量大小生成工作代码，但生成的代码可能很慢。（至少 5.4 的 GCC 版本会产生很多不必要的移动，通常是堆叠和返回。）

为向量分配的内存需要充分对齐。malloc()在所有情况下都不能提供足够对齐的内存；你应该posix_memalign()改用。当本地或静态分配一个时，该aligned属性可用于增加 GCC 用于向量类型的对齐方式。在矩阵中，您需要确保行从充分对齐的边界开始；这就是我floats_per_row在结构中有变量的原因。

如果向量（或行）中的元素数量很大，但不是向量中浮点数的倍数，您应该用“惰性”值填充向量 - 不影响结果的值，比如0.0f加减法和1.0f乘法。

至少在 x86 和 x86-64 上，GCC 将只使用指针为循环生成更好的代码。例如，这个

void float_set(float *array, float value, size_t count)
{
    float *const limit = array + count;
    while (array < limit)
        *(array++) = value;
}

产生比

void float_set(float *array, float value, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++)
        array[i] = value;
}

或者

void float_set(float *array, float value, size_t count)
{
    while (count--)
        *(array++) = value;
}

（往往会产生类似的代码）。就个人而言，我将其实现为

void float_set(float *array, float value, size_t count)
{
    if (!((uintptr_t)array & 7) && !(count & 1)) {
        uint64_t *const end = (uint64_t *)__builtin_assume_aligned((void *)(array + count), 8);
        uint64_t       *ptr = (uint64_t *)__builtin_assume_aligned((void *)array, 8);
        uint64_t        val;

        __builtin_memcpy(&val, &value, 4);
        __builtin_memcpy(4 + (char *)&val, &value, 4);

        while (ptr < end)
            *(ptr++) = val;
    } else {
        uint32_t *const end = (uint32_t *)__builtin_assume_aligned((void *)(array + count), 4);
        uint32_t       *ptr = (uint32_t *)__builtin_assume_aligned((void *)array, 4);
        uint32_t        val;

        __builtin_memcpy(&val, &value, 4);

        while (ptr < end)
            *(ptr++) = val;
    }
}

并且float_copy()作为

void float_copy(float *target, const float *source, size_t count)
{
    if (!((uintptr_t)source & 7) &&
        !((uintptr_t)target & 7) && !(count & 1)) {
        uint64_t *const end = (uint64_t *)__builtin_assume_aligned((void *)(array + count), 8);
        uint64_t       *ptr = (uint64_t *)__builtin_assume_aligned((void *)target, 8);
        uint64_t       *src = (uint64_t *)__builtin_assume_aligned((void *)source, 8);

        while (ptr < end)
            *(ptr++) = *(src++);

    } else {
        uint32_t *const end = (uint32_t *)__builtin_assume_aligned((void *)(array + count), 4);
        uint32_t       *ptr = (uint32_t *)__builtin_assume_aligned((void *)array, 4);
        uint32_t       *src = (uint32_t *)__builtin_assume_aligned((void *)source, 4);

        while (ptr < end)
            *(ptr++) = *(src++);
    }
}

或类似的东西。

最难矢量化的是complex_float_product(). 如果您在最终向量中1.0f为实部和0.0f虚部填充未使用的元素，您可以轻松计算每个向量的复数乘积。请记住

( a + b i) × ( c + d i) = ( a c - b d ) + ( a d + b c ) i

这里的难点是有效地计算向量中元素的复数乘积。幸运的是，这部分对整体性能来说根本不是关键（除了非常短的向量或列数很少的矩阵），因此在实践中它应该不重要。

（简而言之，“困难”部分是找到对元素重新排序以最大限度地使用压缩向量乘法的方法，并且不需要这么多的洗牌/移动来减慢它的速度。）

对于该float_add_scaled()函数，您应该创建一个填充了比例因子的向量；类似于以下内容，

void float_add_scaled(float *array, const float *source, float scale, size_t count)
{
    const vec4f  coeff = { scale, scale, scale, scale };
    vec4f       *ptr = (vec4f *)__builtin_assume_aligned((void *)array, 16);
    vec4f *const end = (vec4f *)__builtin_assume_aligned((void *)(array + count), 16);
    const vec4f *src = (vec4f *)__builtin_assume_aligned((void *)source, 16);

    while (ptr < end)
        *(ptr++) += *(src++) * coeff;
}

如果我们忽略对齐和大小检查以及回退实现。

score 4 · Accepted Answer

我想我可能已经想通了。经过大量试验/错误后，很明显 gcc 内置的矢量化优化有点硬编码，它不能正确“理解”复数。我对代码进行了一些更改，并使您的内部性能敏感循环矢量化，由 gcc 输出确认（尽管我不确定所需的结果在计算上是否与您想要的结果等效）。虽然我的理解仅限于您希望代码执行的操作，但发现如果您分别计算 real 和 imag 它将正常工作。看一看：

    float t_r = 0.0, t_im = 0.0; // two new temporaries  
    while (j < n-1) {
        prod = 1.;
        for (i = 0; i < n; i++) {
// fill the temps after subtraction from V to avoid stmt error
            t_r = creal (v[i]) - (2. * creal(delta[j]) * creal (M[j][i]));
            t_im = cimag(v[i]) - (2. * cimag(delta[j]) * cimag (M[j][i])) * I;
            //v[i] = 2.*delta[j]*M[j][i];
            v[i] = t_r + t_im; // sum of real and img
            prod *= v[i];
        }
        delta[j] = -delta[j];
        s = -s;            
        p += s*prod;
        f[0] = 0;
        f[j] = f[j+1];
        f[j+1] = j+1;
        j = f[0];
    }

score -1 · Accepted Answer

优化器日志清楚地表明

访问的未知对齐方式：...

尝试矢量化时

printf("%.2f%+.2fi ", creal(M[c][d]), cimag(M[c][d])); //24
v[i] += M[j][i]; //38
p *= v[i]; //40
v[i] -= 2.*delta[j]*M[j][i]; //48

看起来确实有关联，您需要强制对齐数组M和内存。deltav

GCC 中的自动矢量化

仅处理对齐的内存访问（不要尝试向量化包含未对齐访问的循环）

正如之前的评论中提到的，我建议您posix_memalign为此目的使用。

complex float * restrict delta;
posix_memalign(&delta, 64, n * sizeof *delta); //to adapt

你的目标环境是什么？（操作系统、CPU）

请查看data-alignment-to-assist-vectorization

c - 如何帮助 gcc 向量化 C 代码

3 回答 3

Related

Reference