parallel-processing - PGI compiler parallelization of +=

Question

I am working on parallelizing my vector and matrix classes and have run into a problem. Any time I have a loop of the form

for (int i = 0; i < n; i++) b[i] += a[i] ;

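the compiler reports a data dependency and will not parallelize it. The Intel compiler is smart enough to handle this without any pragmas. I would like to avoid a pragma that disables dependency checking, because there are a great many loops like this one, the real cases are more complicated than this, and I want the compiler to keep checking in case a dependency really does exist.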

Does anyone know of a compiler flag for the PGI compiler that would allow this?

Thanks,

Justin

Edit: fixed an error in the for loop. I had not copy-pasted the actual loop.

score 1 · Accepted Answer

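I think the problem is that you are not using the restrict keyword in these routines, so the C compiler has to worry about pointer aliasing: as far as it can tell, a and b might overlap.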
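To make the aliasing concern concrete, here is a minimal sketch that is not part of the original answer. The helper names (add_up, add_down, order_matters_when_aliased) are made up for illustration. It runs the same updates in two different orders on overlapping storage, which is roughly the freedom a parallel schedule would take, and shows that the results differ. Without restrict the compiler cannot rule this case out.

```c
#include <assert.h>
#include <string.h>

/* The loop from the question, visiting elements in increasing order. */
static void add_up(double *b, const double *a, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i++) b[i] += a[i];
}

/* The same updates in decreasing order. A parallel schedule is free to
   reorder iterations in ways like this. */
static void add_down(double *b, const double *a, int n) {
    assert(n >= 0);
    for (int i = n - 1; i >= 0; i--) b[i] += a[i];
}

/* Hypothetical helper: returns 1 if the two orders disagree when b
   overlaps a (here b == a + 1), and 0 otherwise. */
int order_matters_when_aliased(void) {
    double x[5] = {1, 1, 1, 1, 1};
    double y[5] = {1, 1, 1, 1, 1};

    add_up(x + 1, x, 4);    /* each step reads the value just written: 1 2 3 4 5 */
    add_down(y + 1, y, 4);  /* each step reads an untouched value:     1 2 2 2 2 */

    return memcmp(x, y, sizeof x) != 0;
}
```

With two separate arrays both orders give the same answer, which is exactly what restrict lets the programmer promise.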

Compile this program:

#include <stdlib.h>
#include <stdio.h>

void dbpa(double *b, double *a, const int n) {
    for (int i = 0; i < n; i++) b[i] += a[i] ;

    return;
}

void dbpa_restrict(double *restrict b, double *restrict a, const int n) {
    for (int i = 0; i < n; i++) b[i] += a[i] ;

    return;
}

int main(int argc, char **argv) {
    const int n=10000;
    double *a = malloc(n*sizeof(double));
    double *b = malloc(n*sizeof(double));

    for (int i=0; i<n; i++) {
        a[i] = 1;
        b[i] = 2;
    }

    dbpa(b, a, n);
    double error = 0.;
    for (int i=0; i<n; i++)
        error += (3 - b[i]);

    if (error < 0.1)
        printf("Success\n");

    dbpa_restrict(b, a, n);
    error = 0.;
    for (int i=0; i<n; i++)
        error += (4 - b[i]);

    if (error < 0.1)
        printf("Success\n");

    free(b);
    free(a);
    return 0;
}

With the PGI compiler:

$ pgcc  -o tryautop tryautop.c -Mconcur -Mvect -Minfo
dbpa:
      5, Loop not vectorized: data dependency
dbpa_restrict:
     11, Parallel code generated with block distribution for inner loop if trip count is greater than or equal to 100
main:
     21, Loop not vectorized: data dependency
     28, Loop not parallelized: may not be beneficial
     36, Loop not parallelized: may not be beneficial

This tells us that the dbpa() routine, which lacks the restrict keyword, was not parallelized, while the dbpa_restrict() routine was.

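Really, though, for this sort of thing you are better off just using OpenMP (or TBB or ABB or ...) rather than trying to persuade the compiler to auto-parallelize for you. Better still might be to use an existing linear algebra package, dense or sparse, depending on what you are doing.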
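As a rough sketch of the OpenMP route mentioned above (the answer only names OpenMP; the function name dbpa_omp and the exact form below are assumptions for illustration), the same loop could request parallelism explicitly instead of relying on auto-parallelization:

```c
#include <assert.h>

/* b[i] += a[i] with the parallelism requested explicitly.
   restrict records the assumption that a and b never overlap;
   passing overlapping arrays here would be a caller error. */
void dbpa_omp(double *restrict b, const double *restrict a, int n) {
    assert(n >= 0);
    /* Each iteration touches a distinct b[i], so iterations are independent.
       Without OpenMP enabled, the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        b[i] += a[i];
}
```

OpenMP has to be enabled at compile time, for example with -mp for pgcc or -fopenmp for gcc. Note that the pragma is a promise by the programmer: unlike auto-parallelization, the compiler does not check it for dependencies.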
