
I have a nested loop (L and A are fully defined inputs):

    #pragma omp parallel for schedule(guided) shared(L,A) \
        reduction(+:dummy)
    for (i=k+1;i<row;i++){
        for (n=0;n<k;n++){
            #pragma omp atomic
            dummy += L[i][n]*L[k][n];
            L[i][k] = (A[i][k] - dummy)/L[k][k];
        }
        dummy = 0;
    }

and its sequential version:

    for (i=k+1;i<row;i++){
        for (n=0;n<k;n++){
            dummy += L[i][n]*L[k][n];
            L[i][k] = (A[i][k] - dummy)/L[k][k];
        }
        dummy = 0;
    }

The two give different results, and the parallel version is much slower than the sequential one.

What could be causing the problem?

Edit:

To get rid of the problem caused by the atomic directive, I modified the code as follows:

    #pragma omp parallel for schedule(guided) shared(L,A) \
        private(i)
    for (i=k+1;i<row;i++){
        double dummyy = 0;
        for (n=0;n<k;n++){
            dummyy += L[i][n]*L[k][n];
            L[i][k] = (A[i][k] - dummyy)/L[k][k];
        }
    }

But this did not solve the problem either; the results are still different.


3 Answers


I am not very familiar with OpenMP, but it seems to me that your calculations are not order-independent. Namely, the result of the inner loop is written into L[i][k], where i and k are invariant over the inner loop. This means the same element is overwritten k times during the inner loop, resulting in a race condition.

Moreover, dummy seems to be shared between the different threads, so there might be a race condition there too, unless your pragma parameters somehow prevent it.

Altogether, to me it looks like the calculations in the inner loop must be performed in the same sequential order, if you want the same result as given by the sequential execution. Thus only the outer loop can be parallelized.
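A minimal sketch of that idea (my own illustration, not the asker's code), assuming the loop is meant to compute L[i][k] = (A[i][k] - sum over n<k of L[i][n]*L[k][n]) / L[k][k]: parallelize only the outer loop, keep the accumulation sequential within each thread, and write L[i][k] once per row. The names row, k, L and A are taken from the question.

    /* Sketch: parallelize the outer loop only; each thread owns its
       accumulator and writes each L[i][k] exactly once. */
    #pragma omp parallel for schedule(guided) shared(L, A)
    for (int i = k + 1; i < row; i++) {       /* i is private to each thread */
        double sum = 0.0;                     /* thread-local accumulator */
        for (int n = 0; n < k; n++)           /* sequential within the thread */
            sum += L[i][n] * L[k][n];
        L[i][k] = (A[i][k] - sum) / L[k][k];  /* single write per element */
    }

Because each thread writes a distinct element L[i][k], no atomic or reduction clause is needed in this form.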

answered 2012-04-07T08:14:26

In your parallel version you have inserted an unnecessary (and possibly harmful) atomic directive. Once you have declared dummy to be a reduction variable, OpenMP takes care of keeping the threads from interfering with the reduction. I think the main effect of the unnecessary directive is to slow your code down, a lot.

I see you have another answer addressing why your results are wrong. But I note that you seem to set dummy to 0 at the end of each outer-loop iteration; perhaps you want dummy to be reduced over the inner loop only?

If you are having trouble with reductions, read this.
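For reference, a correct use of reduction looks roughly like the sketch below. This is not the asker's loop, just a minimal dot-product example (it assumes L, k and a row index i exist as in the question): once reduction(+:sum) is declared, each thread accumulates into its own private copy and OpenMP combines the partial sums at the end of the loop, so an additional atomic only adds overhead.

    /* Illustrative reduction: the reduction clause replaces the atomic
       entirely; each thread adds into its private copy of sum. */
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int n = 0; n < k; n++)
        sum += L[i][n] * L[k][n];   /* no atomic needed here */

Whether such a short inner loop is worth parallelizing at all is a separate question; the point is only that an atomic on a reduction variable costs time without buying correctness.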

answered 2012-04-07T08:19:57

The difference in the results comes from the inner loop variable n, which is shared between the threads because it is defined outside the omp pragma.

Clarification: the loop variable n should be declared inside the omp parallel region, since it needs to be thread-specific, e.g. for (int n = 0; ...).
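Applied to the edited code from the question, that suggestion amounts to the sketch below (same structure as the asker's edit, with only the declaration of n moved inside the loop; i stays listed as private, as in the question):

    /* Sketch of the suggested fix: declare n in the inner for statement
       (or add private(n)) so each thread gets its own loop counter. */
    #pragma omp parallel for schedule(guided) shared(L, A) private(i)
    for (i = k + 1; i < row; i++) {
        double dummyy = 0;
        for (int n = 0; n < k; n++) {            /* n is now thread-private */
            dummyy += L[i][n] * L[k][n];
            L[i][k] = (A[i][k] - dummyy) / L[k][k];
        }
    }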

answered 2012-04-07T09:08:21