

        num2 = _mm_set_pd(Phasor.imaginary, Phasor.real);

        /* SamplesIneachPeriodCeil[iterationIndex] is around 175000 */
        for(int k=0; k<SamplesIneachPeriodCeil[iterationIndex]; k++)
        {
            num1 = _mm_loaddup_pd(&OutSymbol[k].real);
            num3 = _mm_mul_pd(num2, num1);
            num1 = _mm_loaddup_pd(&OutSymbol[k].imaginary);
            num2 = _mm_shuffle_pd(num2, num2, 1);
            num4 = _mm_mul_pd(num2, num1);
            num3 = _mm_addsub_pd(num3, num4);
            num2 = _mm_shuffle_pd(num2, num2, 1);
            num5 = _mm_set_pd(InSymbolInt8[k],InSymbolInt8[k] );
            num6 = _mm_mul_pd(num3, num5);
            num7 = _mm_set_pd(Out[k].imaginary,Out[k].real);
            num8 = _mm_add_pd(num7,num6);
            _mm_storeu_pd((double *)&Out[k], num8);
        }

        Out = Out + SamplesIneachPeriodCeil[iterationIndex];

This code runs in around 15 ms.

When I modify the code to use OpenMP as follows:


    int size = SamplesIneachPeriodCeil[iterationIndex];

#pragma omp parallel num_threads(2) shared(size)
    {
        int start,end,tindex,tno,no_of_iteration;
        tindex = omp_get_thread_num();
        tno = omp_get_num_threads();
        start = tindex * size / tno;
        end = (1+ tindex)* size / tno ;
        num2 = _mm_set_pd(Phasor.imaginary, Phasor.real);
        int k;
        for(k = start ; k < end; k++){

            num1 = _mm_loaddup_pd(&OutSymbol[k].real);
            num3 = _mm_mul_pd(num2, num1);
            num1 = _mm_loaddup_pd(&OutSymbol[k].imaginary);
            num2 = _mm_shuffle_pd(num2, num2, 1);
            num4 = _mm_mul_pd(num2, num1);
            num3 = _mm_addsub_pd(num3, num4);
            //_mm_storeu_pd((double *)&newSymbol, num3);
            num2 = _mm_shuffle_pd(num2, num2, 1);
            num5 = _mm_set_pd(InSymbolInt8[k],InSymbolInt8[k] );
            num6 = _mm_mul_pd(num3, num5);
            num7 = _mm_set_pd(Out[k].imaginary,Out[k].real);
            num8 = _mm_add_pd(num7,num6);
            _mm_storeu_pd((double *)&Out[k], num8);
        }
    }

    Out = Out + size;

This code takes 30 ms.



2 Answers


You are doing nothing to distribute the execution of the loop between the two threads. You are just creating a parallel region with two threads and those threads execute exactly the same code. What you might want to do is to move the parallel region to only encompass the for loop and use the work-sharing construct:

    int k;
    #pragma omp parallel for num_threads(2) ...
    for(k = start ; k < end; k++){

Thanks to Tudor for the correction: the paragraph above is wrong. Your code is correctly parallelised, since each thread computes its own `start` and `end`, but you have a parallel region inside a loop. Entering and exiting a parallel region carries some overhead. This is usually described as the "fork/join model": a team of threads is created on entering the region, and all threads are joined to the master on exiting it. Most OpenMP runtimes use thread pooling to reduce that overhead, but it is still there.

Your loop runs for only 15 milliseconds, which is short enough for the OpenMP overhead to become visible. Consider moving the parallel region out so that it encloses the outer loop: the overhead should then drop by a factor of up to 20 (depending on how often the else branch is taken), although you still might not see an improvement in the computation time.
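As a rough sketch of moving the parallel region over the outer loop: the team of threads is created once, and each pass of the outer loop only shares out the inner loop with `omp for`. The outer loop is not shown in the question, so its shape here is an assumption, and the names `cplx`, `accumulate_periods`, `sym`, `gain`, `counts` and `periods` are hypothetical. Scalar arithmetic stands in for the SSE3 intrinsics to keep the sketch portable.

```c
typedef struct { double real, imaginary; } cplx;  /* assumed layout */

/* Hypothetical stand-in for the question's outer loop: for each period p,
 * accumulate phasor * sym[k] * gain[k] into the next counts[p] elements of out. */
void accumulate_periods(cplx *out, const cplx *sym, const signed char *gain,
                        const int *counts, int periods, cplx phasor)
{
    /* One fork/join for the whole computation instead of one per period. */
    #pragma omp parallel num_threads(2)
    {
        cplx *dst = out;  /* declared in the region: each thread has its own copy */
        for (int p = 0; p < periods; p++) {
            int n = counts[p];
            /* Work-sharing only: the existing team splits the iterations. */
            #pragma omp for
            for (int k = 0; k < n; k++) {
                double re = phasor.real * sym[k].real
                          - phasor.imaginary * sym[k].imaginary;
                double im = phasor.real * sym[k].imaginary
                          + phasor.imaginary * sym[k].real;
                dst[k].real      += re * gain[k];
                dst[k].imaginary += im * gain[k];
            }
            dst += n;  /* mirrors Out = Out + size; every thread advances identically */
        }
    }
}
```

Compiled without OpenMP support, the pragmas are ignored and the function runs serially with the same result, which makes it easy to compare timings.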

Parallelisation is only applicable to programs where the problem is large enough that the communication or synchronisation overhead is negligible, or at least small, in comparison to the computation time.
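One way to act on this, sketched here under assumptions, is the OpenMP `if` clause, which makes the region execute on a single thread when its condition is false. The function name `scale_add` and the cutoff of 100000 elements are hypothetical; a sensible threshold has to be measured on the target machine.

```c
/* Hypothetical cutoff below which threading is assumed not to pay off. */
#define PARALLEL_THRESHOLD 100000

/* y[k] += a * x[k], run by several threads only when n is large. */
void scale_add(double *y, const double *x, double a, int n)
{
    /* When the condition is false the loop is executed by one thread,
     * so small inputs avoid most of the threading overhead. */
    #pragma omp parallel for if(n > PARALLEL_THRESHOLD)
    for (int k = 0; k < n; k++) {
        y[k] += a * x[k];
    }
}
```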

answered 2012-05-24T07:53:16.727

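You should start the parallel region outside the outer loop (over i), and parallelise the for loop over `k` using `omp for`. All variables used inside the loop (`num1`, `num2`, ...) are best declared only inside it, so that they are automatically `private` (actually, most of them can be reused, but the compiler should figure that out for us anyway).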
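A minimal sketch of the second point, for a single pass of the inner loop: the temporaries are declared inside the loop body, so each thread gets its own copies without a `private` clause. In the question, `num1` to `num8` appear to be declared outside the parallel region, which makes them shared between the threads. The `default(none)` clause is an addition here, not something the answer mentions; it makes the compiler reject any variable whose sharing was not stated. The names `cplx`, `rotate_accumulate`, `sym` and `gain` are hypothetical, and scalar arithmetic replaces the SSE3 intrinsics for portability.

```c
typedef struct { double real, imaginary; } cplx;  /* assumed layout */

/* out[k] += phasor * sym[k] * gain[k] for k in [0, n). */
void rotate_accumulate(cplx *out, const cplx *sym, const signed char *gain,
                       int n, cplx phasor)
{
    /* default(none): every variable from the enclosing scope must be listed. */
    #pragma omp parallel for default(none) shared(out, sym, gain, n, phasor)
    for (int k = 0; k < n; k++) {
        /* Declared inside the loop: automatic storage, hence private per thread. */
        double re = phasor.real * sym[k].real
                  - phasor.imaginary * sym[k].imaginary;
        double im = phasor.real * sym[k].imaginary
                  + phasor.imaginary * sym[k].real;
        out[k].real      += re * gain[k];
        out[k].imaginary += im * gain[k];
    }
}
```

The loop variable `k` is private automatically because it is the iteration variable of the work-shared loop.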

answered 2012-05-28T11:53:32.127