To add to what High Performance Mark said, another source of discrepancy is that the compiler might have emitted x87 FPU instructions to do the math. The x87 unit works with 80-bit internal precision, and optimised serial code would do the whole accumulation in a register, writing the final value to the memory location of X only once at the end. In the parallel case, since X is a shared variable, the memory location is updated at each iteration: the 80-bit x87 register is flushed to a 64-bit memory location and then read back, so some bits of precision are lost on every iteration, and these losses add up to the observed discrepancy.
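To get a feel for how much the precision of the running sum matters, here is a minimal serial sketch that accumulates the same series twice: once in an 80-bit extended kind (standing in for the x87 register) and once in 64-bit double precision (standing in for the memory location of X). The extended kind is requested via selected_real_kind(p=18), which is an assumption that holds for gfortran and ifort on x86 but is not guaranteed by the Fortran standard:

program x87_flush_demo
   implicit none
   ! assumed to map to the 80-bit x87 format on x86; not guaranteed everywhere
   integer, parameter :: ext = selected_real_kind(p=18)
   real(kind=ext) :: acc_reg   ! stands in for a sum kept in an x87 register
   real(kind=8)   :: acc_mem   ! stands in for a sum flushed to 64-bit memory
   real(kind=8)   :: term
   integer :: i, j
   acc_reg = 0.0_ext
   acc_mem = 0.0d0
   Do i=1,3
      Do j=1,10000000
         term = 2.d0*Cos(i*j*1.0d0)
         acc_reg = acc_reg + term   ! rounded to 80 bits each step
         acc_mem = acc_mem + term   ! rounded to 64 bits each step
      end do
   end do
   Write(*,*) 'extended accumulator:', real(acc_reg, 8)
   Write(*,*) '64-bit accumulator:  ', acc_mem
end program x87_flush_demo

The difference between the two printed sums is of the same nature as the serial/parallel discrepancy described above.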
This effect is not present if a modern 64-bit CPU is used together with a compiler that emits SIMD instructions, e.g. SSE2 or AVX. Those instructions work with 64-bit internal precision throughout, so keeping the value in a register yields no more precision than flushing it to memory and reloading it on each iteration. In that case the difference comes from the non-associativity of floating-point addition, as explained by High Performance Mark.
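The non-associativity itself is easy to demonstrate in isolation. The following tiny sketch (the constants are just illustrative) shows that the order in which three double-precision numbers are added changes the result:

program nonassoc_demo
   implicit none
   real(kind=8) :: a, b, c
   a = 1.0d16
   b = -1.0d16
   c = 1.0d0
   Write(*,*) '(a+b)+c =', (a+b)+c   ! prints 1.0: a and b cancel exactly first
   Write(*,*) 'a+(b+c) =', a+(b+c)   ! prints 0.0: c is absorbed when added to b
end program nonassoc_demo

An OpenMP reduction changes the order in which the partial sums are combined, which is why bit-identical results between the serial and the parallel version should not be expected.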
These effects are pretty much expected and are usually accounted for; they are well studied and understood. If your CFD algorithm breaks down when run in parallel, then it is highly numerically unstable and I would in no way trust its results, even in the serial case.
By the way, a better way to implement your loop would be to use reduction:
!$OMP PARALLEL DO PRIVATE(j) REDUCTION(+:X)
Do i=1,3
   Write(*,*) i
   !pause
   Do j=1,10000000
      X = X + 2.d0*Cos(i*j*1.0d0)
   end do
end do
!$OMP END PARALLEL DO
This would allow the compiler to generate register-optimised code for each thread, and the loss of precision would then occur only at the very end, when the threads sum their local partial values to obtain the final value of X.
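For illustration, here is roughly what the reduction expands to when written out by hand. Xlocal is a hypothetical thread-private double-precision helper declared alongside X, and the final combine is done with an atomic update; an actual OpenMP runtime may combine the partial sums differently:

X = 0.d0
!$OMP PARALLEL PRIVATE(i, j, Xlocal)
Xlocal = 0.d0                 ! per-thread partial sum, free to live in a register
!$OMP DO
Do i=1,3
   Do j=1,10000000
      Xlocal = Xlocal + 2.d0*Cos(i*j*1.0d0)
   end do
end do
!$OMP END DO
!$OMP ATOMIC
X = X + Xlocal                ! one rounded update of the shared X per thread
!$OMP END PARALLEL

Only those few final additions touch the shared memory location, so the order-dependent rounding is confined to the combining step.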