To add to what High Performance Mark said, another source of discrepancy is that the compiler might have emitted x87 FPU instructions to do the math. The x87 unit works with 80-bit internal precision, and optimised serial code would do the whole accumulation in a register, writing the final value to the memory location of X only once at the end. In the parallel case, since X is a shared variable, the memory location is updated at each iteration: the 80-bit x87 register is flushed to a 64-bit memory location and then read back, so some bits of precision are lost on every iteration, and these losses add up to the observed discrepancy.
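To get a feel for how much the precision of the running sum matters, here is a minimal serial sketch that accumulates the same series twice: once in an 80-bit extended kind (standing in for the x87 register) and once in 64-bit double precision (standing in for the memory location of X). The extended kind is requested via selected_real_kind(p=18), which is an assumption that holds for gfortran and ifort on x86 but is not guaranteed by the Fortran standard:

program x87_flush_demo
   implicit none
   ! assumed to map to the 80-bit x87 format on x86; not guaranteed everywhere
   integer, parameter :: ext = selected_real_kind(p=18)
   real(kind=ext) :: acc_reg   ! stands in for a sum kept in an x87 register
   real(kind=8)   :: acc_mem   ! stands in for a sum flushed to 64-bit memory
   real(kind=8)   :: term
   integer :: i, j
   acc_reg = 0.0_ext
   acc_mem = 0.0d0
   Do i=1,3
      Do j=1,10000000
         term = 2.d0*Cos(i*j*1.0d0)
         acc_reg = acc_reg + term   ! rounded to 80 bits each step
         acc_mem = acc_mem + term   ! rounded to 64 bits each step
      end do
   end do
   Write(*,*) 'extended accumulator:', real(acc_reg, 8)
   Write(*,*) '64-bit accumulator:  ', acc_mem
end program x87_flush_demo

The difference between the two printed sums is of the same nature as the serial/parallel discrepancy described above.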
This effect is not present if a modern 64-bit CPU is used together with a compiler that emits SIMD instructions, e.g. SSE2 or AVX. Those instructions work with 64-bit internal precision throughout, so keeping the value in a register yields no more precision than flushing it to memory and reloading it on each iteration. In that case the difference comes from the non-associativity of floating-point addition, as explained by High Performance Mark.
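The non-associativity itself is easy to demonstrate in isolation. The following tiny sketch (the constants are just illustrative) shows that the order in which three double-precision numbers are added changes the result:

program nonassoc_demo
   implicit none
   real(kind=8) :: a, b, c
   a = 1.0d16
   b = -1.0d16
   c = 1.0d0
   Write(*,*) '(a+b)+c =', (a+b)+c   ! prints 1.0: a and b cancel exactly first
   Write(*,*) 'a+(b+c) =', a+(b+c)   ! prints 0.0: c is absorbed when added to b
end program nonassoc_demo

An OpenMP reduction changes the order in which the partial sums are combined, which is why bit-identical results between the serial and the parallel version should not be expected.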
These effects are pretty much expected and are usually accounted for; they are well studied and understood. If your CFD algorithm breaks down when run in parallel, then it is highly numerically unstable and I would in no way trust its results, even in the serial case.
By the way, a better way to implement your loop would be to use reduction:
!$OMP PARALLEL DO PRIVATE(j) REDUCTION(+:X)
Do i=1,3
   Write(*,*) i
   !pause
   Do j=1,10000000
      X = X + 2.d0*Cos(i*j*1.0d0)
   end do
end do
!$OMP END PARALLEL DO
This would allow the compiler to generate register-optimised code for each thread, and the loss of precision would then occur only at the very end, when the threads sum their local partial values to obtain the final value of X.
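For illustration, here is roughly what the reduction expands to when written out by hand. Xlocal is a hypothetical thread-private double-precision helper declared alongside X, and the final combine is done with an atomic update; an actual OpenMP runtime may combine the partial sums differently:

X = 0.d0
!$OMP PARALLEL PRIVATE(i, j, Xlocal)
Xlocal = 0.d0                 ! per-thread partial sum, free to live in a register
!$OMP DO
Do i=1,3
   Do j=1,10000000
      Xlocal = Xlocal + 2.d0*Cos(i*j*1.0d0)
   end do
end do
!$OMP END DO
!$OMP ATOMIC
X = X + Xlocal                ! one rounded update of the shared X per thread
!$OMP END PARALLEL

Only those few final additions touch the shared memory location, so the order-dependent rounding is confined to the combining step.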