
I'm running into a precision issue with MPI_REDUCE() in Fortran. I've tested two ways of summing double-precision values stored on each node. The MPI_REDUCE() line I use is

call MPI_REDUCE(num,total,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,MPI_COMM_WORLD,ierr)

which sums the "num" values from every core and places the result in "total" on the root core.

The other method I use involves sends and receives:

if (rank .eq. 0) total = num
do i = 1,nproc-1
    if (rank .eq. i) call MPI_SEND(num,1,MPI_DOUBLE_PRECISION,0,&
                                  100,MPI_COMM_WORLD,ierr)
    if (rank .eq. 0) then
        call MPI_RECV(num,1,MPI_DOUBLE_PRECISION,i,&
                      100,MPI_COMM_WORLD,stat,ierr)
        total = total + num
    end if
end do

The latter always gives me the same total, while the former produces values that differ depending on how many processors I use (typically by around 1x10^-5). ierr is 0 in every case. Am I doing something wrong?

Thanks


1 Answer


Floating-point arithmetic is not strictly associative; the order in which operations are performed can affect the result. While

(a+b)+c == a+(b+c)

is true for real numbers (in the mathematical sense, not the Fortran sense), it is not universally true for floating-point numbers. It is not surprising, therefore, that the built-in reduction produces a result that differs from your own hand-rolled reduction. As you vary the number of processors, you have no control over the order of the individual additions; even on a fixed number of processors I wouldn't be surprised to see a small difference between results of different runs of the same program. In contrast, your own reduction always performs the additions in the same order.
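
To make this concrete, here is a minimal Fortran sketch (the values are illustrative, chosen only to expose rounding, and do not come from the question) showing that regrouping the same three additions changes the floating-point result:

program fp_assoc
    implicit none
    double precision :: a, b, c
    ! A large term and its negation, plus a small term: whether the small
    ! term survives depends entirely on the order of the additions.
    a = 1.0d16
    b = -1.0d16
    c = 1.0d0
    print *, '(a+b)+c = ', (a + b) + c   ! 1.0: a and b cancel exactly first
    print *, 'a+(b+c) = ', a + (b + c)   ! 0.0: b+c rounds back to -1.0d16
end program fp_assoc

With MPI_SUM the library is free to combine the partial sums in whatever order it likes (typically a tree whose shape depends on the process count), so changing the number of processors can regroup the additions exactly as above.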

What is the relative error of the two results? The figure of 10^(-5) tells us only the absolute error, which is not enough to conclude that the discrepancy can be explained entirely by the non-associativity of floating-point arithmetic.
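
As a rough check you could compute the relative error yourself; the sketch below uses made-up totals purely to show the arithmetic:

program rel_err
    implicit none
    double precision :: total_reduce, total_manual, abs_err, rel_err
    ! Hypothetical stand-ins for the two totals from the question.
    total_reduce = 1234567.89012d0
    total_manual = 1234567.89013d0
    abs_err = abs(total_reduce - total_manual)
    rel_err = abs_err / abs(total_manual)
    print *, 'absolute error = ', abs_err
    print *, 'relative error = ', rel_err
end program rel_err

Double-precision machine epsilon is about 2.2x10^-16, so a relative error within a few orders of magnitude of that (allowing for accumulation over many additions) would be consistent with rounding alone; a much larger relative error would suggest another cause.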

answered 2013-10-15T21:40:58.003