It is very easy for MPI processes to become desynchronised in time, especially if the algorithms involved in MPI_stuff
are not globally synchronous. With most cluster MPI implementations it is quite typical for processes to be desynchronised from the very beginning, due to differing start-up times and the fact that MPI_Init()
can take a varying amount of time to complete. Another source of desynchronisation is OS noise, i.e. other processes occasionally stealing CPU time from some of the processes in the MPI job.
That's why the correct way to measure the execution time of a parallel algorithm is to put a barrier before and after the measured block:
MPI_Barrier(MPI_COMM_WORLD); // Bring all processes in sync
t = -MPI_Wtime();
MPI_stuff;
MPI_Barrier(MPI_COMM_WORLD); // Wait for all processes to finish processing
t += MPI_Wtime();
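Put together as a minimal complete program, the pattern looks like the sketch below. The `compute()` function is a hypothetical stand-in for MPI_stuff, not something from a real code:

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical placeholder for the measured parallel work (MPI_stuff) */
static void compute(void)
{
    /* ... communication and computation ... */
}

int main(int argc, char **argv)
{
    int rank;
    double t;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);   /* bring all processes in sync */
    t = -MPI_Wtime();              /* start the clock */
    compute();
    MPI_Barrier(MPI_COMM_WORLD);   /* wait for all processes to finish */
    t += MPI_Wtime();              /* stop the clock */

    if (rank == 0)
        printf("Elapsed time: %f s\n", t);

    MPI_Finalize();
    return 0;
}
```

Compile with `mpicc` and launch with `mpiexec`/`mpirun`; the `t = -MPI_Wtime(); ... t += MPI_Wtime();` idiom simply computes the difference of the two timestamps without needing a second variable.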
If the first MPI_Barrier
is missing and MPI_stuff
does not synchronise the different processes, some of them may arrive at the final barrier very early while others arrive very late; the early ones then sit waiting for the late ones, and that waiting time gets counted as part of the measured execution time.
Also note that MPI_Barrier
gives no guarantee that all processes exit the barrier at the same time. It only guarantees that there is a point in time when the execution flow in all processes is inside the MPI_Barrier
call. Everything else is implementation-dependent. On some platforms, notably the IBM Blue Gene, global barriers are implemented using a special interrupt network, and there MPI_Barrier
achieves almost cycle-perfect synchronisation. On clusters, barriers are implemented with message passing, and therefore barrier exit times can vary considerably.
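Because barrier exit times can vary, a common complementary approach (a sketch, not part of the pattern above) is to let each rank time its own portion and then take the maximum over all ranks with a reduction, so the result reflects the slowest process rather than the barrier behaviour. Again, `compute()` is a hypothetical stand-in for the measured work:

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical placeholder for the measured parallel work */
static void compute(void)
{
    /* ... communication and computation ... */
}

int main(int argc, char **argv)
{
    int rank;
    double t, t_max;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);   /* common starting point */
    t = -MPI_Wtime();
    compute();
    t += MPI_Wtime();              /* per-rank elapsed time, no trailing barrier */

    /* the slowest rank determines the run time of the parallel algorithm */
    MPI_Reduce(&t, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Slowest rank took %f s\n", t_max);

    MPI_Finalize();
    return 0;
}
```

This still needs the leading barrier to give all ranks a common starting point, but the trailing barrier is replaced by the MPI_MAX reduction, which is insensitive to how synchronously the processes leave a barrier.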