I have processes running on 10 computers which I expect to run in unison within about 30 ms.
To do that, I (try to) synchronize the computers by sending a tick message to all of them. Each process then waits until a specific point in time, the next 30 ms mark (that is, a given second S + N × 30 ms).
For the sleep, I use the following call:
clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &f_deadline, nullptr);
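To make the S + N × 30 ms computation concrete, here is a minimal sketch of how such a deadline could be derived. The names `next_mark` and `sleep_until_next_mark` are hypothetical and only for illustration, not my actual code. The sketch assumes the current time is never earlier than the reference second S.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

constexpr std::int64_t NS_PER_SEC = 1000000000LL;
constexpr std::int64_t PERIOD_NS  = 30000000LL;   // 30 ms

// Hypothetical helper: return the first S + N * 30 ms mark strictly after `now`.
// Assumes `now` is not earlier than the reference second `base_sec` (S).
timespec next_mark(time_t base_sec, timespec const & now)
{
    // Nanoseconds elapsed since the reference second S.
    std::int64_t const elapsed =
        (static_cast<std::int64_t>(now.tv_sec) - static_cast<std::int64_t>(base_sec)) * NS_PER_SEC
        + now.tv_nsec;
    assert(elapsed >= 0);

    // Whole periods already elapsed, plus one, gives the next mark (N).
    std::int64_t const n = elapsed / PERIOD_NS + 1;
    std::int64_t const offset = n * PERIOD_NS;

    timespec deadline{};
    deadline.tv_sec  = base_sec + static_cast<time_t>(offset / NS_PER_SEC);
    deadline.tv_nsec = static_cast<long>(offset % NS_PER_SEC);
    return deadline;
}

// Sleep on the realtime clock until the next mark.
// Returns 0 on success or an error number, as clock_nanosleep() does.
int sleep_until_next_mark(time_t base_sec)
{
    timespec now{};
    clock_gettime(CLOCK_REALTIME, &now);
    timespec const f_deadline = next_mark(base_sec, now);
    return clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &f_deadline, nullptr);
}
```

Since one second is not a multiple of 30 ms, the marks depend on S, so this only lines up across machines if they all agree on the same S.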
To determine S, however, I compare the timestamp carried in the tick message mentioned above against the current time. To do so, I use the time() function like so:
if(tick.timestamp == time(nullptr))
{
    return tick;
}
Out of the 10 computers, there are nearly always 1 or 2 that are visibly off by much more than 30 ms. I would expect an error of ±30 ms (when playing a video, this is equivalent to 1 frame at 30 fps, i.e. one frame early or one frame late instead of being right on time). As it stands, I have computers that are nearly 1 second off even though I'm supposed to re-align their timing every 30 ms.
Note: I know that the software is not running so slowly that it can't do everything it needs to do in 30ms; I have counters which I display and the count is correct (i.e. 31 hits per second).
Could there be a bug in the vDSO implementation of time(2), compared to other time functions such as clock_gettime(2), that would give me a discrepancy of up to 1 second instead of a few ms?
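One way to check this would be a small test like the following. It is only a sketch: `count_second_mismatches` is a hypothetical helper, and the idea that time() might be served from a coarser, less frequently updated clock than clock_gettime(CLOCK_REALTIME) is an assumption to be verified, not something I have confirmed. It counts how often time() reports an earlier second than a clock_gettime() call made just before it.

```cpp
#include <cstdio>
#include <time.h>

struct mismatch_result
{
    long samples;      // number of comparisons made
    long mismatches;   // times time() was behind clock_gettime()
};

// Hypothetical diagnostic: call clock_gettime() first, then time().
// If both read the same clock, time() can never report an earlier second.
mismatch_result count_second_mismatches(long samples)
{
    mismatch_result r{samples, 0};
    for(long i = 0; i < samples; ++i)
    {
        timespec precise{};
        clock_gettime(CLOCK_REALTIME, &precise);
        time_t const coarse = time(nullptr);
        if(coarse < precise.tv_sec)
        {
            ++r.mismatches;
        }
    }
    return r;
}

// Print the counts; intended to be called from a small test program.
void report_second_mismatches(long samples)
{
    mismatch_result const r = count_second_mismatches(samples);
    std::printf("%ld of %ld samples had time() behind clock_gettime()\n",
                r.mismatches, r.samples);
}
```

A non-zero count would mean a tick received just after a second boundary can be compared against the previous second, which would put S off by a full second on that machine.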
P.S.: I'm running on NVIDIA Jetson Xavier computers, which are ARM64 based, so it is that specific implementation that matters here, not the Intel-based Linux kernel.
Extra note: I have ntpd running, and the time difference between all the computers is usually under 0.3 ms:
$ ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 10.0.2.1        .POOL.          16 p    -   64    0    0.000    0.000   0.000
*10.0.2.1        104.194.8.227    3 u  114 1024  377    0.217   -1.418   1.342
The delay, 0.217, represents the maximum discrepancy between those machines (one of which is the master clock, a.k.a. 10.0.2.1). The only pool command I put on the slave machines is:
pool 10.0.2.1 iburst
to actually make sure all the computers use the same clock.