I have processes running on 10 computers which I expect to run in unison within about 30 ms.

To achieve that, I (try to) synchronize the computers by sending a tick message to all of them. Each computer then waits until a specific absolute time, which is the next 30 ms mark (that is, a given second S + N × 30 ms).

For the sleep, I use the following command:

clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &f_deadline, nullptr);
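To make the deadline computation concrete, here is a minimal sketch of how `f_deadline` could be derived, assuming the marks are counted from the reference second S and keep running past the end of that second. The names `next_deadline`, `wait_for_next_mark` and `PERIOD_NS` are hypothetical and not taken from my actual code:

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <ctime>
#include <time.h>

// Assumed frame period: 30 ms expressed in nanoseconds.
constexpr std::int64_t PERIOD_NS = 30'000'000;
constexpr std::int64_t NS_PER_SEC = 1'000'000'000;

// Return the first instant S + N * 30 ms that is strictly after `now`,
// where S (reference_second) is the second agreed on through the tick.
timespec next_deadline(std::time_t reference_second, timespec const & now)
{
    std::int64_t const elapsed_ns =
        static_cast<std::int64_t>(now.tv_sec - reference_second) * NS_PER_SEC
        + now.tv_nsec;
    assert(elapsed_ns >= 0);  // `now` must not be earlier than S

    // Round up to the next multiple of the period.
    std::int64_t const next_ns = (elapsed_ns / PERIOD_NS + 1) * PERIOD_NS;

    timespec deadline{};
    deadline.tv_sec = reference_second
                    + static_cast<std::time_t>(next_ns / NS_PER_SEC);
    deadline.tv_nsec = static_cast<long>(next_ns % NS_PER_SEC);
    return deadline;
}

// Sleep until the next 30 ms mark.
// Returns 0 on success or an error number (e.g. EINTR if interrupted).
int wait_for_next_mark(std::time_t reference_second)
{
    timespec now{};
    if(clock_gettime(CLOCK_REALTIME, &now) != 0)
    {
        return errno;
    }
    timespec const deadline = next_deadline(reference_second, now);
    // clock_nanosleep() returns the error number directly, not -1.
    return clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &deadline, nullptr);
}
```

With this layout, an error in S shifts every mark by the same amount, so the accuracy of the whole scheme depends on how S is obtained.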

For the synchronization, however, I determine S from the tick message mentioned above by comparing a timestamp carried in the tick against the current time. To do so, I use the time() function like so:

if(tick.timestamp == time(nullptr))
{
    return tick;
}

Out of the 10 computers, there are nearly always 1 or 2 which are visually off by much more than 30 ms. I would expect an error of ±30 ms (while playing a video, this is equivalent to 1 frame at 30 fps, i.e. one frame early or one frame late instead of being right on time). As it stands, I have computers that are nearly 1 second off even though I'm supposed to re-align their timing every 30 ms.

Note: I know that the software is not running so slowly that it can't do everything it needs to do in 30ms; I have counters which I display and the count is correct (i.e. 31 hits per second).

Could it be that there is a bug in the vDSO implementation of time(2), compared to other time functions such as clock_gettime(2), which could give me a discrepancy of up to 1 second instead of a few ms?
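One way to check this on the affected machines is to sample both functions back to back and count how often time() reports an earlier second than clock_gettime() did just before it. This is only a diagnostic sketch; the helper name `count_lagging_seconds` is hypothetical, and it makes no assumption about how time() is implemented on this kernel:

```cpp
#include <cstdio>
#include <ctime>
#include <time.h>

// Sample clock_gettime(CLOCK_REALTIME) followed by time() `samples` times
// and return how many times time() reported an EARLIER second.
// Because time() is called second, a smaller value means it lags behind.
int count_lagging_seconds(int samples)
{
    int lagging = 0;
    for(int i = 0; i < samples; ++i)
    {
        timespec precise{};
        if(clock_gettime(CLOCK_REALTIME, &precise) != 0)
        {
            continue;  // skip the sample if the clock could not be read
        }
        std::time_t const coarse = time(nullptr);
        if(coarse < precise.tv_sec)
        {
            ++lagging;
            std::printf("time() = %lld, clock_gettime() = %lld.%09ld\n",
                        static_cast<long long>(coarse),
                        static_cast<long long>(precise.tv_sec),
                        precise.tv_nsec);
        }
    }
    return lagging;
}
```

Running it in a tight loop for a few seconds, so that several second boundaries are crossed, should show whether the two functions ever disagree and how far into the new second the disagreement lasts.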

P.S.: I'm running on NVIDIA Jetson Xavier computers based on ARM64, so it is that specific implementation that would be concerned, not the Intel-based Linux kernel.

Extra Note: I have ntpd running and the difference in time between all the computers is usually under 0.3 ms:

$ ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 10.0.2.1        .POOL.          16 p    -   64    0    0.000    0.000   0.000
*10.0.2.1        104.194.8.227    3 u  114 1024  377    0.217   -1.418   1.342

The delay, 0.217, represents the maximum discrepancy between those machines (one of which is the master clock, a.k.a. 10.0.2.1). The only pool command I put in the slave machines is:

pool 10.0.2.1 iburst

to actually make sure all the computers use the same clock.
