Re: [chrony-users] Possible bug in PPS support



210 Number of sources = 4
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
#* PPS                           0   4   377    24   +218ns[ +278ns] +/-  124ns
^- xxxxxx.xxxx.xxx               1  10   377   877   -147us[ -122us] +/-   11ms
^- xxxxxx.xxxx.xxx               1  10   377    14  +1480us[+1480us] +/-   10ms
^- xxx.xxxxxx.xxxx.xxx           1  10   377   345  +1446us[+1447us] +/-   10ms

However, recently at one site the PPS signal was lost, but chrony keeps "locked" to it:

MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
#* PPS                           0   4     0   13h   -279ns[ -401ns] +/-   79ns
^- xxxxxx.xxxx.xxx               1  10   377   250  +3462us[+3462us] +/-   10ms

As can be seen, it has been lost for 13 hours but it still has the * sign in the 2nd column. We are remotely monitoring these systems using chronyc tracking and it still indicated stratum 1 referenced to PPS.

I would have expected it to fall back to the network time servers after some time without pulses (i.e. once "Reach" is 0) and the stratum to increase to 2. Had it operated that way, we would have received an alert.
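
For monitoring purposes, the condition described above (source still marked "*" while "Reach" is 0) could be detected outside chrony. The following is only a rough sketch: it parses the human-readable table shown above rather than calling chronyc itself, and the function name and regular expression are made up for illustration. It assumes the Reach column is printed in octal, as in the 377 values above.

```python
import re

# Matches the leading columns of one row of the sources table, e.g.
#   "#* PPS   0   4   0   13h   -279ns[ -401ns] +/- 79ns"
ROW = re.compile(
    r"^(?P<mode>[\^=#])(?P<state>[*+\-?x~ ])\s+(?P<name>\S+)\s+"
    r"(?P<stratum>\d+)\s+(?P<poll>-?\d+)\s+(?P<reach>[0-7]+)\s+(?P<last_rx>\S+)"
)

def stale_selected_sources(sources_text):
    """Return names of sources still selected ('*') whose Reach register is 0."""
    stale = []
    for line in sources_text.splitlines():
        m = ROW.match(line)
        # Reach is an octal bitmask of recent polls; 0 means no recent samples.
        if m and m.group("state") == "*" and int(m.group("reach"), 8) == 0:
            stale.append(m.group("name"))
    return stale

# Sample input copied from the second table in this message.
sample = """\
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
#* PPS                           0   4     0   13h   -279ns[ -401ns] +/-   79ns
^- xxxxxx.xxxx.xxx               1  10   377   250  +3462us[+3462us] +/-   10ms
"""

print(stale_selected_sources(sample))
```

A non-empty result would be the cue to raise an alert, regardless of which source chrony itself still prefers.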

Furthermore, the clock had drifted by 3.5ms by the time the above status was noticed, while when synchronized to network time it usually is within 1 to 1.5ms. So it really is not considering those network time sources anymore.

Not sure what the above paragraph means. How do you know it has drifted by
3.5ms or 1ms? I do not believe those figures, unless you meant 3.5us and
1us, or unless by remote monitoring you mean really, really remote with a
dodgy network in between.
Was this a test, by the way, where you unplugged the GPS from the machine?
Otherwise, figuring out why the GPS PPS was lost for that period of time is
probably the first thing to do.
Miroslav is better placed to figure out what is happening within chrony when
it loses PPS input. Given the uncertainty in the rate as estimated from the
PPS, the PPS from 13 hrs ago is still probably a better estimate of the
current time than is the network time from the other systems. Remember that
they are at poll 10, which is 1024 seconds (about 17 min), so the network
time sources have not had that many "measurements" in that time interval, and
those are pretty crappy (10ms std dev, which is really huge). The PPS std dev
is in the ns range, roughly 100,000 times better. So the PPS is still, even
13 hrs later, a better estimate of the true time than are those crappy
network sources.
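
To put rough numbers on that comparison, here is a small back-of-the-envelope sketch using only the values from the tables quoted above (poll 10, a 13 hour gap, +/- 10ms versus +/- 124ns). The variable names are just for illustration, and it treats the "+/-" column as if it were the std dev, which is a simplification.

```python
POLL_EXPONENT = 10      # "Poll" column of the network sources (log2 seconds)
OUTAGE_HOURS = 13       # "LastRx" of the PPS source

# Polling interval in seconds and number of polls that fit in the outage.
poll_interval_s = 2 ** POLL_EXPONENT
polls_in_outage = OUTAGE_HOURS * 3600 // poll_interval_s

network_err_s = 10e-3   # "+/- 10ms" reported for the network sources
pps_err_s = 124e-9      # "+/- 124ns" reported for the PPS while it was alive

# How many times larger the network error bound is than the PPS one.
ratio = network_err_s / pps_err_s

print(f"poll interval: {poll_interval_s} s")
print(f"network polls during outage: {polls_in_outage}")
print(f"network error / PPS error: about {ratio:,.0f}")
```

So each network source contributed only a few dozen samples over the whole outage, each with an error bound several orders of magnitude larger than the PPS one.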



The above situation occurred with chrony 2.1.
However, I have reproduced it on an installation updated to version 3.2, although with an "outage" time of 15 minutes.
It had Reach 0 but was still indicating lock to PPS after 869 seconds.

The star means that the PPS is the best indicator of what the true time now
is.

Is it to be considered a bug, or is this just a design feature?

It is neither a bug nor a "design feature" (by which I assume you mean it is
not working properly but the designer does not care; that is how the phrase
is often taken). Here it indicates that the PPS is still, 13 hrs later, the
best indication of the offset from UTC. Now, this assumption that it is the
best could itself be off. For example, if the time span used by the PPS was
overnight when the machine was cool inside, and during the day the machine is
used a lot and heats up, then the estimate from the PPS rate could well be
off, because those kinds of jumps in the rate would not enter into the
estimate of the skew for the PPS. (If the PPS had accumulated 64 samples at
16 sec per sample, that is only about 17 min, so the time span over which
the PPS is measuring the rate and the changes in the rate is quite short and
would not capture large rate deviations that occur with a non-Gaussian
distribution, like the heating up every morning.)

How could we work around that in this case?

It is not clear what it is you want to work around. From all the data, the PPS
from 13 hrs ago is still the best estimate of UTC. Why would you want chrony
to use a measurably much worse source just because the PPS has not been heard
from for 13 hrs? Eventually the PPS from the remote past is no longer as good
as the relatively really crappy time from the network, but that could take
days.
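
As a rough illustration of how long "eventually" might be, the sketch below takes the reported 3.5ms drift over 13 hrs at face value (which may not be justified, given the 10ms uncertainty of the network measurement it came from) and assumes the clock keeps drifting at a constant rate. Both are assumptions made only for this example, not anything chrony computes this way.

```python
observed_drift_s = 3.5e-3    # drift reported by the original poster
outage_s = 13 * 3600         # time since the last PPS sample
network_err_s = 10e-3        # "+/- 10ms" of the network sources

# Rate error implied by the reported drift (seconds of error per second).
implied_rate_error = observed_drift_s / outage_s

# Time until the extrapolated error would reach the network error bound,
# assuming the rate error stays constant (a strong simplification).
crossover_s = network_err_s / implied_rate_error

print(f"implied rate error: {implied_rate_error * 1e6:.3f} ppm")
print(f"error reaches 10ms after about {crossover_s / 3600:.1f} h")
```

Under these assumptions the crossover lands somewhere past a day and a half; a smaller real rate error would push it out to several days.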



Rob

--
To unsubscribe email chrony-users-request@xxxxxxxxxxxxxxxxxxxx with "unsubscribe" in the subject. For help email chrony-users-request@xxxxxxxxxxxxxxxxxxxx with "help" in the subject.
Trouble?  Email listmaster@xxxxxxxxxxxxxxxxxxxx.

