Re: [chrony-users] Possible bug in PPS support

Bill Unruh wrote:
210 Number of sources = 4
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
#* PPS                           0   4   377    24   +218ns[ +278ns] +/- 124ns
^- xxxxxx.xxxx.xxx               1  10   377   877   -147us[ -122us] +/- 11ms
^- xxxxxx.xxxx.xxx               1  10   377    14 +1480us[+1480us] +/- 10ms
^- xxx.xxxxxx.xxxx.xxx           1  10   377   345 +1446us[+1447us] +/- 10ms

However, recently at one site the PPS signal was lost, but chrony keeps "locked" to it:

MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
#* PPS                           0   4     0   13h   -279ns[ -401ns] +/- 79ns
^- xxxxxx.xxxx.xxx               1  10   377   250 +3462us[+3462us] +/- 10ms

As can be seen, it has been lost for 13 hours but it still has the * sign in the 2nd column.
We are remotely monitoring these systems using chronyc tracking and it still indicated stratum 1 referenced to PPS.

I would have expected it to drop back to using those network time servers after some time of not getting pulses
(i.e. once "Reach" is 0) and the stratum to increase to 2.  Had it operated that way, we would have
received an alert.

Furthermore, the clock had drifted by 3.5ms by the time the above status was noticed, while when synchronized
to network time it usually is within 1 to 1.5ms.  So it really is not considering those network time sources anymore.

Not sure what the above paragraph means. How do you know it has drifted by
3.5ms or 1ms? I do not believe those figures, unless you meant 3.5us and
1us, or unless by remote monitoring you mean really, really remote, with a
dodgy network in between.
Look in the above stats: it usually is at about 1.5ms (14xx us) from the network time sources,
and when the error condition occurred, it was at 3462us offset.
There is a network between the source and the system, but it isn't dodgy.

Was this a test, by the way, where you unplugged the GPS from the machine?
Otherwise figuring out why the GPS PPS was lost for that period of time is
probably the first thing to do.
We know what happened: the GPSDO went defective so there were no PPS pulses anymore.
(and also no 10 MHz reference, which we need in another part of the system)

What I would like to see is handling of the error condition.  Of course it is understandable that
there is no time syncing when there are no PPS pulses, but the condition should be visible.
(e.g. by the stratum increasing and/or the source changing)

Miroslav is better placed to figure out what is happening within chrony when
it loses PPS input. Given the uncertainty in the rate as estimated from the
PPS, it is probably still, 13 hrs later, a better estimate of the current time
than is the network time from the other systems.
It isn't!  Network time from the other systems would be about 1500us out, time was now 3400us out.
However, that is not the main point.
Remember that they are at poll 10 which is 1000 seconds or so (about 15 min)
so the network time sources have not had that many "measurements" in that time
interval and those are pretty crappy (10ms std dev which is really huge).
The PPS std dev is in the ns range-- about 10000 times better.
I don't think the shown output in the last column of "chronyc sources" is the stddev.
Right now that column still indicates 10ms, but when I use "chronyc sourcestats" the last
column actually has a header Std Dev and the values are around 40-60us.

So the PPS is
still, even 13 hrs later, a better estimate of the true time than are those
crappy network sources.
The network sources aren't crappy.  There is a systematic offset but the variation is low.
I have no idea what the figure in the last column of sources means, it has no header.




The above situation occurred with chrony 2.1
However, I have reproduced it with an installation updated to version 3.2 although with an "outage" time of 15 minutes.
It had Reach 0 but still was indicating lock to PPS after 869 seconds.

The star means that the PPS is the best indicator of what the true time now
is.
Even when it has not provided information for 13 hours?


Is it to be considered a bug, or is this just a design feature?

It is neither a bug nor a "design feature" (by which I assume you mean it is
not working properly but the designer does not care-- that is how it is often
taken to mean).
Of course it could be that the design has a different objective.
We need the time to be very accurate (preferably within 2us but certainly within 20us)
and it looks like chrony is normally able to achieve that, but a design feature could
be that it is freewheeling on loss of sync rather than indicating an error.
I don't mind that it is freewheeling but I need an indication of that - because I need to
turn off our application as I know it does not take long for time to wander out of the 20us
window.  Assuming 3400us of wander in 13 hours we should not be without sync for
more than 5 minutes without knowing it.
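
The 5-minute figure above can be checked with a few lines of arithmetic. This is only a sketch using the numbers quoted in this message, and it assumes the wander grew linearly over the whole outage, which the thread does not establish. The variable names are illustrative.

```python
# Figures quoted in the message above.
observed_wander_us = 3400.0   # offset accumulated while freewheeling
outage_s = 13 * 3600          # 13 hours without PPS pulses
tolerance_us = 20.0           # hard limit for the application

# Assumption: the wander accumulated at a constant rate during the outage.
drift_us_per_s = observed_wander_us / outage_s   # numerically equal to ppm
time_to_limit_s = tolerance_us / drift_us_per_s

print(f"average drift: {drift_us_per_s:.4f} us/s")
print(f"time to exceed {tolerance_us:.0f} us: "
      f"{time_to_limit_s:.0f} s ({time_to_limit_s / 60:.1f} min)")
# average drift: 0.0726 us/s
# time to exceed 20 us: 275 s (4.6 min)
```

Under that assumption the 20us window is used up after roughly four and a half minutes, consistent with the 5-minute estimate.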

Here it indicates that the PPS is still, 13 hrs later, the
best indication of the offset from UTC. Now, this assumption that it is the
best could be off itself. For example if the time span used by the PPS was
overnight when the machine was cool inside, and during the day the machine is
used a lot and heats up, then the estimate from the PPS rate could well be
off because those kinds of jumps in the rate would not enter into the estimate
of the skew for the PPS. (If the PPS had accumulated 64 samples at 16 sec per
sample, that is only 15 min, so the time span over which the PPS is measuring
the rate and the changes in the rate is quite short and would not capture
large rate deviations which occur with non-gaussian distribution-- like the
heating up every morning.)
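
The time spans mentioned in this thread follow from the Poll column, which chronyc prints as the base-2 logarithm of the polling interval in seconds. A small sketch of that arithmetic, using the sample count and outage length quoted in the messages (the function and variable names are illustrative):

```python
def poll_interval_s(poll):
    """Polling interval in seconds for a Poll value (log2 of seconds)."""
    return 2 ** poll

# 64 PPS samples at poll 4, as supposed in the paragraph above.
pps_window_s = 64 * poll_interval_s(4)

# Whole polls of one network source (poll 10) that fit into the 13 h outage.
net_samples = (13 * 3600) // poll_interval_s(10)

print(pps_window_s, round(pps_window_s / 60, 1))   # 1024 17.1
print(net_samples)                                 # 45
```

So the PPS rate estimate spans about 17 minutes, close to the 15 min quoted, and each network source could have contributed at most 45 measurements during the outage.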
Well, the systems are in a rack in air-conditioned system rooms, so that should not be a big
problem.

How could we work around that in this case?

It is not clear what it is you want to work around. From all the data, the PPS
13 hrs ago is still the best estimate of the UTC. Why would you want chrony to
use a measurably much worse source just because the PPS has not been heard
from for 13 hrs? Eventually the PPS from the remote past is no longer as good
as the relatively really crappy time from the network, but that could take
days.
What I need most is an indication that there is no sync.  And I would expect that
when chrony notices that the offset from the external sources is larger than it usually
is, that it starts tracking those sources instead of running free. But that does not
really matter because the time offset is way out of our tolerances by that time.
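
One possible way to get that indication from outside chrony is to have the monitoring side inspect the "chronyc sources" output itself rather than only the stratum. The following is a rough sketch, not something chrony provides: it assumes the column layout shown in the listings above (Reach printed in octal, LastRx in seconds or with an m/h/d/y suffix), and the function names and the 300-second default threshold are hypothetical choices based on the 5-minute figure mentioned earlier.

```python
import re

# Multipliers for the LastRx suffixes (plain seconds when there is no suffix).
_UNITS = {"": 1, "m": 60, "h": 3600, "d": 86400, "y": 365 * 86400}

def parse_last_rx(field):
    """Convert a LastRx field such as '24' or '13h' to seconds, or None."""
    m = re.fullmatch(r"(\d+)([mhdy]?)", field)
    if not m:
        return None
    return int(m.group(1)) * _UNITS[m.group(2)]

def stale_selected_sources(sources_text, max_age_s=300):
    """Return (name, reach, age_s) for each selected ('*') source that looks dead.

    A selected source is reported when its Reach register is 0 or its last
    sample is older than max_age_s (a hypothetical threshold).
    """
    stale = []
    for line in sources_text.splitlines():
        parts = line.split()
        # Data rows start with a two-character mode/state field, e.g. '#*'.
        if len(parts) < 6 or len(parts[0]) != 2 or parts[0][1] != "*":
            continue
        name, reach_field, last_rx_field = parts[1], parts[4], parts[5]
        reach = int(reach_field, 8)          # Reach is an octal bit register
        age = parse_last_rx(last_rx_field)
        if reach == 0 or (age is not None and age > max_age_s):
            stale.append((name, reach, age))
    return stale
```

Fed the second listing from this thread, the function would report the PPS source with Reach 0 and an age of 46800 seconds; fed the first listing it would return an empty list. The text would come from running "chronyc sources" on the monitored host, and the alerting itself is left to whatever monitoring system is in use.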

Rob
