Re: [chrony-users] Possible bug in PPS support
William G. Unruh __| Canadian Institute for|____ Tel: +1(604)822-3273
Physics&Astronomy _|___ Advanced Research _|____ Fax: +1(604)822-5324
UBC, Vancouver,BC _|_ Program in Cosmology |____ unruh@xxxxxxxxxxxxxx
Canada V6T 1Z1 ____|____ and Gravity ______|_ www.theory.physics.ubc.ca/
On Mon, 23 Oct 2017, Rob Janssen wrote:
Bill Unruh wrote:
210 Number of sources = 4
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
#* PPS                           0   4   377    24   +218ns[ +278ns] +/-  124ns
^- xxxxxx.xxxx.xxx               1  10   377   877   -147us[ -122us] +/-   11ms
^- xxxxxx.xxxx.xxx               1  10   377    14  +1480us[+1480us] +/-   10ms
^- xxx.xxxxxx.xxxx.xxx           1  10   377   345  +1446us[+1447us] +/-   10ms
However, recently at one site the PPS signal was lost, but chrony keeps
"locked" to it:
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
#* PPS                           0   4     0   13h   -279ns[ -401ns] +/-   79ns
^- xxxxxx.xxxx.xxx               1  10   377   250  +3462us[+3462us] +/-   10ms
As can be seen, it has been lost for 13 hours but it still has the * sign
in the 2nd column.
We are remotely monitoring these systems using chronyc tracking and it
still indicated stratum 1 referenced to PPS.
I would have expected it to drop back to using those network time servers
after some time of not getting pulses
(i.e. once "Reach" is 0) and the stratum to increase to 2. Had it operated
that way, we would have received an alert.
Furthermore, the clock had drifted by 3.5ms by the time the above status
was noticed, while when synchronized
to network time it usually is within 1 to 1.5ms. So it really is not
considering those network time sources anymore.
Not sure what the above paragraph means. How do you know it has drifted by
3.5ms or 1 ms? I do not believe those figures, unless you meant 3.5us and
1usec. Or perhaps by remote monitoring you mean really, really remote, with a
dodgy network in between?
Look in the above stats: it usually is at about 1.5ms (14xx us) from the
network time sources,
and when the error condition occurred, it was at 3462us offset.
There is a network between the source and the system, but it isn't dodgy.
Yes, it is. Note that it is saying that the standard deviation is 10ms. That
one particular measurement was only off by 1.5ms does not tell one anything.
The standard deviation tells much more.
And if it is off by 1.5 ms, that is still 10000 times worse than the PPS.
Was this a test, by the way, where you unplugged the GPS from the machine?
Otherwise figuring out why the GPS PPS was lost for that period of time is
probably the first thing to do.
We know what happened: the GPSDO went defective so there were no PPS pulses
anymore.
(and also no 10 MHz reference, which we need in another part of the system)
That is of course a different issue. And seeing no 10MHz reference is surely
something you can test for elsewhere.
What I would like to see is handling of the error condition. Of course it is
The purpose of chrony is to discipline the local clock, not to test GPS
receivers.
You could run a cron job which looks at the PPS reach every 5 min and if it
finds it has dropped to 0, it can do something like let you know your gps has
problems. But why should that be chrony's job? It is giving you the best
estimate of UTC it can given the data. I certainly would not want it giving me
worse estimates.
understandable that
there is no time syncing when there are no PPS pulses, but the condition
Sure there is. You can still use the past info from PPS to sync the current
clock.
should be visible.
(e.g. by the stratum increasing and/or the source changing)
Miroslav is better placed to figure out what is happening within chrony when
it loses PPS input. Given the uncertainty in the rate as estimated from the
PPS, the PPS data from 13 hrs ago is still probably a better estimate of the
current time than is the network time from the other systems.
It isn't! Network time from the other systems would be about 1500us out,
time was now 3400us out.
No idea what you mean. As I said I have seen no evidence about how you
determined those figures.
However, that is not the main point.
Remember that they are at poll 10, which is 1024 seconds (about 17 min),
so the network time sources have not had that many "measurements" in that
time interval and those are pretty crappy (10ms std dev which is really huge).
The PPS std dev is in the ns range-- about 10000 times better.
I don't think the shown output in the last column of "chronyc sources" is the
stddev.
Right now that column still indicates 10ms, but when I use "chronyc
sourcestats" the last
column actually has a header Std Dev and the values are around 40-60us.
So the PPS is
still, even 13 hrs later, a better estimate of the true time than are those
crappy network sources.
The network sources aren't crappy. There is a systematic offset but the
variation is low.
No, it is not.
I have no idea what the figure in the last column of sources means, it has no
header.
The above situation occurred with chrony 2.1
However, I have reproduced it with an installation updated to version 3.2
although with an "outage" time of 15 minutes.
It had Reach 0 but still was indicating lock to PPS after 869 seconds.
The star means that the PPS is the best indicator of what the true time now
is.
Even when it has not provided information for 13 hours?
Sure.
Is it to be considered a bug, or is this just a design feature?
It is neither a bug nor a "design feature" (by which I assume you mean it is
not working properly but the designer does not care-- that is how it is
often taken to mean).
Of course it could be that the design has a different objective.
We need the time to be very accurate (preferably within 2us but certainly
within 20us)
chrony's job is to try to do the best job it can with the data available of
disciplining the local clock. That is its job. That you want that discipline
to have a certain accuracy is a separate job, which you could handle by having
a cron job look at the log files for example.
and it looks like chrony is normally able to achieve that, but a design
feature could
be that it is freewheeling on loss of sync rather than indicating an error.
But if that freewheeling is more accurate than the other clock sources, why
would you object to freewheeling?
I don't mind that it is freewheeling but I need an indication of that -
because I need to
turn off our application as I know it does not take long for time to wander
out of the 20us
You know that how?
window. Assuming 3400us of wander in 13 hours we should not be without sync
Again, you have not told us how you determined that 3400us.
for
more than 5 minutes without knowing it.
Why? How are you arriving at these figures?
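The five-minute figure above appears to come from a linear extrapolation. A minimal Python sketch of that arithmetic follows. It takes the disputed 3400us-in-13-hours figure at face value and assumes the clock wanders at a constant rate, which real oscillator drift does not. The function name is illustrative only:

```python
def seconds_until_exceeded(wander_us, elapsed_s, tolerance_us):
    """Time for a constant-rate drift to use up the tolerance budget."""
    # Assumption: drift is linear, so rate = total wander / elapsed time.
    rate_us_per_s = wander_us / elapsed_s
    return tolerance_us / rate_us_per_s


# 3400us of wander over 13 hours, against a 20us tolerance window.
holdover_s = seconds_until_exceeded(3400.0, 13 * 3600, 20.0)
print(f"{holdover_s:.0f} s (~{holdover_s / 60:.1f} min)")
```

Under those assumptions the 20us window is used up in a bit under five minutes. If the 3400us figure is wrong, so is the result.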
Here it indicates that the PPS is still, 13 hrs later, the
best indication of the offset from UTC. Now, this assumption that it is the
best could be off itself. For example if the time span used by the PPS was
overnight when the machine was cool inside, and during the day the machine is
used a lot and heats up, then the estimate from the PPS rate could well be
off because those kinds of jump in the rate would not enter into the estimate
of the skew for the PPS. (If the PPS had accumulated 64 samples at 16 sec per
sample, that is only about 17 min, so the time span over which the PPS is
measuring the rate and the changes in the rate is quite short and would not
capture large rate deviations which occur with non-gaussian distribution--
like the heating up every morning.)
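chrony expresses polling intervals as powers of two seconds, so the time spans discussed here can be checked with a few lines of Python. The 64-sample count is the figure quoted above, not something verified against this particular configuration:

```python
def poll_seconds(poll):
    """Convert a chrony poll value (log2 of seconds) to seconds."""
    return 2 ** poll


# PPS at poll 4 with 64 accumulated samples (sample count as quoted above).
pps_window_s = 64 * poll_seconds(4)

# Network sources at poll 10.
net_interval_s = poll_seconds(10)

# Roughly how many network measurements fit into the 13 h outage.
net_polls_in_outage = round(13 * 3600 / net_interval_s)

print(pps_window_s, net_interval_s, net_polls_in_outage)
```

Both the PPS regression window and a single network poll interval come out at 1024 seconds, a little over 17 minutes.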
Well, the systems are in a rack in airconditioned system rooms, it should not
be a big
problem.
It is internal temperatures, not room temperatures that are important.
How could we work around that in this case?
It is not clear what it is you want to work around. From all the data, the
PPS 13 hrs ago is still the best estimate of the UTC. Why would you want
chrony to use a measurably much worse source just because the PPS has not been
heard from for 13 hrs? Eventually the PPS from the remote past is no longer as
good as the relatively really crappy time from the network, but that could
take days.
What I need mostly is the information that there is no sync. And, I would
expect that
when chrony notices that the offset from the external sources is larger than
it usually
But it isn't.
is, that it starts tracking those sources instead of running free. But that
does not
really matter because the time offset is way out of our tolerances by that
time.
Again, how do you know?
Rob
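The cron check suggested earlier in this message (look at the PPS Reach every 5 minutes and raise a flag when it drops to 0) could be sketched in Python as below. This is only an illustration: it parses the plain-text layout of the "chronyc sources" listings shown in this thread, the function and variable names are made up, and how the alert is delivered is left open.

```python
import re

# One data row of the `chronyc sources` text report, as shown in this thread:
# mode char, state char, name, stratum, poll, reach (printed in octal), ...
ROW = re.compile(r"^([\^=#])([*+\-?x~])\s+(\S+)\s+(\d+)\s+(-?\d+)\s+([0-7]+)\s")


def stale_selected_sources(report):
    """Return names of sources still marked '*' whose reach register is 0."""
    stale = []
    for line in report.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, separator, or unrecognised line
        state, name = m.group(2), m.group(3)
        reach = int(m.group(6), 8)  # reach is an 8-bit register shown in octal
        if state == "*" and reach == 0:
            stale.append(name)
    return stale


# Sample input copied from the failure case reported above.
SAMPLE = """\
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
#* PPS 0 4 0 13h -279ns[ -401ns] +/- 79ns
^- xxxxxx.xxxx.xxx 1 10 377 250 +3462us[+3462us] +/- 10ms
"""

print(stale_selected_sources(SAMPLE))
```

In a real cron job the report text would come from running "chronyc sources" (for example via subprocess.run with capture_output=True and text=True), and a non-empty result would trigger whatever alerting is already in place.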
--
To unsubscribe email chrony-users-request@xxxxxxxxxxxxxxxxxxxx with
"unsubscribe" in the subject.
For help email chrony-users-request@xxxxxxxxxxxxxxxxxxxx with "help" in the
subject.
Trouble? Email listmaster@xxxxxxxxxxxxxxxxxxxx.