On 2013-07-26 18:59, Bill Unruh wrote:
IF you can figure out what the "average" drift is, you could use adjtimex to
adjust the system clock's rate to take that out.
No, I can't. As you correctly point out below, this is impossible for such a
high drift.
adjtimex --tick=13000
adjtimex: Invalid argument
for this kernel:
USER_HZ = 100 (nominally 100 ticks per second)
9000 <= tick <= 11000
-32768000 <= frequency <= 32768000
and indeed the system log does occasionally include:
chronyd[463]: Required tick 13194 outside allowed range (9000 .. 11000)
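
For anyone wondering what those tick numbers mean in practice, here is a rough sketch of the arithmetic in Python. It is my own illustration, not chrony or adjtimex code; the function names are made up, and it assumes USER_HZ = 100 (so a nominal tick of 10000 microseconds) as reported above.

```python
# Illustrative arithmetic only; not taken from chrony or adjtimex.
# Assumption: USER_HZ = 100, so the nominal tick is 10000 microseconds.

USER_HZ = 100
NOMINAL_TICK = 1_000_000 // USER_HZ   # 10000 us added to the clock per tick
TICK_MIN, TICK_MAX = 9000, 11000      # range reported by adjtimex above


def tick_to_seconds_per_minute(tick):
    """Seconds per minute the clock is sped up (+) or slowed down (-)
    relative to the nominal tick value."""
    return (tick - NOMINAL_TICK) / NOMINAL_TICK * 60


def tick_allowed(tick):
    """True if the kernel would accept this tick value."""
    return TICK_MIN <= tick <= TICK_MAX


if __name__ == "__main__":
    # 11000 is the largest accepted value; 13194 is what chronyd asked for.
    for tick in (11000, 13194):
        rate = tick_to_seconds_per_minute(tick)
        print(f"tick {tick}: {rate:+.2f} s/min, allowed: {tick_allowed(tick)}")
```

So the largest accepted tick corresponds to a correction of about 6 seconds per minute, while the tick chronyd wanted here would need roughly 19 seconds per minute.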
> What I don't understand is this: chrony logs the following in
> /var/log/messages:
> > chronyd[490]: System clock wrong by 15.124741 seconds, adjustment started
> It does this (saying the clock is wrong by about 5-20s) even when the
> clock is wrong by hours.
I think that this is the "least squares" offset.
Alright, that's very different from what I thought it was.
It sends out a packet with a local time stamp. The remote server timestamps
the packet when it is received and when it is sent out again, and your
machine timestamps it when it comes back. The measured offset is the
difference between the means of the local timestamps and the remote
timestamps. chrony then takes the last N offsets (compensated for changes it
has made in the drift rate of the clock) and does a least squares fit to find
out what the best estimate is for the drift error and offset error. It also
tests to see if the deviations from the least squares fit look roughly
random. If not, it makes N smaller and tries again until N is 3. In your
system N seems to hang around 3 a lot.
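
To make the description above concrete, here is a minimal Python sketch of the two calculations: the offset from the four timestamps, and a straight-line least squares fit through a series of offsets. This is only my reading of Bill's explanation, not chrony's actual code. The function names and the sign convention (remote minus local) are my assumptions, and the step that tests the residuals and shrinks N is left out.

```python
# Sketch only; not chrony's implementation. Names are hypothetical.

def measured_offset(t1, t2, t3, t4):
    """Offset of the remote clock relative to the local clock.

    t1: local time when the request was sent
    t2: remote time when the request was received
    t3: remote time when the reply was sent
    t4: local time when the reply came back
    Assumed sign convention: mean of remote stamps minus mean of local stamps.
    """
    return (t2 + t3) / 2 - (t1 + t4) / 2


def least_squares_fit(times, offsets):
    """Fit offset = intercept + slope * time.

    Returns (slope, intercept). The slope estimates the drift error
    (seconds of offset per second) and the intercept the offset at time 0.
    """
    n = len(times)
    if n < 2 or n != len(offsets):
        raise ValueError("need at least two (time, offset) pairs")
    mean_t = sum(times) / n
    mean_o = sum(offsets) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all sample times are identical")
    sxy = sum((t - mean_t) * (o - mean_o) for t, o in zip(times, offsets))
    slope = sxy / sxx
    intercept = mean_o - slope * mean_t
    return slope, intercept
```

With a steady drift the offsets fall on a line and the fit is good. With a very uneven drift the points scatter, which would explain why N keeps getting cut down to 3.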
Thank you Bill for a *very* clear explanation. I think I finally understand
what you meant earlier - this system has 2 problems: Very uneven drift and
very high drift. The uneven drift causes the "Can't synchronise: no
majority" errors, and the high drift causes the "Required tick outside
allowed range" errors. So chrony can set neither an accurate adjustment nor
a quick enough one to compensate.
That is beyond the ability of chrony (or anything) to correct. The max drift
rate that can be compensated is 6 sec/minute. Jumping the clock is your
only option.
If I was going to live with this system as-is, then you would be right.
For anyone else reading this: An easy way to diagnose a sick machine is to
use something like:
adjtimex -c=10 -i=10
                                    --- current ---  -- suggested --
cmos time   system-cmos  error_ppm   tick      freq   tick      freq
1374906355    -0.660995
1374906367    -3.003384  -234238.9  11000         0
1374906379    -5.041906  -203852.2  11000         0  13038   3421012
1374906390    -6.129622  -108771.6  11000         0  12087   4691487
1374906402    -8.383569  -225394.7  11000         0  13253   6206387
1374906415   -11.768661  -338509.2  11000         0  14385    603062
1374906428   -15.120310  -335164.9  11000         0  14351   4253587
1374906440   -17.387367  -226705.7  11000         0  13267    373175
1374906453   -20.810277  -342291.0  11000         0  14422   5963612
1374906465   -23.090249  -227997.2  11000         0  13279   6370600
If those suggested "tick" values on the right are >11000 (i.e., drift >6s per
minute), then the timer is too broken for chrony to fix.
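
As a rough cross-check, the suggested "tick" column in the table above can be reproduced from error_ppm with a few lines of Python. This is my own back-of-the-envelope sketch, not adjtimex's real algorithm. It assumes USER_HZ = 100 (one tick unit is 100 ppm), a current freq of 0 as in the table, and that freq is in units of 1/65536 ppm; the function names are made up.

```python
import math

# Sketch only; it happens to reproduce the tick column of the table above.
# Assumptions: USER_HZ = 100, current freq = 0, freq unit = 1/65536 ppm.

PPM_PER_TICK = 100          # 1 us out of a 10000 us tick is 100 ppm
FREQ_UNITS_PER_PPM = 65536
TICK_MIN, TICK_MAX = 9000, 11000


def suggest(current_tick, error_ppm):
    """Return (tick, freq) that would cancel the measured error_ppm."""
    needed_ppm = -error_ppm                        # correction to apply
    whole_ticks = math.floor(needed_ppm / PPM_PER_TICK)
    leftover_ppm = needed_ppm - whole_ticks * PPM_PER_TICK
    freq = round(leftover_ppm * FREQ_UNITS_PER_PPM)
    return current_tick + whole_ticks, freq


def fixable(tick):
    """True if the suggested tick is one the kernel would accept."""
    return TICK_MIN <= tick <= TICK_MAX


if __name__ == "__main__":
    # error_ppm values copied from three rows of the table above
    for error_ppm in (-203852.2, -108771.6, -338509.2):
        tick, freq = suggest(11000, error_ppm)
        print(error_ppm, tick, freq, "ok" if fixable(tick) else "too broken")
```

The freq values come out slightly different from the table because error_ppm is only printed to one decimal place, but the tick values match, and every one of them is over the 11000 limit.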
So when Bill tells you that your machine is very sick, listen to him. :-)
This is a "hardware" issue (in the case of a virtual machine, something more
elaborate) that needs to be fixed - in my case, by the hosting service
provider.
Also, for the record, although virtual machines do suffer from many more
drift problems than physical machines, there is a difference between an
"inaccurate" clock (typical of virtual machines), and a "broken" clock (not
so typical). Most virtual machines are not "broken" and chrony works just
fine.