Re: [chrony-users] Sporadic NTP dropouts (bad maximum delay ratio and maximum delay dev ratio)

[ Thread Index | Date Index | More chrony.tuxfamily.org/chrony-users Archives ]


Sorry for the long silence. I finally had the chance to debug this on the real system. There were indeed situations where the test persons just rebooted the whole system. The true origin of the problem are cases where chronyc is not able to talk to the daemon ("506 cannot talk to daemon"), thus the monitoring tool using chronyc reported status unsynchronized. This happened during normal operation of the system.
I could not find any hints in system logs that chrony daemon reported problems or was restarted. Can you give me any hints where and how to debug these occurrences?


Am 29. Mai 2015 21:22:57 MESZ, schrieb Bill Unruh <unruh@xxxxxxxxxxxxxx>:
It really looks like your system is being switched off and then on again. The
itnial is a burst of about 5 queries at 2 sec intervals. Then it does 32 sec
intervals (poll 5) and suddenly it dies for 8 min and starts up the 2 sec
intervals again. Something is basically stopping chrony and restarting it .

Note also that the number of polls used has suddenly dropped from 14 to 3 (the
minimum) which again suggests that the system stopped and restarted.

Are you sure that your C program does not tell chrony to stop when it queries
it?






William G. Unruh | Canadian Institute for| Tel: +1(604)822-3273
Physics&Astronomy | Advanced Research | Fax: +1(604)822-5324
UBC, Vancouver,BC | Program in Cosmology | unruh@xxxxxxxxxxxxxx
Canada V6T 1Z1 | and Gravity | www.theory.physics.ubc.ca/

On Fri, 29 May 2015, Ulrich Schwesinger wrote:

Thanks for the helpful feedback so far. I contacted the server admin, and he says there should be nothing that makes the server
jump...

I am wondering about a couple of things:
* Is there any specific test among these all, or a combination of those which will make the status go to unsynched?
* Why is there for example an 8 seconds gap in the log entries? Usually they are < 1 second. It also looks to me that score
and root dispersion (sorry, not sure what that exactly is) are kind of "reset" when that happens.
* Why does the stratum test fail? Can you explain what happens?


I have attached another failure case with information from all 3 logs. I really appreciate you guys helping, thanks!!




Date (UTC) Time IP Address L St 1234 abc 5678 LP RP Score Offset Peer del. Peer disp. Root del. Root disp.


2015-05-26 10:50:33 192.168.0.30 N 13 1111 111 1111 5 5 0.00 1.800e-05 6.681e-04 8.968e-02 0.000e+00 7.948e+00
2015-05-26 10:50:35 192.168.0.30 N 13 1111 111 1111 5 5 0.00 5.542e-03 5.040e-04 6.726e-02 0.000e+00 7.948e+00
2015-05-26 10:50:37 192.168.0.30 N 13 1111 111 1111 5 5 0.00 1.350e-02 6.620e-04 3.784e-02 0.000e+00 7.947e+00
2015-05-26 10:50:39 192.168.0.30 N 13 1111 111 1111 5 5 0.00 3.107e-03 6.741e-04 1.877e-02 0.000e+00 7.948e+00
2015-05-26 10:51:12 192.168.0.30 N 13 1111 111 1111 5 5 0.00 -2.444e-03 6.790e-04 3.731e-06 0.000e+00 7.948e+00
2015-05-26 10:51:44 192.168.0.30 N 13 1111 111 1111 5 5 0.00 -3.990e-03 7.220e-04 1.744e-06 0.000e+00 3.948e+00
2015-05-26 10:52:17 192.168.0.30 N 13 1111 111 1111 5 5 0.03 -4.055e-03 6.850e-04 1.116e-06 0.000e+00 3.948e+00
2015-05-26 10:52:49 192.168.0.30 N 13 1111 111 1111 5 5 0.08 -2.306e-03 6.820e-04 1.141e-06 0.000e+00 1.948e+00
2015-05-26 10:53:22 192.168.0.30 N 13 1111 111 1111 5 5 0.17 1.900e-05 6.740e-04 1.060e-06 0.000e+00 1.949e+00
2015-05-26 10:53:54 192.168.0.30 N 13 1111 111 1111 5 5 0.28 -3.000e-05 6.620e-04 1.122e-06 0.000e+00 9.482e-01
2015-05-26 10:54:26 192.168.0.30 N 13 1111 111 1111 5 5 0.42 0.000e+00 6.820e-04 1.144e-06 0.000e+00 9.487e-01
2015-05-26 10:54:58 192.168.0.30 N 13 1111 111 1111 5 5 0.58 -2.400e-05 6.940e-04 1.046e-06 0.000e+00 4.483e-01
2015-05-26 10:55:31 192.168.0.30 N 13 1111 111 1111 5 5 0.78 -1.200e-05 7.160e-04 1.100e-06 0.000e+00 4.488e-01
2015-05-26 10:56:03 192.168.0.30 N 13 1111 111 1111 5 5 1.00 -4.000e-05 6.740e-04 1.069e-06 0.000e+00 1.984e-01


Date (UTC) Time IP Address L St 1234 abc 5678 LP RP Score Offset Peer del. Peer disp. Root del. Root disp.


2015-05-26 11:04:25 192.168.0.30 N 13 1111 111 1111 5 5 0.00 -2.820e-04 6.490e-04 2.336e-06 0.000e+00 7.948e+00
2015-05-26 11:04:27 192.168.0.30 ? 16 1111 111 1001 5 5 0.00 -6.820e-04 6.341e-04 2.336e-06 0.000e+00 0.000e+00




Date (UTC) Time IP Address Std dev'n Est offset Offset sd Diff freq Est skew Stress Ns Bs Nr


2015-05-26 10:50:37 192.168.0.30 7.903e-04 -1.340e-02 5.192e-04 -3.315e-03 2.150e-01 1.7e+00 3 0 3
2015-05-26 10:50:39 192.168.0.30 1.205e-03 -3.355e-03 6.819e-04 -2.391e-03 1.222e-02 4.3e-03 4 0 3
2015-05-26 10:51:12 192.168.0.30 1.044e-04 2.444e-03 4.670e-05 8.745e-06 8.600e-04 2.0e-01 5 0 3
2015-05-26 10:51:44 192.168.0.30 1.013e-04 3.988e-03 6.019e-05 1.231e-04 2.202e-05 1.3e-01 6 0 3
2015-05-26 10:52:17 192.168.0.30 8.010e-05 4.048e-03 4.742e-05 1.236e-04 7.696e-06 3.0e-02 7 0 3
2015-05-26 10:52:49 192.168.0.30 6.815e-05 2.296e-03 3.978e-05 7.004e-05 3.901e-06 4.0e-02 8 0 3
2015-05-26 10:53:22 192.168.0.30 5.993e-05 -1.048e-05 3.397e-05 -5.451e-08 2.333e-06 3.4e-02 9 0 4
2015-05-26 10:53:54 192.168.0.30 5.444e-05 1.598e-05 2.981e-05 1.334e-07 1.557e-06 5.8e-02 10 0 5
2015-05-26 10:54:26 192.168.0.30 4.982e-05 2.656e-07 2.672e-05 1.351e-08 1.111e-06 1.0e-03 11 0 6
2015-05-26 10:54:58 192.168.0.30 4.679e-05 9.727e-06 2.459e-05 6.208e-08 8.430e-07 5.4e-02 12 0 7
2015-05-26 10:55:31 192.168.0.30 4.411e-05 4.470e-06 2.280e-05 3.192e-08 6.599e-07 2.7e-02 13 0 7
2015-05-26 10:56:03 192.168.0.30 4.294e-05 1.455e-05 2.140e-05 7.539e-08 5.373e-07 1.1e-01 14 0 9


Date (UTC) Time IP Address Std dev'n Est offset Offset sd Diff freq Est skew Stress Ns Bs Nr


2015-05-26 11:04:38 192.168.0.30 5.068e-06 2.591e-03 6.470e-06 1.735e-04 3.579e-04 8.7e-02 3 0 3
2015-05-26 11:04:40 192.168.0.30 5.523e-06 3.857e-04 5.336e-06 1.739e-04 1.352e-05 1.0e-03 4 0 3



Date (UTC) Time IP Address St Freq ppm Skew ppm Offset L Co Offset sd Rem. corr.


2015-05-26 10:50:33 0.0.0.0 0 -76.325 0.364 0.000e+00 ? 0 0.000e+00 0.000e+00
2015-05-26 10:50:37 192.168.0.30 14 -76.325 0.364 -1.341e-02 N 1 5.192e-04 -1.135e-01
2015-05-26 10:50:39 192.168.0.30 14 -76.325 0.364 -3.356e-03 N 1 6.819e-04 -5.630e-02
2015-05-26 10:51:12 192.168.0.30 14 -76.325 0.364 2.444e-03 N 1 4.670e-05 -4.000e-06
2015-05-26 10:51:44 192.168.0.30 14 -76.224 3.906 3.988e-03 N 1 6.019e-05 0.000e+00
2015-05-26 10:52:17 192.168.0.30 14 -22.335 66.861 4.048e-03 N 1 4.742e-05 0.000e+00
2015-05-26 10:52:49 192.168.0.30 14 47.622 6.329 2.296e-03 N 1 3.978e-05 0.000e+00
2015-05-26 10:53:22 192.168.0.30 14 47.570 2.518 -1.048e-05 N 1 3.397e-05 0.000e+00
2015-05-26 10:53:54 192.168.0.30 14 47.688 1.708 1.598e-05 N 1 2.981e-05 4.000e-06
2015-05-26 10:54:26 192.168.0.30 14 47.700 1.189 2.656e-07 N 1 2.672e-05 -4.000e-06
2015-05-26 10:54:58 192.168.0.30 14 47.753 0.914 9.727e-06 N 1 2.459e-05 -3.000e-06
2015-05-26 10:55:31 192.168.0.30 14 47.780 0.709 4.470e-06 N 1 2.280e-05 -1.000e-06
2015-05-26 10:56:03 192.168.0.30 14 47.843 0.593 1.455e-05 N 1 2.140e-05 -2.000e-06


Date (UTC) Time IP Address St Freq ppm Skew ppm Offset L Co Offset sd Rem. corr.


2015-05-26 11:04:25 0.0.0.0 0 -76.325 0.364 0.000e+00 ? 0 0.000e+00 0.000e+00

Am 28.05.2015 um 22:14 schrieb Bill Unruh:
On Thu, 28 May 2015, Ulrich Schwesinger wrote:


Tests 1234 abc 5678 are defined in the ntp specifications.

From RFC 1305

TEst 1 and 2 test to make sure that the timestamps make sense (eg are not the
wsame as an old packet and pairs with the last one sent to that peer) test3 is
that the originate and receive timestamps are non-zero, Test 4 is taht the
delay (round trip time) be reasonable, abc are subsets of that, 5 is
authentication, test 6 irequires peer clock by synchronized and that the
itnerval since the peer clock was last updated is positive and less than
NTP.MAXAGE, test 7 that the host has no lower stratum than the server, and 8
that the header contains reasonable values for rootdelay and rootdispersion.

Ie, if any of the tests are out the server's time is too suspect to use. Your
server is problematic.



I marked some lines that look suspicious to me. In in the first line, some test seems to fail [110]. In
the 2nd line, suddenly
the offset jumps up to -1.9 seconds.
For the 3rd line, from the documentation I found this:

Leap status: ? means the remote computer is not currently synchronised.
5678: Tests for maximum delay, maximum delay ratio and maximum delay dev ratio, against defined
parameters, and a test for
synchronisation loop (1=pass, 0=fail) [1111]

So that would mean that delay ratio and delay dev ratio is bad...not sure what that really means
though.


It means that the round trip time is out of spec. And it looks like your
source, whatever that is, went nuts and jumped by 2 sec suddenly. that is why
you should be using at least 3 sources. If one goes mad, the others can
outvote it. If you only use 1, then its time is by definition right, even if
it is out by 40 years.




I find this very hard to debug, is there any other thing I could do to find out what's exactly going
wrong?


Talk to the person who is sysadmin on the server and find out what happened. Use more than one server.



Thanks for your answers in advance,

Ulrich







--
Mit freundlichen Grüßen / Best regards


Ulrich Schwesinger

Autonomous Systems Lab
Institute of Robotics and Intelligent Systems
ETH Zürich
LEE J301, Leonhardstrasse 21
CH-8092 Zürich, Switzerland

Phone: +41 44 633 93 73
Fax: +41 44 632 11 81
E-Mail: ulrich.schwesinger@xxxxxxxxxxxx
WWW: www.asl.ethz.ch

Mail converted by MHonArc 2.6.19+ http://listengine.tuxfamily.org/