[chrony-dev] PPS reference clock rejected because of high dispersion |
[ Thread Index | Date Index | More chrony.tuxfamily.org/chrony-dev Archives ]
Hi, I've been using chrony for 4 years on our GPS boxes, and it was always providing a reliable service. I used version 1.24, the first release supporting reference clocks. As we recently upgraded our Linux kernel to 3.10, we have to upgrade chrony as well. Version 1.24 is not supporting the 3.x kernels. So I upgraded the chrony package to version 1.29.1 and installed it on our test units. Unfortunately I ran into a problem. I'm using the following scenario: One reference clock based on the shared memory interface. An external tools puts absolute timestamps in this shared memory based on ZDA messages received from a GPS card. Two reference clocks based on the PPS interface. One of those interfaces receives PPS pulses from the GPS card, the other interface is idle. The reason for two interfaces is that there are two possible hardware PPS connections to the GPS card, but only one is actually wired. The problem is that the PPS reference clock is rejected after 32 seconds and the system falls back to the shared memory reference clock. It never recovered from this state. I found this problem after 3 days. I gathered the following extra information: -------------------- chronyc -m sources sourcestats 210 Number of sources = 3 MS Name/IP address Stratum Poll Reach LastRx Last sample =============================================================================== #* SPMC 0 2 377 5 +479us[ +583us] +/- 200ms #? PPSI 0 2 0 67h +11.2s[ +11.1s] +/- 14.8s #? PPSE 0 2 0 10y +0ns[ +0ns] +/- 0ns 210 Number of sources = 3 Name/IP Address NP NR Span Frequency Freq Skew Offset Std Dev ============================================================================== SPMC 24 14 91 0.125 17.761 +381ns 590us PPSI 39 20 294 0.984 1.358 -9961us 109us PPSE 0 0 0 0.000 2000.000 +0ns 4000ms -------------------- chrony.conf: # driftfile driftfile /mnt/maindrive/cache/chrony.drift # dump measurements dir dumpdir /mnt/maindrive/cache/chrony.dump dumponexit # allow access from any NTP client allow # make chrony always step when clock is off by 10 seconds makestep 10 -1 # rtc configuration rtcdevice /dev/rtc0 rtconutc rtcfile /mnt/maindrive/cache/chrony.rtc # chronyc security commandkey 1 keyfile /etc/chrony.keys # reference clocks refclock SHM 0 refid SPMC poll 2 delay 0.4 refclock PPS /dev/pps0 refid PPSI poll 2 lock SPMC refclock PPS /dev/pps1 refid PPSE poll 2 lock SPMC -------------------- syslog: Fri May 9 11:57:08 2014 daemon.info chronyd[1675]: chronyd version 1.29.1 starting Fri May 9 11:57:08 2014 daemon.info chronyd[1675]: Linux kernel major=3 minor=10 patch=34 Fri May 9 11:57:08 2014 daemon.info chronyd[1675]: hz=100 shift_hz=7 freq_scale=1.00000000 nominal_tick=10000 slew_delta_tick=833 m Fri May 9 11:57:08 2014 daemon.info chronyd[1675]: Frequency -71.319 +/- 0.005 ppm read from /mnt/maindrive/cache/chrony.drift Fri May 9 11:57:10 2014 daemon.warn chronyd[1675]: NTP packet received from unauthorised host 192.168.63.159 port 4669 Fri May 9 11:57:17 2014 daemon.info chronyd[1675]: System trim from RTC = -0.642922 Fri May 9 11:57:24 2014 daemon.info chronyd[1675]: Selected source SPMC Fri May 9 11:57:24 2014 daemon.info chronyd[1675]: Selected source PPSI Fri May 9 11:57:56 2014 daemon.info chronyd[1675]: Selected source SPMC -------------------- I also attached a debugger to the chrony process to find out more. I saw that in refclock.c:423 (RCL_AddPulse) the if statement fails and returns 0 because the dispersion value is more than 11 seconds. This value came from the function call filter_get_avg_sample_dispersion in refclock.c:386 Then I set the filter->index on -1 and filter->used on 0 for instance->filter, but that didn't help. Then I set filter->avg_var on 0, and finally chrony started to accept samples again from the PPS refclock. My first question is, how is it possible that this value became so big? Is it caused by a delayed zda message used for disciplining the SPMC refclock? Or could this be a bug in the median_filter? And how can we prevent/resolve this situation? Shouldn't the median filter be reset/cleared when it gets into a situation like this? I see that filter_reset is called in refclock.c:450, which is used when the PPS is not specifically locked to another clock, so should it be called at refclock.c:424 as well? And looking at filter_reset, it doesn't clear the avg_var and avg_var_n values. Shouldn't those be cleared here? I hope you guys can shed some light on this! Kind regards, Tjalling Hattink |
Mail converted by MHonArc 2.6.19+ | http://listengine.tuxfamily.org/ |