[chrony-dev] PPS reference clock rejected because of high dispersion

Hi,

I've been using chrony for 4 years on our GPS boxes, and it has always provided a reliable service. Until now I used version 1.24, the first release supporting reference clocks.

As we recently upgraded our Linux kernel to 3.10, we had to upgrade chrony as well, because version 1.24 does not support 3.x kernels. So I upgraded the chrony package to version 1.29.1 and installed it on our test units.

Unfortunately I ran into a problem. I'm using the following scenario:

One reference clock based on the shared memory interface. An external tool puts absolute timestamps into this shared memory, based on ZDA messages received from a GPS card.

Two reference clocks based on the PPS interface. One of those interfaces receives PPS pulses from the GPS card, the other interface is idle. The reason for two interfaces is that there are two possible hardware PPS connections to the GPS card, but only one is actually wired.

The problem is that the PPS reference clock is rejected after 32 seconds and the system falls back to the shared memory reference clock. It never recovers from this state; I only noticed the problem after 3 days. I gathered the following extra information:

--------------------

chronyc -m sources sourcestats

210 Number of sources = 3

MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
#* SPMC                          0    2   377      5 +479us[ +583us] +/- 200ms
#? PPSI                          0    2     0    67h +11.2s[ +11.1s] +/- 14.8s
#? PPSE                          0    2     0    10y   +0ns[   +0ns] +/-   0ns

210 Number of sources = 3

Name/IP Address            NP  NR  Span  Frequency  Freq Skew  Offset  Std Dev
==============================================================================
SPMC                       24  14    91      0.125     17.761   +381ns   590us
PPSI                       39  20   294      0.984      1.358  -9961us   109us
PPSE                        0   0     0      0.000   2000.000     +0ns  4000ms

--------------------

chrony.conf:

# driftfile

driftfile /mnt/maindrive/cache/chrony.drift

# dump measurements dir

dumpdir /mnt/maindrive/cache/chrony.dump

dumponexit

# allow access from any NTP client

allow

# make chrony always step when the clock is off by more than 10 seconds

makestep 10 -1

# rtc configuration

rtcdevice /dev/rtc0

rtconutc

rtcfile /mnt/maindrive/cache/chrony.rtc

# chronyc security

commandkey 1

keyfile /etc/chrony.keys

# reference clocks

refclock SHM 0 refid SPMC poll 2 delay 0.4

refclock PPS /dev/pps0 refid PPSI poll 2 lock SPMC

refclock PPS /dev/pps1 refid PPSE poll 2 lock SPMC

--------------------

syslog:

Fri May 9 11:57:08 2014 daemon.info chronyd[1675]: chronyd version 1.29.1 starting

Fri May 9 11:57:08 2014 daemon.info chronyd[1675]: Linux kernel major=3 minor=10 patch=34

Fri May 9 11:57:08 2014 daemon.info chronyd[1675]: hz=100 shift_hz=7 freq_scale=1.00000000 nominal_tick=10000 slew_delta_tick=833 m

Fri May 9 11:57:08 2014 daemon.info chronyd[1675]: Frequency -71.319 +/- 0.005 ppm read from /mnt/maindrive/cache/chrony.drift

Fri May 9 11:57:10 2014 daemon.warn chronyd[1675]: NTP packet received from unauthorised host 192.168.63.159 port 4669

Fri May 9 11:57:17 2014 daemon.info chronyd[1675]: System trim from RTC = -0.642922

Fri May 9 11:57:24 2014 daemon.info chronyd[1675]: Selected source SPMC

Fri May 9 11:57:24 2014 daemon.info chronyd[1675]: Selected source PPSI

Fri May 9 11:57:56 2014 daemon.info chronyd[1675]: Selected source SPMC

--------------------

I also attached a debugger to the chrony process to find out more. I saw that at refclock.c:423 (RCL_AddPulse) the check fails and the function returns 0 because the dispersion value is more than 11 seconds. This value comes from the call to filter_get_avg_sample_dispersion at refclock.c:386.

Then I set filter->index to -1 and filter->used to 0 for instance->filter, but that didn't help. Then I set filter->avg_var to 0, and chrony finally started to accept samples from the PPS refclock again.
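
For readers without the source at hand, the kind of check described above can be modelled roughly as follows. This is a toy sketch and not chrony's code: the function names, the parameter list and the 0.2 / rate acceptance limit are all assumptions made for illustration. It only shows why a large averaged dispersion would make every pulse fail such a test.

```c
#include <assert.h>

/* Toy model only, NOT chrony's source. All names here are hypothetical,
 * and the 0.2 / rate acceptance limit is an assumed value. */

static double toy_abs(double x)
{
    return x < 0.0 ? -x : x;
}

/* Returns 1 if the pulse would be accepted, 0 if it would be rejected.
 * ref_offset:     offset of the reference the PPS is locked to (s)
 * pulse_offset:   offset measured from the pulse (s)
 * avg_dispersion: averaged sample dispersion taken from a filter (s)
 * rate:           pulses per second */
static int toy_accept_pulse(double ref_offset, double pulse_offset,
                            double avg_dispersion, int rate)
{
    double periods, aligned;

    assert(rate > 0);

    /* Move the pulse offset by a whole number of pulse periods so that
     * it lands as close as possible to the reference offset. */
    periods = (ref_offset - pulse_offset) * rate;
    periods = (double)(long)(periods >= 0.0 ? periods + 0.5 : periods - 0.5);
    aligned = pulse_offset + periods / rate;

    /* Reject when the remaining disagreement plus the dispersion is too
     * large to pick the right pulse period with confidence. */
    return toy_abs(ref_offset - aligned) + avg_dispersion < 0.2 / rate;
}
```

In this model, an averaged dispersion of 11 s at a rate of 1 pulse per second exceeds the limit on its own, so the result is 0 whatever the pulse offset is.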

My first question is: how is it possible that this value became so big? Is it caused by a delayed ZDA message used for disciplining the SPMC refclock? Or could this be a bug in the median_filter?
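
To make the first possibility concrete, here is a small sketch of how a single bad measurement can dominate an averaged variance. It assumes the average is a plain running mean; that is an assumption for illustration, not something taken from the chrony source, and the struct and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical running mean of per-batch variances (units: s^2).
 * NOT chrony's algorithm; a plain cumulative mean is assumed. */
struct toy_var_avg {
    double avg_var; /* current mean of the variances seen so far */
    int n;          /* number of variances averaged */
};

static void toy_var_avg_update(struct toy_var_avg *a, double var)
{
    assert(var >= 0.0);

    /* Incremental mean: new_mean = old_mean + (x - old_mean) / n */
    a->n++;
    a->avg_var += (var - a->avg_var) / a->n;
}
```

Feeding this nine variances of 1e-8 s^2 (a standard deviation of about 100 us) followed by one of 121 s^2 (an 11 s outlier) leaves avg_var at roughly 12.1 s^2, i.e. a dispersion of several seconds. If no further samples reach the average after that, the value would stay there.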

And how can we prevent/resolve this situation? Shouldn't the median filter be reset/cleared when it gets into a situation like this? I see that filter_reset is called in refclock.c:450, which is used when the PPS is not specifically locked to another clock, so should it be called at refclock.c:424 as well?

And looking at filter_reset, it doesn't clear the avg_var and avg_var_n values. Shouldn't those be cleared here?
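
The difference between the two kinds of reset can be sketched like this. The struct below is a hypothetical stand-in that only borrows the four field names mentioned above (index, used, avg_var, avg_var_n); it is not chrony's filter structure, and both functions are illustrative rather than a proposed patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the filter state; only the field names
 * come from the description above. */
struct toy_filter {
    int index;      /* position of the last stored sample, -1 if none */
    int used;       /* number of stored samples */
    double avg_var; /* averaged variance of the samples */
    int avg_var_n;  /* number of values in that average */
};

/* Reset that only discards the stored samples, as described above. */
static void toy_filter_reset_samples(struct toy_filter *f)
{
    assert(f != NULL);
    f->index = -1;
    f->used = 0;
}

/* Variant that also discards the averaged variance. */
static void toy_filter_reset_all(struct toy_filter *f)
{
    toy_filter_reset_samples(f);
    f->avg_var = 0.0;
    f->avg_var_n = 0;
}
```

With the first variant a large avg_var survives the reset, which matches what I saw in the debugger; with the second it starts again from zero.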

I hope you guys can shed some light on this!

Kind regards,

Tjalling Hattink