RE: [chrony-dev] PPS reference clock rejected because of high dispersion

[ Thread Index | Date Index | More chrony.tuxfamily.org/chrony-dev Archives ]


> 
> > My first question is, how is it possible that this value became so
> big?
> > Is it caused by a delayed zda message used for disciplining the SPMC
> > refclock? Or could this be a bug in the median_filter?
> 
> I'm not sure, I'd guess some SHM samples were bad and the PPS samples
> were aligned to incorrect seconds, which resulted in the large
> dispersion.
> 
> Can you please enable the refclocks log so we can the individual SHM
> and PPS samples?

I have been able to reproduce this by inserting outliers through the SHM
interface. The samples are so delayed that the offset is multiple
seconds. This is caused by our software stalling because of cpu power
shortage. Here is a snippet of the refclocks log in that situation:

========================================================================
=======
   Date (UTC) Time         Refid  DP L P  Raw offset   Cooked offset
Disp.
========================================================================
=======
2014-05-15 11:47:03.988805 SPMC    0 N 0  1.118600e-02  1.119526e-02
5.085e-06
2014-05-15 11:47:04.999991 PPSI    0 N 1 -2.390000e-07  9.018000e-06
4.086e-06
2014-05-15 11:47:04.987737 SPMC    1 N 0  1.225600e-02  1.226294e-02
5.085e-06
2014-05-15 11:47:05.999990 PPSI    1 N 1  2.731000e-06  9.673000e-06
4.086e-06
2014-05-15 11:47:06.214993 SPMC    2 N 0 -2.149980e-01 -2.149928e-01
3.735e-06
2014-05-15 11:47:06.999991 PPSI    2 N 1  3.494000e-06  8.701000e-06
2.736e-06
2014-05-15 11:47:07.999987 PPSI    3 N 1  8.520000e-06  1.242500e-05
2.302e-06
2014-05-15 11:47:05.499991 PPSI    - N -       -        9.345500e-06
3.086e-06
2014-05-15 11:47:12.794035 SPMC    0 N 0 -4.794036e+00 -4.794035e+00
2.304e-06
2014-05-15 11:47:12.999989 PPSI    0 N 1 -4.999990e+00 -4.999990e+00
7.682e-07
2014-05-15 11:47:13.596917 SPMC    1 N 0 -3.596918e+00 -3.596917e+00
2.228e-06
2014-05-15 11:47:13.999987 PPSI    1 N 1 -3.999989e+00 -3.999988e+00
6.922e-07
2014-05-15 11:47:14.767365 SPMC    2 N 0 -7.673660e-01 -7.673655e-01
2.171e-06
2014-05-15 11:47:14.999986 PPSI    2 N 1 -9.999877e-01 -9.999872e-01
6.352e-07
2014-05-15 11:47:15.309933 SPMC    3 N 0 -3.099330e-01 -3.099326e-01
2.128e-06
2014-05-15 11:47:14.182141 SPMC    - N -       -       -2.182141e+00
2.001e+00
2014-05-15 11:47:15.999988 PPSI    3 N 1  1.166100e-05  1.204600e-05
5.922e-07
2014-05-15 11:47:14.499987 PPSI    - N -       -       -2.499988e+00
2.121e+00
2014-05-15 11:47:16.997060 SPMC    0 N 0  2.940000e-03  2.940168e-03
2.001e+00
2014-05-15 11:47:18.989310 SPMC    2 N 0  1.069000e-02  1.069009e-02
2.001e+00
2014-05-15 11:47:19.989252 SPMC    3 N 0  1.074800e-02  1.074807e-02
2.001e+00
2014-05-15 11:47:20.989338 SPMC    0 N 0  1.066200e-02  1.066205e-02
2.001e+00
2014-05-15 11:47:21.989151 SPMC    1 N 0  1.084900e-02  1.084904e-02
2.001e+00
2014-05-15 11:47:24.008873 SPMC    3 N 0 -8.873000e-03 -8.872978e-03
2.001e+00
2014-05-15 11:47:20.489295 SPMC    - N -       -        1.070506e-02
1.415e+00
2014-05-15 11:47:25.037856 SPMC    0 N 0 -3.785600e-02 -3.785598e-02
1.415e+00
2014-05-15 11:47:25.989173 SPMC    1 N 0  1.082700e-02  1.082701e-02
1.415e+00
2014-05-15 11:47:26.988122 SPMC    2 N 0  1.187800e-02  1.187801e-02
1.415e+00
2014-05-15 11:47:28.094069 SPMC    3 N 0 -9.406900e-02 -9.406899e-02
1.415e+00
2014-05-15 11:47:25.513515 SPMC    - N -       -       -1.351449e-02
1.155e+00
2014-05-15 11:47:29.136062 SPMC    0 N 0 -1.360620e-01 -1.360620e-01
1.155e+00
2014-05-15 11:47:30.068753 SPMC    1 N 0 -6.875300e-02 -6.875300e-02
1.155e+00

> 
> > And how can we prevent/resolve this situation? Shouldn't the median
> > filter be reset/cleared when it gets into a situation like this? I
> see
> > that filter_reset is called in refclock.c:450, which is used when
the
> > PPS is not specifically locked to another clock, so should it be
> > called at refclock.c:424 as well?
> 
> Possibly.
> 
> > And looking at filter_reset, it doesn't clear the avg_var and
> > avg_var_n values. Shouldn't those be cleared here?
> 
> The avg_var is a long-term statistic that needs to be preserved,
> filter_reset prepares the filter for the next batch of samples, but we
> can rename it to filter_clean and filter_reset would reset also the
the
> variance stat.
> 

I've done further testing and investigations, and was able to cook up a
patch that prevents this situation. In short, the patch will reject PPS
pulses when the last sample from the locked ref clock is an outlier:

--- a/refclock.c
+++ b/refclock.c
@@ -400,7 +400,7 @@ RCL_AddPulse(RCL_Instance instance, stru
 
   if (instance->lock_ref != -1) {
     struct timeval ref_sample_time;
-    double sample_diff, ref_offset, ref_dispersion, shift;
+    double sample_diff, ref_offset, ref_dispersion;
 
     if (!filter_get_last_sample(&refclocks[instance->lock_ref].filter,
           &ref_sample_time, &ref_offset, &ref_dispersion))
@@ -410,15 +410,10 @@ RCL_AddPulse(RCL_Instance instance, stru
     if (fabs(sample_diff) >= 2.0 / rate)
       return 0;
 
-    /* Align the offset to the reference sample */
-    if ((ref_offset - offset) >= 0.0)
-      shift = (long)((ref_offset - offset) * rate + 0.5) /
(double)rate;
-    else
-      shift = (long)((ref_offset - offset) * rate - 0.5) /
(double)rate;
-
-    offset += shift;
+    if (ref_dispersion >= 0.5 / rate)
+      return 0;
 
-    if (fabs(ref_offset - offset) + ref_dispersion + dispersion >= 0.2
/ rate)
+    if (fabs(ref_offset - offset) >= 0.5 / rate)
       return 0;
 
 #if 0

The original alignment code is removed. Instead I check first if the
dispersion of the ref clock is smaller than half the rate, otherwise you
cannot reliable align the pps anymore to the refclock.

Then I check if the difference between the offsets of the PPS and the
locked refclock is not more than half the pps pulse length. If it is
more, I reject the sample (I think it is an outlier).

In the old situation the sample would be aligned using a shift, but that
actually caused the PPS sample to become an outlier as well and it would
increase dispersion of the PPS a lot. And in the old check where
ref_dispersion and dispersion are used (refclock.c:421), the increased
dispersion alone would cause all subsequent samples to be rejected. And
as the filter is never updated the dispersion never became lower. So a
deadlock, which matches my bug report.

What do you think of this patch? We are going to use it in our code base
until either a better solution is found, although I'm quite happy with
the current one.

Best regards,

Tjalling Hattink

--
To unsubscribe email chrony-dev-request@xxxxxxxxxxxxxxxxxxxx with "unsubscribe" in the subject.
For help email chrony-dev-request@xxxxxxxxxxxxxxxxxxxx with "help" in the subject.
Trouble?  Email listmaster@xxxxxxxxxxxxxxxxxxxx.


Mail converted by MHonArc 2.6.19+ http://listengine.tuxfamily.org/