RE: [chrony-dev] PPS reference clock rejected because of high dispersion

> > I've done further testing and investigations, and was able to cook up
> > a patch that prevents this situation. In short, the patch will reject
> > PPS pulses when the last sample from the locked ref clock is an
> > outlier:
> 
> > -    offset += shift;
> > +    if (ref_dispersion >= 0.5 / rate)
> > +      return 0;
> >
> > -    if (fabs(ref_offset - offset) + ref_dispersion + dispersion >= 0.2 / rate)
> > +    if (fabs(ref_offset - offset) >= 0.5 / rate)
> >        return 0;
> 
> > The original alignment code is removed. Instead I check first if the
> > dispersion of the ref clock is smaller than half the pulse interval
> > (0.5 / rate), otherwise you cannot reliably align the PPS to the
> > refclock anymore.
> 
> I think the alignment code is necessary to allow offsets larger than 1
> second, if the code is removed the PPS offset could be off by a whole
> number of seconds and chronyd will not be able to correct a large
> initial offset on start. Also, I'm not sure if we want to allow locking
> to a source if the two dispersions together are larger than 0.5, the
> offset could be again off by a number of seconds.

The PPS clock will indeed not correct a large offset, but the refclock
(which is also listed as a valid source in the configuration) that is
locked to the PPS will be accepted and correct any initial large offset.
After that has happened the PPS clock should be within half a second of
the refclock and will be accepted. So I don't think the alignment code
is needed for this situation.
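
To make the proposal concrete, here is a minimal sketch of the two checks from the diff above, pulled out into a standalone function. The function name and parameter names are made up for this mail and are not chrony's actual API; I assume all values are in seconds and rate is in pulses per second.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper, not taken from chrony: returns 1 if a PPS pulse
   may be used, 0 if it should be dropped.  Offsets and dispersion are
   in seconds, rate is in pulses per second. */
static int pps_pulse_usable(double ref_offset, double ref_dispersion,
                            double pps_offset, double rate)
{
  double half_interval;

  assert(rate > 0.0);
  half_interval = 0.5 / rate;

  /* The reference is too uncertain to tell which second the pulse marks */
  if (ref_dispersion >= half_interval)
    return 0;

  /* The last reference sample is half an interval or more away from the
     pulse, so treat it as an outlier and skip this pulse */
  if (fabs(ref_offset - pps_offset) >= half_interval)
    return 0;

  return 1;
}
```

With a 1 Hz pulse, a reference sample 0.1 s away from the pulse passes, while one 3 s away is skipped without touching the PPS filter.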

I can imagine you will need alignment if the refclock itself is not used
as a valid source (I assume I can achieve that by setting noselect on
the refclock source). We could also make the alignment happen only when
the refclock does not take part as a valid source itself, and disable
it when both PPS and refclock are valid sources.

In our situation both clocks should be valid sources, as on some of our
products the PPS is not wired at all and only the SHM source is valid.
If the SHM source were marked noselect there, chrony would never get a
usable sample: SHM could not be selected, and PPS does not produce any
samples.

> If you think the 0.2 second limit is too restrictive, we could increase
> it a bit, but not by much to avoid the incorrect alignment.

In a normal situation the 0.2 factor is fine, but the problem in my case
is that the refclock samples were seconds off. To have them accepted the
limit would have had to be 15 seconds, which defeats its purpose.
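
As an illustration, this is roughly how I read the old check. Only the `offset += shift` line and the 0.2 / rate test appear in the diff; the way the shift is computed here (rounding the difference to a whole number of pulse intervals) is my assumption, and the function names are made up.

```c
#include <assert.h>
#include <math.h>

/* Round to the nearest whole number (small values only) */
static double nearest_whole(double x)
{
  return (double)(long)(x >= 0.0 ? x + 0.5 : x - 0.5);
}

/* Hypothetical stand-in for the old check.  Returns 1 and stores the
   aligned offset if the pulse is accepted, 0 otherwise. */
static int old_pps_check(double ref_offset, double ref_dispersion,
                         double offset, double dispersion, double rate,
                         double *aligned_offset)
{
  double shift;

  assert(rate > 0.0);

  /* Assumed alignment: move the pulse by a whole number of intervals so
     that it lands next to the reference sample */
  shift = nearest_whole((ref_offset - offset) * rate) / rate;
  offset += shift;

  /* Check from the diff: remaining distance plus both dispersions must
     stay below 0.2 of the pulse interval */
  if (fabs(ref_offset - offset) + ref_dispersion + dispersion >= 0.2 / rate)
    return 0;

  *aligned_offset = offset;
  return 1;
}
```

At 1 Hz, a reference outlier 3.05 s away with a small dispersion is accepted and drags the pulse along by 3 s, so the outlier in the reference becomes an outlier in the PPS samples. A reference dispersion of 0.3 s is enough on its own to reject every pulse.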

> > In the old situation the sample would be aligned using a shift, but
> > that actually caused the PPS sample to become an outlier as well and
> > it would increase dispersion of the PPS a lot. And in the old check
> > where ref_dispersion and dispersion are used (refclock.c:421), the
> > increased dispersion alone would cause all subsequent samples to be
> > rejected.
> 
> I think that's all right, we don't want to use the PPS sample unless we
> can be sure the sources are so stable that the PPS second will be
> aligned correctly.
> 
> If your SHM source is not very stable, you might want to remove the
> lock and noselect options, increase the poll option for the SHM source,
> let chronyd synchronize to SHM first and lock PPS to the system clock
> instead of the SHM source. The lock option was intended to be used only
> with stable sources.
 
The reason why I lock the PPS to the SHM clock is that I don't want the
PPS samples to be accepted at all when no valid SHM samples come in.
Both clocks are published by the same GNSS card. The PPS samples will
still be published by the card when it has no lock, but the SHM samples
will be rejected. In this case the PPS samples should also be rejected
as they are free-running.

In the future we might have the scenario that we use two GNSS cards to
discipline the clock, so two SHM and two PPS sources. But if one GNSS
card is in lock (SHM1 with PPS1 for example), but the second card is not
(SHM2 with PPS2), then when I do not explicitly lock PPS2 to SHM2 chrony
could decide to still use PPS2 as a valid source, because SHM1 is
providing valid samples. And PPS2 is drifting away, causing chrony to
drift as well. This I would like to prevent.

I tried the scenario where the PPS is locked to the system clock. When I
generate SHM outliers the root dispersion of the system clock increases
a lot, and although the PPS is still accepted it will probably take
hours before the root dispersion is lowered again to the dispersion of
the PPS. So even though the PPS is very stable (20 us dispersion), the
root dispersion published to other NTP clients is high (100 ms
sometimes). When PPS is locked to the SHM clock the root dispersion
published is much closer to the PPS dispersion, which I think is closer
to the truth.

> > And as the filter is never updated the dispersion never became lower.
> > So a deadlock, which matches my bug report.
> 
> That's the bug we need to fix, but I liked better your original
> suggestion to reset the filter and the variance statistic when the
> check fails (maybe not always, but after some number of times).
> 

- Quote from your second email:
>
> Here is another idea. Don't include the PPS dispersion in the
> original check, so it will continue to collect new samples and
> update the variance statistic, but add a new check with the PPS
> dispersion and set a flag to not accumulate the filtered sample
> in poll_timeout() until the dispersion is good.

Agreed, this is a bug. But resetting the variance statistics is a harsh
way to work around it. Somehow the variance statistics should still be
updated, even if the sample itself is rejected. Your suggestion should
do this, so we could try it.
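
To check that I understand the suggestion, here is a rough sketch of how I picture it. Everything in it is hypothetical: the struct, the field names and both functions are invented for this mail, the sample counter merely stands in for the real filter and variance update, and I assume the same 0.2 / rate limit is used for both checks.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical state for one PPS source; no names are taken from chrony */
struct pps_state {
  int samples;  /* pulses collected since the last poll */
  int hold;     /* 1 = do not accumulate the filtered sample */
};

/* Returns 1 if the pulse entered the filter, 0 if it was dropped */
static int add_pulse(struct pps_state *s, double ref_offset,
                     double ref_dispersion, double offset,
                     double pps_dispersion, double rate)
{
  double limit, distance;

  assert(rate > 0.0);
  limit = 0.2 / rate;
  distance = fabs(ref_offset - offset) + ref_dispersion;

  /* First check, without the PPS dispersion: decides whether the pulse
     enters the filter at all */
  if (distance >= limit)
    return 0;

  /* The filter keeps collecting, so the variance can recover */
  s->samples++;

  /* Second check, with the PPS dispersion: only controls the flag */
  s->hold = distance + pps_dispersion >= limit;
  return 1;
}

/* Called once per poll interval; returns 1 if the filtered sample
   would be passed on, 0 if it is held back */
static int poll_source(struct pps_state *s)
{
  int accumulate = s->samples > 0 && !s->hold;

  s->samples = 0;
  return accumulate;
}
```

The point of the split is that a high PPS dispersion no longer blocks new pulses from reaching the filter, so the dispersion can come down again and the flag clears by itself.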

When outliers are inserted in the variance statistics filter, are those
rejected or always accepted by the filter? If they were rejected, I
would not expect the filter to jump that much in the beginning or to
publish a high dispersion for such a long time.

To conclude my response, I still prefer my solution where PPS samples
are rejected when the corresponding SHM samples look like outliers
because they are more than 0.5 seconds off. Even if the SHM driver is
temporarily unstable the PPS is still stable, and its dispersion should
not be affected by the SHM driver. I could add the feature that when the
SHM driver itself is not selectable the alignment code is activated
again, as I suggested earlier in this mail.

Greetings,

Tjalling Hattink

