Re: [chrony-dev] Pathological behavior of chrony's clock discipline algorithm under starvation

On Thu, Jun 09, 2022 at 03:43:27PM +0000, Franke, Daniel wrote:
> What I believe happened is that scheduling delays were causing
> chrony's clock slews to get applied for more than double the
> intended time, so that the clock overshot and ended up off by more
> than it began, in the opposite direction. Because chrony applies
> large corrections more quickly and aggressively than small ones,
> this created a positive-feedback death spiral of increasingly large
> slews requiring decreasingly long delays to perpetuate the
> oscillation.

That sounds plausible to me. The minimum length of a slew is 1
second. The actual interval would need to be more than double the
intended one for the oscillations to amplify.
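
To illustrate the "more than double" threshold, here is a rough sketch of
a simplified model, not chronyd's actual code. It assumes each slew is
planned to cancel the whole current offset in `planned` seconds, that the
slew really runs for `stretch * planned` seconds, and that there is no
limit on the slew rate. The names `simulate_slews`, `stretch` and
`planned` are made up for this example. Under those assumptions each slew
multiplies the offset by (1 - stretch), so the magnitude grows only when
stretch > 2.

```python
def simulate_slews(initial_offset, stretch, steps, planned=1.0):
    """Return the clock offset (in seconds) before and after each slew.

    Simplified model (an assumption, not chronyd's algorithm): every slew
    is planned to cancel the current offset in `planned` seconds, but is
    actually applied for `stretch * planned` seconds.
    """
    offsets = [initial_offset]
    offset = initial_offset
    for _ in range(steps):
        rate = -offset / planned            # rate chosen to cancel the offset
        offset += rate * stretch * planned  # slew runs longer than planned
        offsets.append(offset)
    return offsets


if __name__ == "__main__":
    for stretch in (1.5, 2.0, 2.5):
        print(stretch, simulate_slews(0.001, stretch, 4))
```

With stretch = 1.5 the offset changes sign each time but shrinks, with 2.0
it keeps its magnitude, and with 2.5 it grows on every slew.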

Can a system where processes are delayed by more than 1 second still
be doing something useful? If it was a temporary issue, I'd expect
chronyd to recover.

> This is clearly not good behavior, and there are a couple ways it
> could be improved. On systems where `adjtimex` is available, clock
> slews can be performed by manipulating the `offset` field rather
> than `freq` and `tick` (just like ntpd does). This would hand off
> the job of stopping the slew at the appropriate time to the kernel
> and completely prevent this kind of overshoot.

The singleshot adjustment (aka adjtime()) on Linux is too slow (500
ppm) to be useful for chronyd. It is used on some other systems like
FreeBSD, where it can go up to 5000 ppm while the ntp_adjtime()
frequency is limited to 500 ppm.
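
To put those rates in perspective, the arithmetic can be sketched as
follows. The helper name `slew_duration` is made up for this example, and
it assumes the slew runs at a constant rate for its whole duration.

```python
def slew_duration(offset_s, rate_ppm):
    """Seconds needed to remove `offset_s` at a constant rate of `rate_ppm`."""
    return abs(offset_s) / (rate_ppm * 1e-6)


if __name__ == "__main__":
    for rate_ppm in (500, 5000):
        print(rate_ppm, "ppm:", slew_duration(0.1, rate_ppm), "s")
```

Removing a 100 ms offset takes about 200 seconds at 500 ppm and about
20 seconds at 5000 ppm.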

The ntp_adjtime() PLL offset could be used on Linux for slewing, and
some earlier versions of chronyd did that, but it has some issues that
are better avoided, even if that means the slew will overshoot when
chronyd is not able to stop it at the right time.

> On systems like OpenBSD where you only have `adjfreq` or similar (or
> everywhere, if you think my first suggestion is too extreme a
> change), chrony could at least detect the overshoot after the fact
> and temporarily back off on the maximum slew rate to prevent the
> oscillation from perpetuating. For example, if it's detected that
> any of the last several slews got applied for t seconds longer than
> intended, don't plan to apply the next slew for any less than k*t
> seconds, for some k>1.
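
The back-off proposed above could be sketched roughly like this. It is
only an illustration of the quoted idea, not code from chronyd. The class
name `SlewBackoff`, the history length and the default k = 2 are all
assumptions made for the example.

```python
from collections import deque


class SlewBackoff:
    """Track slew overruns and derive a minimum duration for the next slew."""

    def __init__(self, k=2.0, history=4, min_duration=1.0):
        if k <= 1.0:
            raise ValueError("k must be greater than 1")
        self.k = k
        self.min_duration = min_duration       # normal minimum slew length (s)
        self.overruns = deque(maxlen=history)  # overruns of the last few slews

    def record(self, planned, actual):
        """Remember by how many seconds a finished slew ran too long."""
        self.overruns.append(max(0.0, actual - planned))

    def next_min_duration(self):
        """Do not plan the next slew for less than k * (worst recent overrun)."""
        worst = max(self.overruns, default=0.0)
        return max(self.min_duration, self.k * worst)


if __name__ == "__main__":
    backoff = SlewBackoff(k=2.0, history=2)
    backoff.record(planned=1.0, actual=3.5)  # ran 2.5 s too long
    print(backoff.next_min_duration())
```

If the next slew is planned for D >= k*t seconds and is again delayed by
t seconds, it runs for at most (1 + 1/k) times its planned length, which
stays below double for any k > 1.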

It would need to avoid false positives, e.g. when the system is
suspended and resumed from disk/RAM. I'd prefer simplicity. There is
already some code detecting unexpected clock jumps larger than 10
seconds, which should reset almost everything, including the currently
running slew. Maybe the extra slew interval could be limited to those
10 seconds. I'll look into that.

Thanks,

-- 
Miroslav Lichvar

