Re: [chrony-dev] Pathological behavior of chrony's clock discipline algorithm under starvation
- To: "chrony-dev@xxxxxxxxxxxxxxxxxxxx" <chrony-dev@xxxxxxxxxxxxxxxxxxxx>
- Subject: Re: [chrony-dev] Pathological behavior of chrony's clock discipline algorithm under starvation
- From: "Franke, Daniel" <dff@xxxxxxxxxx>
- Date: Thu, 9 Jun 2022 16:41:37 +0000
The patch I referenced just fixes the crash. It doesn't fix the oscillations. Ongoing wild oscillations in clock frequency may not be as bad a problem as fast linear error accumulation, but they're still a problem.
On 6/9/22, 12:22 PM, "Bill Unruh" <unruh@xxxxxxxxxxxxxx> wrote:
You are suggesting "improvements" to a chrony misbehaviour that no longer
exists in the newer versions. Use a newer version and see if you can duplicate
the problem.
Fixing nonexistent problems is sure to introduce new problems.
William G. Unruh __| Canadian Institute for|____ Tel: +1(604)822-3273
Physics&Astronomy _|___ Advanced Research _|____ Fax: +1(604)822-5324
UBC, Vancouver,BC _|_ Program in Cosmology |____ unruh@xxxxxxxxxxxxxx
Canada V6T 1Z1 ____|____ and Gravity ______|_ www.theory.physics.ubc.ca/
On Thu, 9 Jun 2022, Franke, Daniel wrote:
> I recently observed some pathological behavior by chrony on a system that was thrashing under memory pressure. The system was running an older version of chrony that lacked https://git.tuxfamily.org/chrony/chrony.git/commit/?id=59e8b790341f344e07cb4d5124e7dc89de6665a1, and it underwent a failure mode substantially identical to the one in Gruener's original report, which motivated that patch. Chrony was configured with a short polling interval, the thrashing caused long delays in chrony getting scheduled, and the backlog of timeout events triggered the false-positive infinite loop detection, so chrony crashed. Running a fully patched version of chrony would have prevented the crash, but what's interesting is what happened afterward: the clock drifted by several minutes in less than an hour, suggesting that at the time of the crash, chrony was slewing the clock at a rate at or approaching the 83333 ppm limit imposed by `maxslewrate`.
>
> What I believe happened is that scheduling delays were causing chrony's clock slews to be applied for more than double the intended time, so that the clock overshot and ended up off by more than it began, in the opposite direction. Because chrony applies large corrections more quickly and aggressively than small ones, this created a positive-feedback death spiral of increasingly large slews requiring decreasingly long delays to perpetuate the oscillation.
>
> This is clearly not good behavior, and there are a couple of ways it could be improved. On systems where `adjtimex` is available, clock slews can be performed by manipulating the `offset` field rather than `freq` and `tick` (as ntpd does). This would hand off the job of stopping the slew at the appropriate time to the kernel and completely prevent this kind of overshoot. On systems like OpenBSD where you only have `adjfreq` or similar (or everywhere, if you think my first suggestion is too extreme a change), chrony could at least detect the overshoot after the fact and temporarily back off on the maximum slew rate to prevent the oscillation from perpetuating. For example, if it's detected that any of the last several slews was applied for t seconds longer than intended, don't plan to apply the next slew for any less than k*t seconds, for some k>1.
>
>
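For readers following the thread, the overshoot mechanism and the proposed k*t backoff can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not chrony's actual slewing code: `plan_slew`, `simulate`, `base_duration`, and `history` are hypothetical names, each slew is assumed to run a fixed number of seconds too long, and the only value taken from the message is the 83333 ppm rate limit.

```python
from collections import deque

# Rate cap from the message (83333 ppm), expressed as a fraction.
MAX_SLEW = 83333e-6


def plan_slew(offset, min_duration, max_rate=MAX_SLEW):
    """Return (rate, duration) that would exactly cancel `offset` seconds.

    Hypothetical planner: the slew lasts at least `min_duration`, and
    longer if the rate cap would otherwise be exceeded.
    """
    duration = max(min_duration, abs(offset) / max_rate)
    rate = -offset / duration
    return rate, duration


def simulate(offset, delays, base_duration=1.0, k=None, history=4):
    """Toy loop in which each planned slew runs `delay` seconds too long.

    With `k` set, the planned duration is never shorter than k times the
    worst overrun seen in the last `history` slews (the backoff idea
    quoted above). With k=None, no backoff is applied.
    """
    recent = deque(maxlen=history)  # overruns of the most recent slews
    trace = [offset]
    for delay in delays:
        floor = base_duration
        if k is not None and recent:
            floor = max(floor, k * max(recent))
        rate, duration = plan_slew(offset, floor)
        # The slew is stopped late, so it is applied for duration + delay.
        offset += rate * (duration + delay)
        recent.append(delay)
        trace.append(offset)
    return trace


if __name__ == "__main__":
    delays = [3.0] * 6  # assumed: every slew is stopped 3 s late
    print("no backoff:", simulate(0.001, delays))
    print("k = 2     :", simulate(0.001, delays, k=2.0))
```

In this model a slew that overruns by more than its planned duration flips the sign of the error and enlarges it, so the error grows until the rate cap is reached. With the backoff enabled, each overshoot after the first is smaller than the error that preceded it, so the oscillation decays instead of growing.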