Re: [chrony-dev] Pathological behavior of chrony's clock discipline algorithm under starvation
- To: "chrony-dev@xxxxxxxxxxxxxxxxxxxx" <chrony-dev@xxxxxxxxxxxxxxxxxxxx>
- Subject: Re: [chrony-dev] Pathological behavior of chrony's clock discipline algorithm under starvation
- From: "Franke, Daniel" <dff@xxxxxxxxxx>
- Date: Thu, 9 Jun 2022 17:32:50 +0000
Setting the `offset` field in adjtimex doesn't step the clock; it just lets the kernel take charge of doing the slew. `offset` and `freq` can be used in conjunction with each other, with the former correcting offsets and the latter correcting actual frequency errors. Nonetheless, as I said in my original message, I understand that this would be a fairly dramatic and somewhat non-portable change in chrony's architecture, which is why I offered the simpler and adequate alternative of after-the-fact detection and backoff.
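
To illustrate the detection-and-backoff alternative, here is a minimal toy simulation in Python. It is a sketch under stated assumptions, not chrony code: the function `simulate`, the parameter `base_duration`, and the planning rule (cancel the whole offset in one slew, never faster than the rate cap) are hypothetical simplifications. The only values taken from this thread are the 83333 ppm rate limit and the rule "do not plan the next slew for less than k*t seconds" after an overrun of t seconds.

```python
MAX_SLEW_RATE = 83333e-6  # 83333 ppm limit quoted in the thread, as a fraction


def simulate(offset, overrun, k=None, base_duration=1.0, steps=8):
    """Toy model: each slew runs `overrun` seconds longer than planned.

    offset        -- initial clock error in seconds
    overrun       -- scheduling delay t added to every slew (assumed constant)
    k             -- backoff factor; None disables the backoff (hypothetical knob)
    base_duration -- shortest slew the model plans on its own (hypothetical)
    Returns the list of clock errors after each slew.
    """
    history = [offset]
    observed_overrun = 0.0  # nothing has been detected before the first slew
    for _ in range(steps):
        # Backoff rule: never plan a slew shorter than k * (last observed overrun).
        floor = k * observed_overrun if k is not None else 0.0
        # The rate cap also forces a minimum duration for large offsets.
        planned = max(base_duration, abs(offset) / MAX_SLEW_RATE, floor)
        rate = -offset / planned              # rate intended to cancel the offset
        offset += rate * (planned + overrun)  # slew is applied for too long
        observed_overrun = overrun            # detected after the fact
        history.append(offset)
    return history


if __name__ == "__main__":
    print("no backoff:", simulate(0.01, overrun=5.0))
    print("k = 2     :", simulate(0.01, overrun=5.0, k=2.0))
```

In this model each slew leaves a residual error of -offset * t / planned. Without the backoff, the error flips sign and grows until the rate cap is reached, then keeps oscillating with magnitude MAX_SLEW_RATE * t. With the backoff, planned is at least k*t, so after the first overrun is detected each residual is at most 1/k of the previous error, which is why k > 1 is enough to damp the oscillation.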
On 6/9/22, 1:20 PM, "Bill Unruh" <unruh@xxxxxxxxxxxxxx> wrote:
The wild oscillation that you saw occurred in pathological situations,
where the clock readings from the system were probably way, way out from the
actual time or the timers became swamped, because of the swapping completely dominating the computer.
Chrony does not suffer from wild oscillation normally. When chrony crashed, it
could no longer stop the clock's drift due to corrections to nonexistent bad
times due to swapping. For chrony to duplicate badly (as ntpd does) a change
in the drift by trying to mimic a drift by constantly stepping the clock is
surely not a way forward. Yes, some operating systems do not have a kernel
way of changing the rate of a clock and one has to do it badly. Surely it is
not a good idea to do it badly on one that does. Surely the key purpose of
chrony is to do the best job possible on a system which is working, not
subsume everything to fixing problems which may arise in a badly set up system
(and a system which goes into terminal swapping is badly set up).
William G. Unruh __| Canadian Institute for|____ Tel: +1(604)822-3273
Physics&Astronomy _|___ Advanced Research _|____ Fax: +1(604)822-5324
UBC, Vancouver,BC _|_ Program in Cosmology |____ unruh@xxxxxxxxxxxxxx
Canada V6T 1Z1 ____|____ and Gravity ______|_ www.theory.physics.ubc.ca/
On Thu, 9 Jun 2022, Franke, Daniel wrote:
> The patch I referenced just fixes the crash. It doesn't fix the oscillations. Ongoing wild oscillations in clock frequency may not be as bad a problem as fast linear error accumulation, but they're still a problem.
>
> On 6/9/22, 12:22 PM, "Bill Unruh" <unruh@xxxxxxxxxxxxxx> wrote:
>
> You are suggesting "improvements" to a chrony misbehaviour that no longer
> exists in the newer versions. Use a newer version and see if you can duplicate
> the problem.
> Fixing nonexistent problems is sure to introduce new problems.
>
> On Thu, 9 Jun 2022, Franke, Daniel wrote:
>
> > I recently observed some pathological behavior by chrony on a system that was thrashing under memory pressure. The system was running an older version of chrony which didn't have https://git.tuxfamily.org/chrony/chrony.git/commit/?id=59e8b790341f344e07cb4d5124e7dc89de6665a1, and underwent a failure mode substantially identical to the one in Gruener's original report which motivated that patch. Chrony was configured with a short polling interval, the thrashing caused long delays in chrony getting scheduled, and the backup of timeout events triggered the false-positive infinite loop detection and chrony crashed. Running a fully-patched version of chrony would have prevented the crash, but what's interesting is what happened afterward: the clock drifted by several minutes over the course of less than an hour, suggesting that at the time of the crash, chrony was slewing the clock at a rate at or approaching the 83333ppm limit imposed by `maxslewrate`. What I believe happened is that scheduling delays were causing chrony's clock slews to get applied for more than double the intended time, so that the clock overshot and ended up off by more than it began, in the opposite direction. Because chrony applies large corrections more quickly and aggressively than small ones, this created a positive-feedback death spiral of increasingly large slews requiring decreasingly long delays to perpetuate the oscillation.
> >
> > This is clearly not good behavior, and there are a couple ways it could be improved. On systems where `adjtimex` is available, clock slews can be performed by manipulating the `offset` field rather than `freq` and `tick` (just like ntpd does). This would hand off the job of stopping the slew at the appropriate time to the kernel and completely prevent this kind of overshoot. On systems like OpenBSD where you only have `adjfreq` or similar (or everywhere, if you think my first suggestion is too extreme a change), chrony could at least detect the overshoot after the fact and temporarily back off on the maximum slew rate to prevent the oscillation from perpetuating. For example, if it's detected that any of the last several slews got applied for t seconds longer than intended, don't plan to apply the next slew for any less than k*t seconds, for some k>1.
> >
> >
>
>