[chrony-dev] Pathological behavior of chrony's clock discipline algorithm under starvation
- To: "chrony-dev@xxxxxxxxxxxxxxxxxxxx" <chrony-dev@xxxxxxxxxxxxxxxxxxxx>
- Subject: [chrony-dev] Pathological behavior of chrony's clock discipline algorithm under starvation
- From: "Franke, Daniel" <dff@xxxxxxxxxx>
- Date: Thu, 9 Jun 2022 15:43:27 +0000
- Accept-language: en-US
I recently observed some pathological behavior by chrony on a system that was thrashing under memory pressure. The system was running an older version of chrony which didn't have https://git.tuxfamily.org/chrony/chrony.git/commit/?id=59e8b790341f344e07cb4d5124e7dc89de6665a1, and underwent a failure mode substantially identical to the one in Gruener's original report which motivated that patch: chrony was configured with a short polling interval, the thrashing caused long delays in chrony getting scheduled, and the backlog of timeout events triggered the false-positive infinite-loop detection, crashing chrony.

Running a fully patched version of chrony would have prevented the crash, but what's interesting is what happened afterward: the clock drifted by several minutes over the course of less than an hour, suggesting that at the time of the crash chrony was slewing the clock at, or near, the 83333 ppm limit imposed by `maxslewrate`.

What I believe happened is that scheduling delays caused chrony's clock slews to be applied for more than double the intended time, so the clock overshot and ended up off by more than it began, in the opposite direction. Because chrony applies large corrections more quickly and aggressively than small ones, this created a positive-feedback death spiral: increasingly large slews requiring decreasingly long delays to perpetuate the oscillation.
This is clearly not good behavior, and there are a couple of ways it could be improved.

On systems where `adjtimex` is available, clock slews can be performed by manipulating the `offset` field rather than `freq` and `tick`, just like ntpd does. This hands the job of stopping the slew at the appropriate time off to the kernel and completely prevents this kind of overshoot.

On systems like OpenBSD where you only have `adjfreq` or similar (or everywhere, if you think the first suggestion is too extreme a change), chrony could at least detect the overshoot after the fact and temporarily back off the maximum slew rate so the oscillation can't perpetuate itself. For example, if any of the last several slews is detected to have been applied for t seconds longer than intended, don't plan to apply the next slew for any less than k*t seconds, for some k > 1.