Hello all,
I administer several hundred embedded systems that are running a custom
live spin of Fedora and consequently
chrony-1.26-5.20110831gitb088b7.fc16.i686 and all share a common
configuration enforced by puppet. Out of all of these, a little over a
hundred run atop PC/104 hardware made by Advantech, model PCM-3353 and out
of all of those I have a dozen or so that just cannot seem to get their
hardware clocks set. When we deployed these we did nothing to tune the
BIOS, not even setting the date/time since I knew our NTP processes should
be able to do this better. Originally we were running ntpd, but switched
to chronyd to follow Fedora's defaults; our RTC setting issues have
persisted all this time. I've checked the on-board batteries and found
nothing unusual, checked BIOS versions and found all to be the same.
The problematic dozen devices power up with their clock set to December
something 2003, which is apparently the epoch in use by this BIOS. Most
of the PCM-3353 boards boot up with the hardware clock set correctly (or
as close as can be reasonably expected). I'm trying to determine why the
dozen systems cannot seem to do this correctly despite all things being
seemingly equal. Looking at one particular misfit, I see:
# grep chronyd /var/log/messages
Dec 31 19:02:20 aos-000bab25fdfa chronyd[1434]: chronyd version
1.26-20110831gitb088b7 starting
Dec 31 19:02:20 aos-000bab25fdfa chronyd[1434]: Linux kernel major=3
minor=3 patch=0
Dec 31 19:02:20 aos-000bab25fdfa chronyd[1434]: hz=100 shift_hz=7
freq_scale=1.00000000 nominal_tick=10000 slew_delta_tick=833
max_tick_bias=1000
Dec 31 19:02:20 aos-000bab25fdfa chronyd[1434]: Frequency -243.568 +-
0.595 ppm read from /mnt/live/var/lib/chrony/drift
Dec 31 19:02:29 aos-000bab25fdfa chronyd[1434]: Selected source
10.39.192.3
Dec 31 19:02:29 aos-000bab25fdfa chronyd[1434]: System clock wrong by
310142834.657615 seconds, adjustment started
Oct 29 10:49:43 aos-000bab25fdfa chronyd[1434]: System clock was stepped
by 310142834.658 seconds
Oct 29 10:51:45 aos-000bab25fdfa chronyd[1434]: Selected source 10.1.0.97
Oct 29 10:54:06 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.577 seconds
Oct 29 10:56:21 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.578 seconds
Oct 29 11:09:05 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.579 seconds
Oct 29 11:24:20 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.580 seconds
Oct 29 11:39:38 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.581 seconds
Oct 29 11:54:14 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.583 seconds
Oct 29 12:09:21 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.583 seconds
Oct 29 12:24:21 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.584 seconds
Oct 29 12:39:34 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.586 seconds
Oct 29 12:54:32 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.587 seconds
Oct 29 13:09:27 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.588 seconds
Oct 29 13:24:28 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.588 seconds
Oct 29 13:39:34 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.589 seconds
Oct 29 13:54:33 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.589 seconds
Oct 29 14:09:32 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.589 seconds
Oct 29 14:24:27 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.590 seconds
Oct 29 14:39:35 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.591 seconds
Oct 29 14:54:33 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.591 seconds
Oct 29 15:09:31 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.591 seconds
Oct 29 15:24:33 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.591 seconds
Oct 29 15:39:37 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.591 seconds
Oct 29 15:54:27 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.591 seconds
Oct 29 16:09:32 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.591 seconds
Oct 29 16:24:33 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error =
-310142834.591 seconds
What strikes me as most bizarre is that chronyd sees the error and
allegedly corrects the difference. If that's true, why would it see the
same correction being needed a few minutes later? I have seen some of
these do as above ("Trimming RTC, error = <BIG NUMBER> seconds") repeated
for a few dozen times suddenly to be followed by "Trimming RTC, error =
<SMALL NUMBER> seconds" where SMALL is right around one second. Most, if
not all, of the non-misfit PCM-3353 boards show "Trimming RTC, error =
<SMALL NUMBER> seconds" and at one time I had assumed that just meant the
RTC had drifted by a second again and was once again due for a trimming.
I'm now inclined to believe that maybe these are struggling to work
correctly too, but they just need a small adjustment now, rather than a
big one. For example, this one:
# grep chrony /var/log/messages
Oct 18 10:16:28 aos-000bab25e0f8 chronyd[1414]: chronyd version
1.26-20110831gitb088b7 starting
Oct 18 10:16:28 aos-000bab25e0f8 chronyd[1414]: Linux kernel major=3
minor=3 patch=0
Oct 18 10:16:28 aos-000bab25e0f8 chronyd[1414]: hz=100 shift_hz=7
freq_scale=1.00000000 nominal_tick=10000 slew_delta_tick=833
max_tick_bias=1000
Oct 18 10:16:28 aos-000bab25e0f8 chronyd[1414]: Frequency -7.062 +- 0.093
ppm read from /mnt/live/var/lib/chrony/drift
Oct 18 10:16:36 aos-000bab25e0f8 chronyd[1414]: Selected source
10.11.192.3
Oct 18 10:16:36 aos-000bab25e0f8 chronyd[1414]: System clock wrong by
-0.734176 seconds, adjustment started
Oct 18 09:18:38 aos-000bab25e0f8 chronyd[1414]: Selected source 10.1.0.97
Oct 18 09:22:40 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error =
1.171 seconds
Oct 19 05:52:31 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error =
1.006 seconds
Oct 19 20:52:38 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error =
1.005 seconds
Oct 20 11:52:31 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error =
1.011 seconds
Oct 21 04:37:37 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error =
1.001 seconds
Oct 21 15:22:29 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error =
1.014 seconds
Oct 22 01:07:36 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error =
1.017 seconds
Oct 22 22:37:36 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error =
1.016 seconds
Oct 23 08:22:30 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error =
1.006 seconds
Oct 23 17:37:40 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error =
1.010 seconds
Oct 24 11:37:37 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error =
1.003 seconds
Oct 25 06:52:31 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error =
1.008 seconds
Oct 25 14:37:37 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error =
1.009 seconds
Oct 26 07:07:37 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error =
1.011 seconds
Oct 27 00:37:35 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error =
1.009 seconds
Oct 27 20:22:30 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error =
1.011 seconds
Oct 28 03:52:31 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error =
1.009 seconds
Oct 28 13:22:31 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error =
1.012 seconds
Oct 29 02:37:37 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error =
1.015 seconds
We've been tracking the misfit boards a little more closely as of late
(the problem has been ongoing for a few years) and it seems to be the same
dozen which naturally makes me suspect of that hardware, but as I've said
the batteries check out fine and everything else to our eyes is identical
and these boards work fine otherwise. The problem spans several Fedora
releases and I can say for certain we saw the problem with
chrony-1.24-4.20100428git73d775.fc13.i686 also. The problem was also
present with ntpd, although I don't recall if ntpd tries to fix the RTC as
does chronyd. I think our config is sane because most devices have no
such problems.
I've also tried tackling this with a cron job that has this at its core:
adjust_rtc() {
chronyc <<EOF
password $(awk '{print $2}' /etc/chrony.keys)
trimrtc
EOF
echo 'hardware RTC adjusted to system clock'
}
This function is only called if the system clock is synchronized, which I
determine by "chronyc -n sources" showing an asterisk before a
time-servers IP address.
Any ideas how to get these RTCs to do the right thing and join us in this
decade?
--
John Florian