[chrony-users] RTC Trimming Issues |
[ Thread Index |
Date Index
| More chrony.tuxfamily.org/chrony-users Archives
]
Hello all,
I administer several hundred embedded
systems that are running a custom live spin of Fedora and consequently
chrony-1.26-5.20110831gitb088b7.fc16.i686 and all share a common configuration
enforced by puppet. Out of all of these, a little over a hundred
run atop PC/104 hardware made by Advantech, model PCM-3353 and out of all
of those I have a dozen or so that just cannot seem to get their hardware
clocks set. When we deployed these we did nothing to tune the BIOS,
not even setting the date/time since I knew our NTP processes should be
able to do this better. Originally we were running ntpd, but switched
to chronyd to follow Fedora's defaults; our RTC setting issues have persisted
all this time. I've checked the on-board batteries and found nothing
unusual, checked BIOS versions and found all to be the same.
The problematic dozen devices power
up with their clock set to December something 2003, which is apparently
the epoch in use by this BIOS. Most of the PCM-3353 boards boot up
with the hardware clock set correctly (or as close as can be reasonably
expected). I'm trying to determine why the dozen systems cannot seem
to do this correctly despite all things being seemingly equal. Looking
at one particular misfit, I see:
# grep chronyd /var/log/messages
Dec 31 19:02:20 aos-000bab25fdfa chronyd[1434]:
chronyd version 1.26-20110831gitb088b7 starting
Dec 31 19:02:20 aos-000bab25fdfa chronyd[1434]:
Linux kernel major=3 minor=3 patch=0
Dec 31 19:02:20 aos-000bab25fdfa chronyd[1434]:
hz=100 shift_hz=7 freq_scale=1.00000000 nominal_tick=10000 slew_delta_tick=833
max_tick_bias=1000
Dec 31 19:02:20 aos-000bab25fdfa chronyd[1434]:
Frequency -243.568 +- 0.595 ppm read from /mnt/live/var/lib/chrony/drift
Dec 31 19:02:29 aos-000bab25fdfa chronyd[1434]:
Selected source 10.39.192.3
Dec 31 19:02:29 aos-000bab25fdfa chronyd[1434]:
System clock wrong by 310142834.657615 seconds, adjustment started
Oct 29 10:49:43 aos-000bab25fdfa chronyd[1434]:
System clock was stepped by 310142834.658 seconds
Oct 29 10:51:45 aos-000bab25fdfa chronyd[1434]:
Selected source 10.1.0.97
Oct 29 10:54:06 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.577 seconds
Oct 29 10:56:21 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.578 seconds
Oct 29 11:09:05 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.579 seconds
Oct 29 11:24:20 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.580 seconds
Oct 29 11:39:38 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.581 seconds
Oct 29 11:54:14 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.583 seconds
Oct 29 12:09:21 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.583 seconds
Oct 29 12:24:21 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.584 seconds
Oct 29 12:39:34 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.586 seconds
Oct 29 12:54:32 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.587 seconds
Oct 29 13:09:27 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.588 seconds
Oct 29 13:24:28 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.588 seconds
Oct 29 13:39:34 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.589 seconds
Oct 29 13:54:33 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.589 seconds
Oct 29 14:09:32 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.589 seconds
Oct 29 14:24:27 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.590 seconds
Oct 29 14:39:35 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.591 seconds
Oct 29 14:54:33 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.591 seconds
Oct 29 15:09:31 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.591 seconds
Oct 29 15:24:33 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.591 seconds
Oct 29 15:39:37 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.591 seconds
Oct 29 15:54:27 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.591 seconds
Oct 29 16:09:32 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.591 seconds
Oct 29 16:24:33 aos-000bab25fdfa chronyd[1434]:
Trimming RTC, error = -310142834.591 seconds
What strikes me as most bizarre is that
chronyd sees the error and allegedly corrects the difference. If
that's true, why would it see the same correction being needed a few minutes
later? I have seen some of these do as above ("Trimming RTC,
error = <BIG NUMBER> seconds") repeated for a few dozen times
suddenly to be followed by "Trimming RTC, error = <SMALL NUMBER>
seconds" where SMALL is right around one second. Most, if not
all, of the non-misfit PCM-3353 boards show "Trimming RTC, error =
<SMALL NUMBER> seconds" and at one time I had assumed that just
meant the RTC had drifted by a second again and was once again due for
a trimming. I'm now inclined to believe that maybe these are struggling
to work correctly too, but they just need a small adjustment now, rather
than a big one. For example, this one:
# grep chrony /var/log/messages
Oct 18 10:16:28 aos-000bab25e0f8 chronyd[1414]:
chronyd version 1.26-20110831gitb088b7 starting
Oct 18 10:16:28 aos-000bab25e0f8 chronyd[1414]:
Linux kernel major=3 minor=3 patch=0
Oct 18 10:16:28 aos-000bab25e0f8 chronyd[1414]:
hz=100 shift_hz=7 freq_scale=1.00000000 nominal_tick=10000 slew_delta_tick=833
max_tick_bias=1000
Oct 18 10:16:28 aos-000bab25e0f8 chronyd[1414]:
Frequency -7.062 +- 0.093 ppm read from /mnt/live/var/lib/chrony/drift
Oct 18 10:16:36 aos-000bab25e0f8 chronyd[1414]:
Selected source 10.11.192.3
Oct 18 10:16:36 aos-000bab25e0f8 chronyd[1414]:
System clock wrong by -0.734176 seconds, adjustment started
Oct 18 09:18:38 aos-000bab25e0f8 chronyd[1414]:
Selected source 10.1.0.97
Oct 18 09:22:40 aos-000bab25e0f8 chronyd[1414]:
Trimming RTC, error = 1.171 seconds
Oct 19 05:52:31 aos-000bab25e0f8 chronyd[1414]:
Trimming RTC, error = 1.006 seconds
Oct 19 20:52:38 aos-000bab25e0f8 chronyd[1414]:
Trimming RTC, error = 1.005 seconds
Oct 20 11:52:31 aos-000bab25e0f8 chronyd[1414]:
Trimming RTC, error = 1.011 seconds
Oct 21 04:37:37 aos-000bab25e0f8 chronyd[1414]:
Trimming RTC, error = 1.001 seconds
Oct 21 15:22:29 aos-000bab25e0f8 chronyd[1414]:
Trimming RTC, error = 1.014 seconds
Oct 22 01:07:36 aos-000bab25e0f8 chronyd[1414]:
Trimming RTC, error = 1.017 seconds
Oct 22 22:37:36 aos-000bab25e0f8 chronyd[1414]:
Trimming RTC, error = 1.016 seconds
Oct 23 08:22:30 aos-000bab25e0f8 chronyd[1414]:
Trimming RTC, error = 1.006 seconds
Oct 23 17:37:40 aos-000bab25e0f8 chronyd[1414]:
Trimming RTC, error = 1.010 seconds
Oct 24 11:37:37 aos-000bab25e0f8 chronyd[1414]:
Trimming RTC, error = 1.003 seconds
Oct 25 06:52:31 aos-000bab25e0f8 chronyd[1414]:
Trimming RTC, error = 1.008 seconds
Oct 25 14:37:37 aos-000bab25e0f8 chronyd[1414]:
Trimming RTC, error = 1.009 seconds
Oct 26 07:07:37 aos-000bab25e0f8 chronyd[1414]:
Trimming RTC, error = 1.011 seconds
Oct 27 00:37:35 aos-000bab25e0f8 chronyd[1414]:
Trimming RTC, error = 1.009 seconds
Oct 27 20:22:30 aos-000bab25e0f8 chronyd[1414]:
Trimming RTC, error = 1.011 seconds
Oct 28 03:52:31 aos-000bab25e0f8 chronyd[1414]:
Trimming RTC, error = 1.009 seconds
Oct 28 13:22:31 aos-000bab25e0f8 chronyd[1414]:
Trimming RTC, error = 1.012 seconds
Oct 29 02:37:37 aos-000bab25e0f8 chronyd[1414]:
Trimming RTC, error = 1.015 seconds
We've been tracking the misfit boards
a little more closely as of late (the problem has been ongoing for a few
years) and it seems to be the same dozen which naturally makes me suspect
of that hardware, but as I've said the batteries check out fine and everything
else to our eyes is identical and these boards work fine otherwise. The
problem spans several Fedora releases and I can say for certain we saw
the problem with chrony-1.24-4.20100428git73d775.fc13.i686 also. The
problem was also present with ntpd, although I don't recall if ntpd tries
to fix the RTC as does chronyd. I think our config is sane because
most devices have no such problems.
I've also tried tackling this with a
cron job that has this at its core:
adjust_rtc() {
chronyc <<EOF
password $(awk '{print $2}' /etc/chrony.keys)
trimrtc
EOF
echo 'hardware RTC adjusted
to system clock'
}
This function is only called if the
system clock is synchronized, which I determine by "chronyc -n sources"
showing an asterisk before a time-servers IP address.
Any ideas how to get these RTCs to do
the right thing and join us in this decade?
--
John Florian