[chrony-users] RTC Trimming Issues

[ Thread Index | Date Index | More chrony.tuxfamily.org/chrony-users Archives ]


Hello all,

I administer several hundred embedded systems that are running a custom live spin of Fedora and consequently chrony-1.26-5.20110831gitb088b7.fc16.i686 and all share a common configuration enforced by puppet.  Out of all of these, a little over a hundred run atop PC/104 hardware made by Advantech, model PCM-3353 and out of all of those I have a dozen or so that just cannot seem to get their hardware clocks set.  When we deployed these we did nothing to tune the BIOS, not even setting the date/time since I knew our NTP processes should be able to do this better.  Originally we were running ntpd, but switched to chronyd to follow Fedora's defaults; our RTC setting issues have persisted all this time.  I've checked the on-board batteries and found nothing unusual, checked BIOS versions and found all to be the same.

The problematic dozen devices power up with their clock set to December something 2003, which is apparently the epoch in use by this BIOS.  Most of the PCM-3353 boards boot up with the hardware clock set correctly (or as close as can be reasonably expected).  I'm trying to determine why the dozen systems cannot seem to do this correctly despite all things being seemingly equal.  Looking at one particular misfit, I see:

#  grep chronyd /var/log/messages
Dec 31 19:02:20 aos-000bab25fdfa chronyd[1434]: chronyd version 1.26-20110831gitb088b7 starting
Dec 31 19:02:20 aos-000bab25fdfa chronyd[1434]: Linux kernel major=3 minor=3 patch=0
Dec 31 19:02:20 aos-000bab25fdfa chronyd[1434]: hz=100 shift_hz=7 freq_scale=1.00000000 nominal_tick=10000 slew_delta_tick=833 max_tick_bias=1000                                                                                      
Dec 31 19:02:20 aos-000bab25fdfa chronyd[1434]: Frequency -243.568 +- 0.595 ppm read from /mnt/live/var/lib/chrony/drift                                                                                                                
Dec 31 19:02:29 aos-000bab25fdfa chronyd[1434]: Selected source 10.39.192.3
Dec 31 19:02:29 aos-000bab25fdfa chronyd[1434]: System clock wrong by 310142834.657615 seconds, adjustment started
Oct 29 10:49:43 aos-000bab25fdfa chronyd[1434]: System clock was stepped by 310142834.658 seconds
Oct 29 10:51:45 aos-000bab25fdfa chronyd[1434]: Selected source 10.1.0.97
Oct 29 10:54:06 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.577 seconds
Oct 29 10:56:21 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.578 seconds
Oct 29 11:09:05 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.579 seconds
Oct 29 11:24:20 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.580 seconds
Oct 29 11:39:38 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.581 seconds
Oct 29 11:54:14 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.583 seconds
Oct 29 12:09:21 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.583 seconds
Oct 29 12:24:21 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.584 seconds
Oct 29 12:39:34 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.586 seconds
Oct 29 12:54:32 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.587 seconds
Oct 29 13:09:27 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.588 seconds
Oct 29 13:24:28 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.588 seconds
Oct 29 13:39:34 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.589 seconds
Oct 29 13:54:33 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.589 seconds
Oct 29 14:09:32 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.589 seconds
Oct 29 14:24:27 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.590 seconds
Oct 29 14:39:35 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.591 seconds
Oct 29 14:54:33 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.591 seconds
Oct 29 15:09:31 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.591 seconds
Oct 29 15:24:33 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.591 seconds
Oct 29 15:39:37 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.591 seconds
Oct 29 15:54:27 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.591 seconds
Oct 29 16:09:32 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.591 seconds
Oct 29 16:24:33 aos-000bab25fdfa chronyd[1434]: Trimming RTC, error = -310142834.591 seconds


What strikes me as most bizarre is that chronyd sees the error and allegedly corrects the difference.  If that's true, why would it see the same correction being needed a few minutes later?  I have seen some of these do as above ("Trimming RTC, error = <BIG NUMBER> seconds") repeated for a few dozen times suddenly to be followed by "Trimming RTC, error = <SMALL NUMBER> seconds" where SMALL is right around one second.  Most, if not all, of the non-misfit PCM-3353 boards show "Trimming RTC, error = <SMALL NUMBER> seconds" and at one time I had assumed that just meant the RTC had drifted by a second again and was once again due for a trimming.  I'm now inclined to believe that maybe these are struggling to work correctly too, but they just need a small adjustment now, rather than a big one.  For example, this one:

#  grep chrony /var/log/messages
Oct 18 10:16:28 aos-000bab25e0f8 chronyd[1414]: chronyd version 1.26-20110831gitb088b7 starting
Oct 18 10:16:28 aos-000bab25e0f8 chronyd[1414]: Linux kernel major=3 minor=3 patch=0
Oct 18 10:16:28 aos-000bab25e0f8 chronyd[1414]: hz=100 shift_hz=7 freq_scale=1.00000000 nominal_tick=10000 slew_delta_tick=833 max_tick_bias=1000
Oct 18 10:16:28 aos-000bab25e0f8 chronyd[1414]: Frequency -7.062 +- 0.093 ppm read from /mnt/live/var/lib/chrony/drift
Oct 18 10:16:36 aos-000bab25e0f8 chronyd[1414]: Selected source 10.11.192.3
Oct 18 10:16:36 aos-000bab25e0f8 chronyd[1414]: System clock wrong by -0.734176 seconds, adjustment started
Oct 18 09:18:38 aos-000bab25e0f8 chronyd[1414]: Selected source 10.1.0.97
Oct 18 09:22:40 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error = 1.171 seconds
Oct 19 05:52:31 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error = 1.006 seconds
Oct 19 20:52:38 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error = 1.005 seconds
Oct 20 11:52:31 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error = 1.011 seconds
Oct 21 04:37:37 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error = 1.001 seconds
Oct 21 15:22:29 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error = 1.014 seconds
Oct 22 01:07:36 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error = 1.017 seconds
Oct 22 22:37:36 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error = 1.016 seconds
Oct 23 08:22:30 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error = 1.006 seconds
Oct 23 17:37:40 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error = 1.010 seconds
Oct 24 11:37:37 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error = 1.003 seconds
Oct 25 06:52:31 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error = 1.008 seconds
Oct 25 14:37:37 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error = 1.009 seconds
Oct 26 07:07:37 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error = 1.011 seconds
Oct 27 00:37:35 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error = 1.009 seconds
Oct 27 20:22:30 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error = 1.011 seconds
Oct 28 03:52:31 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error = 1.009 seconds
Oct 28 13:22:31 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error = 1.012 seconds
Oct 29 02:37:37 aos-000bab25e0f8 chronyd[1414]: Trimming RTC, error = 1.015 seconds

We've been tracking the misfit boards a little more closely as of late (the problem has been ongoing for a few years) and it seems to be the same dozen which naturally makes me suspect of that hardware, but as I've said the batteries check out fine and everything else to our eyes is identical and these boards work fine otherwise.  The problem spans several Fedora releases and I can say for certain we saw the problem with chrony-1.24-4.20100428git73d775.fc13.i686 also.  The problem was also present with ntpd, although I don't recall if ntpd tries to fix the RTC as does chronyd.  I think our config is sane because most devices have no such problems.

I've also tried tackling this with a cron job that has this at its core:

adjust_rtc() {
    chronyc <<EOF
password $(awk '{print $2}' /etc/chrony.keys)
trimrtc
EOF
    echo 'hardware RTC adjusted to system clock'
}

This function is only called if the system clock is synchronized, which I determine by "chronyc -n sources" showing an asterisk before a time-servers IP address.

Any ideas how to get these RTCs to do the right thing and join us in this decade?

--
John Florian




Mail converted by MHonArc 2.6.19+ http://listengine.tuxfamily.org/