Re: [chrony-users] RTC Trimming Issues



Ed W <lists@xxxxxxxxxxxxxx> wrote on 10/31/2012 08:54:46:
>
> On 30/10/2012 13:54, John.Florian@xxxxxxxx wrote:

>
> > - Can you set the correct (ish) time with hwclock?
>
> This looks broken too:
>
> # date
> Tue Oct 30 09:28:01 EDT 2012
> # hwclock --systohc
> hwclock: select() to /dev/rtc to wait for clock tick timed out: Success
> # hwclock --show
> hwclock: select() to /dev/rtc to wait for clock tick timed out: Success
>
> > When you reboot does it stay set?
>
> Given the above results, I suspect this is irrelevant, but just for
> grins, I rebooted it and found a "surprise inside":

>
> Hi, I see *exactly* your problems with my PCEngines Alix 2D3 boards.
> About 30% of boards seem to boot and the RTC driver seems borked,
> therefore you can't set the clock with chrony, but a *warm* reboot
> and the RTC starts correctly next time and hence you can set it, and
> after that it all seems to work...


That does seem similar, but I'm not sure they're really the same.  Our problem boards (again, most boards of the same model work just fine) grumble or report errors when setting the RTC, as if the operation had failed, but then after a reboot everything suddenly looks good.  I don't have a clear picture yet on warm vs. cold booting.  When I reboot them for my own testing, it's always a warm reboot because they're remote and that's all I've got.  However, these boards do routinely get powered down (without any formal shutdown process) because they're attached to the power bus of production machinery, which gets powered off as needed.


> I haven't really debugged this, but:
> - Boards have cmos batteries
> - bolloxed boards show a time which looks feasibly like 2000 + some
> number of days since battery inserted, ie RTC is kind of working


I just gathered some details.  We took delivery of the boards around Feb. 2009, after waiting for the mfr. to build them to order, so the batteries have been connected for almost 4 years, and our clocks are init'ing at Dec. 2003.  That fits your "2000 + some number of days since battery inserted" pattern.  Oh man!  What a clue!
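The arithmetic behind that clue can be sketched in a few lines of Python. This is only an illustration: the year-2000 starting point is Ed's guess from the quoted text, the helper name `expected_rtc_date` is made up here, and 1 Feb 2009 is a stand-in for "around Feb. 2009".

```python
from datetime import date

# Assumption (from Ed's observation, not verified against the RTC chip's
# datasheet): a never-set RTC starts counting from 2000-01-01 when the
# battery is connected.
ASSUMED_RTC_EPOCH = date(2000, 1, 1)


def expected_rtc_date(battery_connected: date, today: date,
                      rtc_epoch: date = ASSUMED_RTC_EPOCH) -> date:
    """Date a never-set RTC would show after free-running since battery_connected."""
    elapsed = today - battery_connected
    return rtc_epoch + elapsed


# Hypothetical inputs: delivery "around Feb. 2009", checked on 30 Oct 2012.
print(expected_rtc_date(date(2009, 2, 1), date(2012, 10, 30)))
```

With these inputs the result lands in 2003. A battery fitted some months before delivery would push the result later in 2003, which is at least consistent with clocks that come up around Dec. 2003.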

Here's where I get to reveal some newly discovered weirdness, though.  This may just prove that I don't understand exactly what 'rtcdata' is doing, but in the interaction below the expected 1970 value suddenly appeared, whereas I usually get something in 2003:

# chronyc
chrony version 1.26-20110831gitb088b7
Copyright (C) 1997-2003, 2007, 2009-2011 Richard P. Curnow and others
chrony comes with ABSOLUTELY NO WARRANTY.  This is free software, and
you are welcome to redistribute it under certain conditions.  See the
GNU General Public License version 2 for details.

chronyc> password SuperSecretSquirrelSauce
200 OK
chronyc> rtcdata
RTC ref time (UTC) : Wed Jan  1 00:08:31 2003
Number of samples  : 0
Number of runs     : 5
Sample span period :    0
RTC is fast by     : -310221952.000000 seconds
RTC gains time at  :    16.899 ppm
chronyc> exit
# systemctl stop chronyd.service
# hwclock
hwclock: select() to /dev/rtc to wait for clock tick timed out: Success
# systemctl start chronyd.service
# hwclock
Tue 30 Oct 2012 10:46:02 AM EDT  -0.997601 seconds
# chronyc
chrony version 1.26-20110831gitb088b7
Copyright (C) 1997-2003, 2007, 2009-2011 Richard P. Curnow and others
chrony comes with ABSOLUTELY NO WARRANTY.  This is free software, and
you are welcome to redistribute it under certain conditions.  See the
GNU General Public License version 2 for details.

chronyc> password SuperSecretSquirrelSauce
200 OK
chronyc> rtcdata
RTC ref time (UTC) : Thu Jan  1 00:00:00 1970
Number of samples  : 0
Number of runs     : 0
Sample span period :    0
RTC is fast by     :     0.000000 seconds
RTC gains time at  :     0.000 ppm
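As a sanity check on the first 'rtcdata' output above, the reported offset can be added back to the RTC reference time. This is a rough sketch that assumes "RTC is fast by" means RTC reading minus true time at the reference time (a reading of the label, not checked against the chrony source); the function name `implied_true_time` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the first rtcdata output in this message.
RTC_REF = datetime(2003, 1, 1, 0, 8, 31, tzinfo=timezone.utc)
FAST_BY_SECONDS = -310221952.0  # negative: the RTC is behind


def implied_true_time(rtc_ref: datetime, fast_by_seconds: float) -> datetime:
    """True time implied by an RTC reading and its offset.

    Assumes fast_by = rtc - true, so true = rtc - fast_by.
    """
    return rtc_ref - timedelta(seconds=fast_by_seconds)


true_time = implied_true_time(RTC_REF, FAST_BY_SECONDS)
years_behind = -FAST_BY_SECONDS / (365.25 * 86400)
print(true_time.isoformat(), f"(RTC about {years_behind:.2f} years behind)")
```

Under that assumption the implied true time falls on 30 Oct 2012, the day the session was captured, so the large negative offset is simply the gap between an RTC sitting in early 2003 and the real date.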


> - lots of clock based problems on boot, rtc not working, seems to
> cascade in strange ways and then networking won't start and lots of
> cascading issues. Not debugged the chain of events though. I think
> only a warm reboot gets RTC working, and then it can be set
>
> - *Feels* like some RTC are in some invalid state on boot and the
> kernel driver pukes
> - Warm reboot allows the kernel rtc driver to start
> - Setting "something" in the RTC then makes it work correctly there onwards...
>
>
> Therefore suspect some RTC clocks can be in an "initial" state which
> the kernel driver doesn't correctly handle and initialise. Probably
> worth filing some bug with kernel guys,


Yeah, I'm starting to get that impression.  I'm going to try and discover what chip is being used on these boards and see if I can learn of any known quirks with those.

> I think if you can bear to do a one off script when you get new
> machines to ensure a warm reboot and RTC reset, then they will all
> behave fine thereafter?


I had hoped for that.  I have actually had a similar solution embedded in their init scripts for several years now, and it seems to only "paper over the problem".  I believe it reduces the number of boards I see borked, but there always seem to be some that are borked.

Yesterday I manually went through a bunch of boards needing the "2003 trim".  All acted like the repair did no good until I warm rebooted them, and then all looked good.  Today I see that one I had corrected is now borked again!  I guess I have to expect this; otherwise my init script solution should have repaired them all by now, even if it only worked like 1 out of 1000 times.
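For anyone wanting to spot boards in this state automatically, here is a minimal sketch of a probe. It is not the init-script fix mentioned above. It assumes `hwclock --show` is available, and it treats a non-zero exit status or the "timed out" message seen in this thread as failure; the exit status hwclock actually returns in that case was not checked, which is why both are tested. The names `rtc_read_failed` and `probe_rtc` are hypothetical.

```python
import subprocess

# Message text copied from the hwclock failures shown in this thread.
TIMEOUT_MARKER = "select() to /dev/rtc to wait for clock tick timed out"


def rtc_read_failed(returncode: int, output: str) -> bool:
    """Classify one hwclock run: True if it looks like the RTC was unreadable."""
    return returncode != 0 or TIMEOUT_MARKER in output


def probe_rtc(command=("hwclock", "--show"), timeout_s: float = 15.0) -> bool:
    """Return True if the RTC could be read, False otherwise.

    A missing binary or a hung command also counts as a failed read.
    """
    try:
        result = subprocess.run(list(command), capture_output=True,
                                text=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return not rtc_read_failed(result.returncode,
                               result.stdout + result.stderr)
```

A boot-time script could call `probe_rtc()` and log or flag the board when it returns False. What to do next (warm reboot, re-set the RTC) is left open, since the thread has not yet established which step actually fixes the boards.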

> If you find this then please document here
> for benefit of all?  


Certainly will if I can nail this down concretely.

--
John Florian


