Re: [chrony-users] chrony configuration help (please)

[ Thread Index | Date Index | More chrony.tuxfamily.org/chrony-users Archives ]



*UPDATE*
This gets weirder each time I test it.

This is console output:

....
adding dns 192.168.7.254
Starting lighttpd: OK
Starting sshd: OK
date
Wed Dec 28 23:49:16 CET 2005
/ # date
Wed Aug  5 09:02:20 CEST 2015

<---- here the system froze and I blindly typed "<up-arrow><CR>" (notice freezing does NOT happen immediately after "System clock was stepped by 302947914.986165 seconds" ---->

/ # date
Wed Aug  5 09:08:40 CEST 2015
/ #

<---- here system is alive again (6+ minutes after!!!) ---->
....

Logs read as:

# tail -f /var/log/messages
Dec 28 23:46:20 ok-cash daemon.info avahi-daemon[555]: Registering HINFO record with values 'ARMV5TEJL'/'LINUX'.
Dec 28 23:46:20 ok-cash daemon.info avahi-daemon[555]: Server startup complete. Host name is ok-cash.local. Local service cookie is 3176572848.
Dec 28 23:46:21 ok-cash auth.info sshd[569]: Server listening on 0.0.0.0 port 22.
Dec 28 23:46:21 ok-cash daemon.info avahi-daemon[555]: Service "ok-cash" (/etc/avahi/services/ssh.service) successfully established.
Dec 28 23:46:21 ok-cash daemon.info avahi-daemon[555]: Service "ok-cash" (/etc/avahi/services/sftp-ssh.service) successfully established.
Dec 28 23:49:56 ok-cash auth.info sshd[591]: Accepted password for mcon from 192.168.7.114 port 1081 ssh2
Dec 28 23:49:57 ok-cash daemon.info chronyd[494]: Selected source 212.45.144.88
Aug  5 09:01:52 ok-cash daemon.warn chronyd[494]: System clock was stepped by 302947914.986165 seconds
Aug  5 09:01:54 ok-cash authpriv.notice sudo:     mcon : TTY=pts/0 ; PWD=/home/mcon ; USER=root ; COMMAND=/bin/su -
Aug  5 09:01:55 ok-cash auth.notice su: + /dev/pts/0 mcon:root

<---- Here was the freeze ---->

Aug  5 09:02:22 ok-cash daemon.info chronyd[494]: Trimming RTC, error = -271148296.559 seconds

<---- the line above was NOT displayed till 09:08:39 (i.e.: the line above and below were displayed together) ---->

Aug  5 09:08:39 ok-cash daemon.warn chronyd[494]: Forward time jump detected!
Aug  5 09:08:40 ok-cash daemon.info chronyd[494]: Can't synchronise: no reachable sources
Aug  5 09:11:13 ok-cash daemon.info chronyd[494]: Selected source 212.45.144.88
Aug  5 09:12:03 ok-cash daemon.info chronyd[494]: Trimming RTC, error = -62.054 seconds
Aug  5 09:12:45 ok-cash daemon.info chronyd[494]: Trimming RTC, error = -4.557 seconds
Aug  5 09:13:28 ok-cash daemon.info chronyd[494]: Trimming RTC, error = -5.061 seconds
....

Note I had time to do several things with "correct date" before freeze; if I kill chronyd in this time freeze does *not* happen and time remains good.
it *seems* as if chronyd can successfully step System clock forward, then, at a later time (poll time seems to be 43s), it realizes it has to "Trimming RTC, error = -271148296.559 seconds" and there it stops to later recover with a "Forward time jump detected!".

Further comments inline below.


Il 05/08/2015 03:12, Bill Unruh ha scritto:
Maybe there is something which puts the system to sleep if it is inactive to a
long time. since a million seconds is certainly a long time, that thing could
decide you have been inactive and puts everything to sleep. Then when things
start to happen again, it wakes up
This does not really look true: kernel (at least lower levels) seems active (ping works and serial console input char are correctly buffered; also ssh connection is kept alive).
I was doing absolutely nothing on target that could "revive" it.

It seems like "Trimming RTC" takes a long time to execute when error is large; is that possible?

Can You suggest what I should try?
I'm currently out of ideas.

*NOTE* I know this won't happen in everyday usage, but I cannot bar long power outages or other things causing RTC to have very bogus values; I need to (at least) detect the situation as the "normal user" would start resetting the system much before one minute, let alone six!

William G. Unruh   |  Canadian Institute for|     Tel: +1(604)822-3273
Physics&Astronomy  |     Advanced Research  |     Fax: +1(604)822-5324
UBC, Vancouver,BC  |   Program in Cosmology | unruh@xxxxxxxxxxxxxx
Canada V6T 1Z1     |      and Gravity       | www.theory.physics.ubc.ca/

On Wed, 5 Aug 2015, Mauro Condarelli wrote:

Update inline below.

Il 04/08/2015 20:55, Mauro Condarelli ha scritto:
Hi,
Thanks for the answer.
Comments inline below.


Il 04/08/2015 17:50, Bill Unruh ha scritto:

On Tue, 4 Aug 2015, Miroslav Lichvar wrote:

On Mon, Aug 03, 2015 at 10:56:51PM +0200, Mauro Condarelli wrote:
....


2. In the above event, after several minutes, chrony announces it is going to step by several million seconds (as expected); shortly thereafter system dies... somewhat; i.e.: serial console and ssh are completely unresponsive, but "ping" gets an answer. Nothing is logged.

Hm, that's odd. Can you reproduce the problem by stepping the clock
manually with the date command, e.g. date -s '+ 1000000 sec' ?


In this context might there be an option for chrony to take the intial time
when starting up from the date and time on some file. eg --startfile <nameoffile>
If that directive is there then chrony would use the  mtime (plus say 60 sec)  from that file as the
startup file if the rtc time does not exist, or if the rtc time is earlier
than that mtime. That would solve the problem of an insanely early rtc time or
no rtc at all (rPi for example) and give a time which is at least not totally
crazy.  The problem with iburst is
that the clock is initially set already to a potentially insane time so the
system for a while has a problematic time.
Agreed.
Can You give me specific instructions?
As said this is the first time I attempt to use chrony at all.

Re the problem itself, ping is pretty deep down in the kernel, so much of the
kernel could be dead, and user programs dead and ping will still respond. But
the kernel should not die just because the clock is advanced by a few decades. Sounds like a kernel bug. What OS/Distribution is this again
I do not have a "distribution".
This is an embedded system based on ACME AriaG25.
I recompiled everything from scratch (including arm-gcc) using the Biuldroot framework.

/ # uname -a
Linux ok-cash 3.16.1 #1 Thu Jul 16 14:02:29 CEST 2015 armv5tejl GNU/Linux

System seems to be generically be well behaving, but I *cannot* exclude neither software build nor straight hardware bugs, unfortunately.

Brutally cutting power (mains *and* battery backup, so no graceful chrony termination *and* RTC reset) i get into the error condition:

...
Jan  1 01:00:09 ok-cash user.info kernel: [    3.093750] EXT4-fs (mmcblk0p6): mounted filesystem with ordered data mode. Opts: (null)
Jan  1 01:00:09 ok-cash daemon.info kernel: [    3.492187] udevd[476]: starting version 3.0
Jan  1 01:00:09 ok-cash user.notice kernel: [    3.515625] random: udevd urandom read with 85 bits of entropy available
Jan  1 01:00:11 ok-cash user.notice kernel: [    4.968750] random: nonblocking pool is initialized
Jan  1 01:00:11 ok-cash daemon.info chronyd[494]: chronyd version 1.31 starting
Jan  1 01:00:11 ok-cash daemon.err chronyd[494]: Could not open IPv6 NTP socket : Address family not supported by protocol
Jan  1 01:00:11 ok-cash daemon.err chronyd[494]: Could not open IPv6 command socket : Address family not supported by protocol
Jan  1 01:00:11 ok-cash daemon.info chronyd[494]: Set system time, error in RTC = 31906737.155476
Dec 27 18:01:14 ok-cash daemon.info chronyd[494]: Frequency -1.498 +/- 0.147 ppm read from /var/lib/chrony/drift
Dec 27 18:01:24 ok-cash daemon.info chronyd[494]: System trim from RTC = 0.691058
Dec 27 18:01:24 ok-cash daemon.info init: starting pid 506, tty '': '/bin/ash '
Dec 27 18:01:26 ok-cash user.info kernel: [   17.867187] macb f802c000.ethernet eth0: link up (100/Full)
Dec 27 18:01:29 ok-cash daemon.info avahi-daemon[555]: Found user 'avahi' (UID 1003) and group 'avahi' (GID 1000).
...
Dec 27 18:03:02 ok-cash auth.info sshd[605]: Accepted password for mcon from 192.168.7.114 port 16760 ssh2
Dec 27 18:03:05 ok-cash authpriv.notice sudo:     mcon : TTY=pts/1 ; PWD=/home/mcon ; USER=root ; COMMAND=/bin/su -
Dec 27 18:03:05 ok-cash auth.notice su: + /dev/pts/1 mcon:root
...
/ # date
Tue Dec 27 18:01:58 CET 2005
...
Dec 27 18:05:06 ok-cash daemon.info chronyd[494]: Selected source 193.204.114.233
Aug  4 20:47:36 ok-cash daemon.warn chronyd[494]: System clock was stepped by 303010949.791602 seconds

Here the system is virtually dead.
NOT really true.
Everything froze for a while (so ssh connections were dropped), but system recovered on its own after several minutes.
I actually discovered this because, before switching off my development machine, tried a ssh reconnect; it worked.
Here is an excerpt of /var/log/messages:

...
Dec 27 18:03:02 ok-cash auth.info sshd[605]: Accepted password for mcon from 192.168.7.114 port 16760 ssh2
Dec 27 18:03:05 ok-cash authpriv.notice sudo:     mcon : TTY=pts/1 ; PWD=/home/mcon ; USER=root ; COMMAND=/bin/su -
Dec 27 18:03:05 ok-cash auth.notice su: + /dev/pts/1 mcon:root
Dec 27 18:05:06 ok-cash daemon.info chronyd[494]: Selected source 193.204.114.233
Aug  4 20:47:36 ok-cash daemon.warn chronyd[494]: System clock was stepped by 303010949.791602 seconds

<-----  here is where ssh & serial console froze ----->

Aug  4 20:48:10 ok-cash daemon.info chronyd[494]: Trimming RTC, error = -271104240.791 seconds
Aug  4 20:54:28 ok-cash daemon.warn chronyd[494]: Forward time jump detected!
Aug  4 20:54:28 ok-cash daemon.info chronyd[494]: Can't synchronise: no reachable sources
Aug  4 20:56:41 ok-cash daemon.info chronyd[494]: Selected source 193.204.114.233
Aug  4 20:56:49 ok-cash daemon.info chronyd[494]: Trimming RTC, error = -55.032 seconds
Aug  4 20:57:31 ok-cash daemon.info chronyd[494]: Trimming RTC, error = -4.492 seconds
Aug  4 20:58:13 ok-cash daemon.info chronyd[494]: Trimming RTC, error = -4.957 seconds
Aug  4 20:58:56 ok-cash daemon.info chronyd[494]: Trimming RTC, error = -5.418 seconds
Aug  4 20:59:38 ok-cash daemon.info chronyd[494]: Trimming RTC, error = -4.880 seconds
Aug  4 21:00:21 ok-cash daemon.info chronyd[494]: Trimming RTC, error = -5.343 seconds
Aug  4 21:01:03 ok-cash daemon.info chronyd[494]: Trimming RTC, error = -4.816 seconds

I will do some tests tomorrow to see if and after how much time the serial console will come back to life.
Anyways a hiatus of almost one minute (at least) does not seem very healthy, even if system recovers.
Does anyone have an idea about that could be happening?
I can post the whole startup sequence, if deemed useful.

Pretty please HEEEELP!!!

TiA
Mauro



--
To unsubscribe email chrony-users-request@xxxxxxxxxxxxxxxxxxxx with "unsubscribe" in the subject.
For help email chrony-users-request@xxxxxxxxxxxxxxxxxxxx with "help" in the subject.
Trouble?  Email listmaster@xxxxxxxxxxxxxxxxxxxx.




--
To unsubscribe email chrony-users-request@xxxxxxxxxxxxxxxxxxxx with "unsubscribe" in the subject. For help email chrony-users-request@xxxxxxxxxxxxxxxxxxxx with "help" in the subject.
Trouble?  Email listmaster@xxxxxxxxxxxxxxxxxxxx.


Mail converted by MHonArc 2.6.19+ http://listengine.tuxfamily.org/