Re: [chrony-users] Chrony malfunction under ARM/Buildroot

[ Thread Index | Date Index | More chrony.tuxfamily.org/chrony-users Archives ]




William G. Unruh   |  Canadian Institute for|     Tel: +1(604)822-3273
Physics&Astronomy  |     Advanced Research  |     Fax: +1(604)822-5324
UBC, Vancouver,BC  |   Program in Cosmology |     unruh@xxxxxxxxxxxxxx
Canada V6T 1Z1     |      and Gravity       |  www.theory.physics.ubc.ca/

On Thu, 14 Apr 2016, Mauro Condarelli wrote:

Thanks Bill,
comments inline below.

Il 14/04/2016 21:08, Bill Unruh ha scritto:

On Thu, 14 Apr 2016, Mauro Condarelli wrote:

Hi,
I'm having major problems using Chrony in an embedded system sporting an oldish Atmel AT91SAM9G25.

I am using chrony-2.3 compiled with "--without-seccomp --without-tomcrypt --without-nss" under Linux 4.4.1.

/etc/chrony.conf is:
============================
#initstepslew 30 2.it.pool.ntp.org 0.pool.ntp.org 1.pool.ntp.org

server 0.it.pool.ntp.org
server 1.it.pool.ntp.org
server 2.pool.ntp.org
server 3.pool.ntp.org

logdir /var/log/chrony
log rtc statistics measurements tracking
logchange 1
driftfile /var/lib/chrony/drift
keyfile /etc/chrony.keys
generatecommandkey
makestep 1.0 3
maxupdateskew 100.0
dumponexit
dumpdir /var/lib/chrony
rtconutc
rtcautotrim 1
rtcfile /var/lib/chrony/rtc
bindcmdaddress /var/run/chrony/chronyd.sock
============================

I have two (possibly related) malfunctions:
1) if and when the RTC is severely off (complete loss of power, including battery backup) the system comes up with a specific date (Jan, 11th 2007) and after a while (i.e.: after chrony had the time to synchronize over the network) system seem to hang with very few systems actually working (notably the system answers to "ping") so that it seems completely dead. It turns out it is not really dead, but only in "suspended animation" as it always recovers completely (with correct date & time) after a LOOOOooonggggg time (>5min).

With no rtc ( which is a hardware problem not a chrony problem) there is no
way the system can know what the time is or how to make sure everything is
always monotonic. You will just have to make sure that the problem never
AFAIK chrony (I start it as: "chronyd -r -s") should look for the "driftfile" (present in my system) and set the time&date from it; this does not seem to happen.

That is a driftfile for the rtc. If the rtc is way way out, it has nothing to
fix-- ie the fix is nonesense. If your rtc is as flaky as you imply, it might
be better not to use the rtc at all. AFAI remember, if there is not rtc (ie
the rtcfile directive) chrony will look at a few files, and use the dates from
them to set the current time. That will be out (depending on how long th
emachine has been  off) but it will be in the past, so your concern about the
system always going forward will be satisfied.


occurs. I have never seen chrony "hang" the system. There is something else going on. Have you looked at all the logs-- dmesg, journalctl or var/log/messages, ... to see what is going on? It may be that some program is noticing that chrony
is changing the time, and goes into a snit.
I find no message whatsoever in any of the logfiles.
When I see this "hang" it really seems everything freezes for a long while.
Without errors and without consequences (aside from the freeze itself, of course).



2) if the system comes up well (chrony seems to work as advertised) I sometimes get very bad system crashes (see below).

Neither of these conditions happen if I kill chrony early (before it has any chance to synchronize) or if it is inactive (e.g.: if system is disconnected from the internet).
As said this seems linked to chrony as neither condition happens if chrony is not running.

Yes, but it still might not be chrony. On the other hand the log file you
showed suggests that it is the rtc that is causing trouble. (Is it HPET? or
what is the rtc on your system).

I will try substituting nntp/ntpdate in order to see if the problem is in touching time (i.e.: it is in the kernel itself).

The logs suggest that it is a problem with the rtc. You might well be better
off not using the rtc at all. Is it a separate card?





What should I check?
Can someone help me debugging this?

NOTE: the need is to have a system with strictly monotonic time (i.e.: time can jump ahead, but can never jump back for any reason, even across reboots).
If necessary I can test with specific patches... or change completely.

Thanks in Advance
Mauro

============================
[ 232.117187] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1:298]

OK, that has nothing to do with chrony. What are you running?
Nothing specific.
This happens when chrony gets the new internet time.
Actually at that time there is a lot of "normal" activity going on as this happens right after dhcpcd negotiated a new IP address;
I have the following /etc/dhcpcd.exit-hook (as advised by Miroslav Lichvar):
============================
#!/bin/ash
echo >>/dev/ttyS0 "------------ $reason ($interface)($if_up)($if_down) -----------"
case $reason in
NOCARRIER)
   echo >>/dev/ttyS0 "--------- chrony offline ---------"
   /usr/bin/chronyc offline
   ;;
BOUND|REBOOT)
  echo >>/dev/ttyS0 "--------- chrony  online ---------"
  (  sleep 1 \
     && /usr/bin/chronyc online \
     && /usr/bin/chronyc burst 4/4 \
     && sleep 10 \
&& (/usr/bin/chronyc tracking | grep -q 'System time.*slow' && chronyc makestep ) \
   ) &
   ;;
esac
============================

[ 232.117187] Modules linked in: cfg80211 usb_f_acm usb_f_serial libcomposite configfs u_serial atmel_usba_udc [ 232.117187] CPU: 0 PID: 298 Comm: kworker/0:1 Tainted: G W 4.4.1 #1
[  232.117187] Hardware name: Atmel AT91SAM9
[  232.117187] Workqueue: events rtc_timer_do_work

So your rtc seems to be having trouble.
Agreed


[  232.117187] task: c7a6f420 ti: c799c000 task.ti: c799c000
[  232.117187] PC is at timerqueue_del+0x1c/0x84
[  232.117187] LR is at rtc_timer_do_work+0xac/0x1a8
[  232.117187] pc : [<c01c8e2c>]    lr : [<c02c43fc>] psr: 00000013
[  232.117187] sp : c799de00  ip : c799de20  fp : c799de1c
[  232.117187] r10: 00000000  r9 : c7b7ea20  r8 : c7edee00
[  232.117187] r7 : c7b7e9e0  r6 : c7b7e978  r5 : c7b7e9e0  r4 : c7b7ea20
[  232.117187] r3 : 00000001  r2 : 04405600  r1 : c7b7ea20  r0 : c7b7e9e0
[ 232.117187] Flags: nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[  232.117187] Control: 0005317f  Table: 273dc000  DAC: 00000053
[ 232.117187] CPU: 0 PID: 298 Comm: kworker/0:1 Tainted: G W 4.4.1 #1
[  232.117187] Hardware name: Atmel AT91SAM9
[  232.117187] Workqueue: events rtc_timer_do_work
[ 232.117187] [<c0010468>] (unwind_backtrace) from [<c000dad0>] (show_stack+0x20/0x24) [ 232.117187] [<c000dad0>] (show_stack) from [<c01c10c0>] (dump_stack+0x20/0x28) [ 232.117187] [<c01c10c0>] (dump_stack) from [<c000afa4>] (show_regs+0x1c/0x20) [ 232.117187] [<c000afa4>] (show_regs) from [<c006a748>] (watchdog_timer_fn+0x150/0x1a4) [ 232.117187] [<c006a748>] (watchdog_timer_fn) from [<c00515b8>] (__hrtimer_run_queues.constprop.5+0x140/0x21c) [ 232.117187] [<c00515b8>] (__hrtimer_run_queues.constprop.5) from [<c005181c>] (hrtimer_interrupt+0xa0/0x1c0) [ 232.117187] [<c005181c>] (hrtimer_interrupt) from [<c02e26b8>] (ch2_irq+0x30/0x38) [ 232.117187] [<c02e26b8>] (ch2_irq) from [<c00469e8>] (handle_irq_event_percpu+0x7c/0x1dc) [ 232.117187] [<c00469e8>] (handle_irq_event_percpu) from [<c0046b80>] (handle_irq_event+0x38/0x4c) [ 232.117187] [<c0046b80>] (handle_irq_event) from [<c0049a5c>] (handle_fasteoi_irq+0x9c/0x110) [ 232.117187] [<c0049a5c>] (handle_fasteoi_irq) from [<c0046188>] (generic_handle_irq+0x28/0x38) [ 232.117187] [<c0046188>] (generic_handle_irq) from [<c0046440>] (__handle_domain_irq+0x8c/0xb0) [ 232.117187] [<c0046440>] (__handle_domain_irq) from [<c000948c>] (aic_handle+0xb8/0xc0) [ 232.117187] [<c000948c>] (aic_handle) from [<c000e524>] (__irq_svc+0x44/0x58)
[  232.117187] Exception stack(0xc799ddb0 to 0xc799ddf8)
[  232.117187] dda0: c7b7e9e0 c7b7ea20 04405600 00000001
[ 232.117187] ddc0: c7b7ea20 c7b7e9e0 c7b7e978 c7b7e9e0 c7edee00 c7b7ea20 00000000 c799de1c
[  232.117187] dde0: c799de20 c799de00 c02c43fc c01c8e2c 00000013 ffffffff
[ 232.117187] [<c000e524>] (__irq_svc) from [<c01c8e2c>] (timerqueue_del+0x1c/0x84) [ 232.117187] [<c01c8e2c>] (timerqueue_del) from [<c02c43fc>] (rtc_timer_do_work+0xac/0x1a8) [ 232.117187] [<c02c43fc>] (rtc_timer_do_work) from [<c002c914>] (process_one_work+0x208/0x388) [ 232.117187] [<c002c914>] (process_one_work) from [<c002d720>] (worker_thread+0x2d8/0x430) [ 232.117187] [<c002d720>] (worker_thread) from [<c0031ce4>] (kthread+0xe0/0xf4) [ 232.117187] [<c0031ce4>] (kthread) from [<c000a430>] (ret_from_fork+0x14/0x24)
[  232.117187] Kernel panic - not syncing: softlockup: hung tasks
[  232.117187] CPU: 0 PID: 298 Comm: kworker/0:1 Tainted: G W L  4.4.1 #1
[  232.117187] Hardware name: Atmel AT91SAM9
[  232.117187] Workqueue: events rtc_timer_do_work
[ 232.117187] [<c0010468>] (unwind_backtrace) from [<c000dad0>] (show_stack+0x20/0x24) [ 232.117187] [<c000dad0>] (show_stack) from [<c01c10c0>] (dump_stack+0x20/0x28) [ 232.117187] [<c01c10c0>] (dump_stack) from [<c008c2b0>] (panic+0x88/0x1f0) [ 232.117187] [<c008c2b0>] (panic) from [<c006a774>] (watchdog_timer_fn+0x17c/0x1a4) [ 232.117187] [<c006a774>] (watchdog_timer_fn) from [<c00515b8>] (__hrtimer_run_queues.constprop.5+0x140/0x21c) [ 232.117187] [<c00515b8>] (__hrtimer_run_queues.constprop.5) from [<c005181c>] (hrtimer_interrupt+0xa0/0x1c0) [ 232.117187] [<c005181c>] (hrtimer_interrupt) from [<c02e26b8>] (ch2_irq+0x30/0x38) [ 232.117187] [<c02e26b8>] (ch2_irq) from [<c00469e8>] (handle_irq_event_percpu+0x7c/0x1dc) [ 232.117187] [<c00469e8>] (handle_irq_event_percpu) from [<c0046b80>] (handle_irq_event+0x38/0x4c) [ 232.117187] [<c0046b80>] (handle_irq_event) from [<c0049a5c>] (handle_fasteoi_irq+0x9c/0x110) [ 232.117187] [<c0049a5c>] (handle_fasteoi_irq) from [<c0046188>] (generic_handle_irq+0x28/0x38) [ 232.117187] [<c0046188>] (generic_handle_irq) from [<c0046440>] (__handle_domain_irq+0x8c/0xb0) [ 232.117187] [<c0046440>] (__handle_domain_irq) from [<c000948c>] (aic_handle+0xb8/0xc0) [ 232.117187] [<c000948c>] (aic_handle) from [<c000e524>] (__irq_svc+0x44/0x58)
[  232.117187] Exception stack(0xc799ddb0 to 0xc799ddf8)
[  232.117187] dda0: c7b7e9e0 c7b7ea20 04405600 00000001
[ 232.117187] ddc0: c7b7ea20 c7b7e9e0 c7b7e978 c7b7e9e0 c7edee00 c7b7ea20 00000000 c799de1c
[  232.117187] dde0: c799de20 c799de00 c02c43fc c01c8e2c 00000013 ffffffff
[ 232.117187] [<c000e524>] (__irq_svc) from [<c01c8e2c>] (timerqueue_del+0x1c/0x84) [ 232.117187] [<c01c8e2c>] (timerqueue_del) from [<c02c43fc>] (rtc_timer_do_work+0xac/0x1a8) [ 232.117187] [<c02c43fc>] (rtc_timer_do_work) from [<c002c914>] (process_one_work+0x208/0x388) [ 232.117187] [<c002c914>] (process_one_work) from [<c002d720>] (worker_thread+0x2d8/0x430) [ 232.117187] [<c002d720>] (worker_thread) from [<c0031ce4>] (kthread+0xe0/0xf4) [ 232.117187] [<c0031ce4>] (kthread) from [<c000a430>] (ret_from_fork+0x14/0x24)
[  232.117187] ---[ end Kernel panic - not syncing: softlockup: hung tasks
============================

So it looks like there is something seriously wrong with your rtc. Hardware.
RTC is internal to MCU, so it might be misprogrammed, but I strongly doubt there is a real hardware issue, also because I see this problem in multiple boards.
I will cross-check initializations in both bootloader and .dts
If You have specific suggestion please feel free to advise.

I think I have reached the limits of my competence.






--
To unsubscribe email chrony-users-request@xxxxxxxxxxxxxxxxxxxx with "unsubscribe" in the subject. For help email chrony-users-request@xxxxxxxxxxxxxxxxxxxx with "help" in the subject.
Trouble?  Email listmaster@xxxxxxxxxxxxxxxxxxxx.


Thanks for the prompt answer.
Mauro

--
To unsubscribe email chrony-users-request@xxxxxxxxxxxxxxxxxxxx with "unsubscribe" in the subject. For help email chrony-users-request@xxxxxxxxxxxxxxxxxxxx with "help" in the subject.
Trouble?  Email listmaster@xxxxxxxxxxxxxxxxxxxx.


--
To unsubscribe email chrony-users-request@xxxxxxxxxxxxxxxxxxxx with "unsubscribe" in the subject. For help email chrony-users-request@xxxxxxxxxxxxxxxxxxxx with "help" in the subject.
Trouble?  Email listmaster@xxxxxxxxxxxxxxxxxxxx.


Mail converted by MHonArc 2.6.19+ http://listengine.tuxfamily.org/