RE: [chrony-users] chronyd.service doesn't have Restart=on-failure?



This box is reporting detailed metrics through AWS CloudWatch, but there's a 10 minute hole in reporting data during the event. This event happened in mid-December, so my CloudWatch logging detail is reduced to 5 minute windows.

What I can see is that we were at 88%+ memory usage and mid-50% CPU usage during the period leading up to the failure and immediately afterwards. I do have detailed syslog data, though: 10 minutes before chronyd died, clamav also died with an error related to an out-of-memory condition. There's some other evidence (consul logs on other boxes) indicating that other instances were having trouble reaching the problem instance. Something was up with the box, obviously.

My working theory is that this problem occurred because chronyd lost network connectivity, which would be conceptually very similar to losing name resolution. Replicating the behavior would take more effort than I have time for, but I think setting the time server to some unreachable IP and setting maxpoll equal to minpoll (or perhaps just to 4, regardless of what minpoll is) should be sufficient.
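
A minimal chrony.conf sketch of that reproduction setup, untested. The address 192.0.2.1 is an assumption: it comes from the range reserved for documentation and should not be reachable, so it stands in for "some unreachable IP".

```
# Hypothetical test configuration, not the one from the affected box.
# 192.0.2.1 is a documentation-only address (assumed unreachable).
# minpoll/maxpoll are powers of two in seconds: 4 means 2^4 = 16 s.
server 192.0.2.1 minpoll 4 maxpoll 4
```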

Since then we've changed the maxpoll setting and doubled the instance's memory (4GB to 8GB), but we were never able to figure out exactly what happened. It seemed transient, but the fact that chronyd did not even attempt to restart surprised me.

Perhaps some more documentation about making sure that maxpoll is some number larger than minpoll?

--Jamie

Jamie Gruener | Director of IT & Security, Biospatial, Inc. | 919-624-9760 | jamie.gruener@xxxxxxxxxxxxx

-----Original Message-----
From: Miroslav Lichvar <mlichvar@xxxxxxxxxx> 
Sent: Monday, January 11, 2021 10:38 AM
To: chrony-users@xxxxxxxxxxxxxxxxxxxx
Subject: Re: [chrony-users] chronyd.service doesn't have Restart=on-failure?



On Mon, Jan 11, 2021 at 01:44:03PM +0000, Jamie Gruener wrote:
> Your comment about timeouts on top of timeouts is what I was thinking, too, and with maxpoll the same as minpoll, it wouldn't take long for us to run out of timeouts--only 16 seconds. I think this means that if chronyd can't reach a timeserver within maxpoll, it'll generate this error. Apparently this doesn't happen very often because Google produces nearly zero hits for that error. In that sense it is working as configured. Having a short maxpoll was clearly a mistake.

The minpoll and maxpoll options control chronyd's polling interval. If the source stops responding, the polling interval will slowly move to the maxpoll, but that shouldn't cause any issues. That's an expected state. I think the fatal error could happen even if the source was responding. The issue seems to be in the processing of the timeouts when it is so slow that received packets cannot be processed between them.

The smaller minpoll and maxpoll values might make it more likely to trigger the error, but I don't see how it could happen in a normal operation. I can only reproduce it in a debugger, or by inserting a sleep in the right place.

Basically, the execution of the chronyd process needs to slow down so much that, after a new timeout is added, the subsequent check for timeouts still waiting to be processed sees the new timeout as already late. This needs to happen several times in a row. It's not just stopping and resuming the process. It needs to stop at the right place.

Was there anything else happening around that time when it crashed?
I'm not sure what that could be.

The only other report of this error that I know was found to be a bug triggered by slow name resolving. That was many years ago when chronyd didn't even resolve names asynchronously in a separate thread.

> But I think the lines:
> > Restart=on-failure
> > RestartSec=30s
> should be added in the [Service] section. If chronyd fails for any reason, given how important time is, I can't think why it wouldn't try to restart.

The issue is that restarting chronyd will allow the system clock to be stepped again. That can be a serious issue in some environments. It could be suppressed by the -R option, but there doesn't seem to be a way to add it to the command line of the restarted chronyd. If you are ok with that, your custom unit file should work fine. But I'd like chronyd to not be so buggy that it is necessary to restart it automatically.
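
For reference, a custom override along the lines proposed above could be sketched as a systemd drop-in. The file path is an assumption (the usual drop-in location for a unit named chronyd.service); the two directives are the ones quoted above.

```
# /etc/systemd/system/chronyd.service.d/override.conf  (assumed path)
# Caveat from above: a restarted chronyd may step the system clock again.
[Service]
Restart=on-failure
RestartSec=30s
```

After creating the file, "systemctl daemon-reload" makes systemd pick it up; "systemctl edit chronyd.service" is another way to create such a drop-in.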

--
Miroslav Lichvar


--
To unsubscribe email chrony-users-request@xxxxxxxxxxxxxxxxxxxx
with "unsubscribe" in the subject.
For help email chrony-users-request@xxxxxxxxxxxxxxxxxxxx
with "help" in the subject.
Trouble?  Email listmaster@xxxxxxxxxxxxxxxxxxxx.


