Re: [chrony-users] Server failover



Instead of the local directive I'd suggest to use the same refclock
directive but modified with "stratum 14 delay 1".

Just to follow up, this ended up working well, thanks! I added a chronyd-shutdown.service systemd unit whose ExecStop runs chronyd with the -t and -f switches, so that after chronyd.service shuts down, chronyd runs for ~2 poll intervals with the alternate config.
 
> We use redundant time servers in each of our datacenters, and we'd like to
> actively switch away from time servers before their scheduled reboots,
> because the estimated error that accumulates on our clients while waiting
> to notice that a server has gone away can be too high for our standards.

As Kevin explained, ideally clients should handle that on their own. I
understand you have tight requirements on accuracy. Have you tried
setting reselectdist on clients to 10 microseconds or less?

That didn't really affect things in our case, even when setting it to 1us or zero. (I did, however, manage to confuse myself for a while by setting it to 10s and effectively breaking reselect.)
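
For reference, the chrony.conf documentation describes reselectdist as a fixed distance added to every source that is not currently selected. The rough Python sketch below models only that penalty, to show why a value of 10 s leaves no realistic way for another source to win on distance. It is not chronyd's code and it ignores the score mechanism entirely; the function names and the sample distances are made up for illustration.

```python
def effective_distance(distance, is_selected, reselectdist):
    """Distance used for comparison: unselected sources pay a fixed penalty."""
    return distance if is_selected else distance + reselectdist


def best_source(distances, selected, reselectdist):
    """Return the name of the source with the smallest penalised distance."""
    return min(
        distances,
        key=lambda name: effective_distance(
            distances[name], name == selected, reselectdist
        ),
    )


# Hypothetical distances in seconds: server "a" is selected but has degraded.
distances = {"a": 0.005, "b": 0.0002}
print(best_source(distances, "a", 100e-6))  # small penalty: "b" wins
print(best_source(distances, "a", 10.0))    # 10 s penalty: "a" stays ahead
```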

I did some experimenting and read the code a bit. What I seem to be seeing is that after "chronyc offline somehost", if that host had the shortest distance (which it generally will if it was selected), it keeps the shortest distance most of the time, indefinitely, so the scores in selectdata stay near 1.0. They rarely reached the 10.0 threshold after I offlined the selected host, and the host stayed selected until it went SRC_STALE after around 15 minutes.

This was a bit surprising to me.

1. I would have figured that offlining a host should cause us to deselect it immediately and not treat it as a candidate, at least if there are other hosts we could use. As far as I can tell, only "delete" will work for that.
2. If we're getting updates from our other trusted and preferred server(s) every 8-16 seconds and we haven't heard from the selected one in a minute or two, I'd really like to just switch servers at that point. "Newest update is older than the oldest update of all other servers" seems like a really high bar for switching away from a machine that has fallen over.
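
To make point 2 concrete, here is a minimal Python sketch comparing the two rules. It only models the behaviour as described in this message and is not chronyd's code; the function names, the 120 s silence threshold, and the timestamps are all hypothetical.

```python
def stale_by_current_rule(selected_updates, other_updates):
    """Rule as described above: the selected source is dropped only once its
    newest update is older than the oldest update held for every other source."""
    newest_selected = max(selected_updates)
    return all(newest_selected < min(updates) for updates in other_updates)


def stale_by_proposed_rule(selected_updates, other_updates, now, max_silence=120.0):
    """Proposed rule: drop the selected source once it has been silent for more
    than max_silence seconds while at least one other source has updated since."""
    newest_selected = max(selected_updates)
    silent = now - newest_selected > max_silence
    others_fresh = any(max(updates) > newest_selected for updates in other_updates)
    return silent and others_fresh


# Hypothetical timestamps in seconds. The selected server went quiet at t=100;
# the other server keeps answering every 16 s and holds samples back to t=20.
selected = [52, 68, 84, 100]
other = [list(range(20, 261, 16))]
print(stale_by_current_rule(selected, other))            # False
print(stale_by_proposed_rule(selected, other, now=260))  # True
```

In this made-up timeline the selected server has been silent for 160 s, yet the first rule still keeps it because the other server's sample history reaches back before t=100.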


