Re: [chrony-users] Chrony forgets servers (specified by FQDN) when no DNS server



On 12/20/2017 11:51 AM, Rob Janssen wrote:
A time server that uses DNS based rules for reference servers should fail gracefully when the DNS does not return an IP address (anymore).  So, when it does a lookup only once it should issue an error message about that server, and proceed its startup as if that server was never there in the configuration.  When it is resolving DNS names on a regular basis (e.g. once per day), it could keep the server configuration and keep retrying the DNS lookup at that same interval and start using the server when the DNS lookup succeeds.
Not starting the service at all is only an option when all the DNS lookups have failed (i.e. there is no server) and there is no mechanism to re-try the lookups.  When there is, it is much better to keep the service running. (after all, a network may not be available at boot time and may become available later)

I find this statement of behavior (treat NOSERV/NXDOMAIN as an excuse to forget a server/peer/pool) a bit astonishing, and very un-Unix-like.

Let's make some assumptions:
1. The daemon software has, in its data structures for server/peer/pool, the FQDN for each server and peer.
2. The daemon software, on NXDOMAIN or no answer, sets the IP address to zeros (0x00000000 for IPv4, and 00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00 for IPv6).
3. All information about the server/peer/pool entry is in the data structure, such as filter data.
4. The polling loop is able to fork a process to perform DNS lookups. (This may not necessarily be true with Windows.)

So the standard polling loop uses the poll timing specified in the server/peer/pool command for all servers, peers, and pools, initialized or not. If the poll interval has expired for a given server/peer/pool entry, it does this:
a. IP address zero: reset the poll interval to minpoll, and fork a process to do the DNS lookup. The forked process performs the lookup, and on success fills in the IP address and sets the first-time flag so the polling loop will pick it up in the next cycle.
b. IP address non-zero and first-time flag set: do what the server currently does with a new server or peer entry.
c. IP address non-zero and first-time flag not set: do what it does now.
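
To make the three cases concrete, here is a rough sketch in Python (not chrony's actual code, which is C). Everything in it is made up for illustration: the Entry record, the poll_step and lookup_succeeded names, and the three callbacks standing in for "start a DNS lookup", "initialize a new source", and "poll as usual". None stands in for the all-zero address.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Entry:
    """Hypothetical per-source record; not chrony's real data structure."""
    fqdn: str
    minpoll: int = 6
    poll: int = 6
    address: Optional[str] = None   # None plays the role of the zero address
    first_time: bool = False
    lookup_pending: bool = False


def poll_step(entry: Entry,
              start_lookup: Callable[[Entry], None],
              init_source: Callable[[Entry], None],
              poll_source: Callable[[Entry], None]) -> str:
    """Run once per entry when its poll interval has expired."""
    if entry.address is None:
        # Case a: unresolved. Fall back to minpoll and request a lookup,
        # but never start a second one while the first is still running.
        entry.poll = entry.minpoll
        if not entry.lookup_pending:
            entry.lookup_pending = True
            start_lookup(entry)
        return "lookup"
    if entry.first_time:
        # Case b: freshly resolved. Treat it like a brand-new source.
        entry.first_time = False
        init_source(entry)
        return "init"
    # Case c: ordinary poll.
    poll_source(entry)
    return "poll"


def lookup_succeeded(entry: Entry, address: str) -> None:
    """Called when the lookup helper reports an A or AAAA record."""
    entry.address = address
    entry.first_time = True
    entry.lookup_pending = False
```

The config parser and chronyc(1) would only ever need to create an Entry with no address; the same poll_step then carries it through lookup, initialization, and normal polling.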

Forking a process means that the daemon's polling loop doesn't lock up the daemon on the DNS lookup when there is no DNS available, or when it takes a double-handful of seconds to get NOSERV or NXDOMAIN. (If a process is already forked for an entry, then don't fork it again; wait for the forked process to die.) If/when the forked process gets a successful A or AAAA record, it sets it in the data structure for the entry so that the polling loop will pick it up on the next poll interval expiration.
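
A small sketch of the "don't block, don't start it twice" rule, again in Python. Note that it swaps in a worker thread where I said forked process, purely to keep the example short and portable; the LookupManager class and its request/collect methods are invented names. The only real library calls are socket.getaddrinfo and concurrent.futures.

```python
import socket
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Optional


def resolve(fqdn: str) -> Optional[str]:
    """Blocking lookup; returns the first address, or None on failure."""
    try:
        infos = socket.getaddrinfo(fqdn, 123, type=socket.SOCK_DGRAM)
    except OSError:          # socket.gaierror is a subclass of OSError
        return None
    return infos[0][4][0] if infos else None


class LookupManager:
    """Hypothetical helper that keeps slow lookups out of the polling loop."""

    def __init__(self, resolver: Callable[[str], Optional[str]] = resolve):
        self._resolver = resolver
        self._pool = ThreadPoolExecutor(max_workers=4)
        self._pending: Dict[str, Future] = {}

    def request(self, fqdn: str) -> bool:
        """Start a lookup unless one is already in flight for this name."""
        if fqdn in self._pending:
            return False
        self._pending[fqdn] = self._pool.submit(self._resolver, fqdn)
        return True

    def collect(self) -> Dict[str, Optional[str]]:
        """Return finished lookups without waiting for unfinished ones."""
        finished = [name for name, fut in self._pending.items() if fut.done()]
        results: Dict[str, Optional[str]] = {}
        for name in finished:
            results[name] = self._pending.pop(name).result()
        return results
```

The polling loop would call request() for an unresolved entry and collect() once per cycle; a None result simply means "still no address, try again at the next poll".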

Also note that it eliminates special start-up code. The config file parser fills in the data structure for each server/peer with zero IP address, and the polling loop handles the lookup and initialization. This also works with chronyc(1): it causes chronyd(8) to build the new data structure, and the polling loop does the rest. When you use chronyc(1) to remove a server or peer, chronyd(8) just removes the data structure for that entry. Poof.

And that's how I would remove chrony's current astonishing behavior in the face of DNS not being there at start-up. My power-fail situation is an example: the edge router with chronyd(8) comes up before the CSU/DSU to the network. Enterprise users might be surprised to learn about this astonishing forgetfulness of chronyd(8) in the face of a temporary failure.

How to handle entries where the NTP server has gone away?

Keep a TTL timer, set by an entry in the configuration file. (A reasonable default would be 24 hours.) When "reach" is not 0x00, reset the TTL timer. When the TTL timer expires, clear the filter variables, set the poll to minpoll, zero the IP address, and reset the TTL timer.
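
The TTL rule is small enough to sketch as well. This is only an illustration under my own assumptions: the Entry fields, the update_ttl name, and the idea of passing the current time in as a number of seconds are all hypothetical, and None again stands in for the zeroed address.

```python
from dataclasses import dataclass, field
from typing import List, Optional

DEFAULT_TTL = 24 * 3600  # seconds; the 24-hour default suggested above


@dataclass
class Entry:
    """Hypothetical per-source record; not chrony's real data structure."""
    fqdn: str
    minpoll: int = 6
    poll: int = 6
    address: Optional[str] = None
    reach: int = 0                    # reachability register
    ttl: float = DEFAULT_TTL
    ttl_deadline: float = 0.0
    filter_samples: List[float] = field(default_factory=list)


def update_ttl(entry: Entry, now: float) -> bool:
    """Apply the TTL rule once; return True if the entry was reset."""
    if entry.reach != 0:
        # The server answered recently: push the deadline out again.
        entry.ttl_deadline = now + entry.ttl
        return False
    if now < entry.ttl_deadline:
        # Unreachable, but still within the grace period.
        return False
    # TTL expired: forget what we learned and force a fresh DNS lookup.
    entry.filter_samples.clear()
    entry.poll = entry.minpoll
    entry.address = None
    entry.ttl_deadline = now + entry.ttl
    return True
```

Once the address is cleared, the entry looks exactly like one fresh from the config file, so the normal polling loop re-resolves it with no extra code.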

The rationale for this method of handling extended tempfail is the same rationale used for SMTP daemons: wait somewhat impatiently for the remote server to come back, and if it doesn't come back in a reasonable time then bounce the mail.

From the standpoint of NTP protocol, a server that is out of service for an extended time may have different properties when it comes back on-line. (Replaced, for example.) So the filter variables would contain bogus data, particularly in a pool situation where you were originally talking to a "close" server, and now switched to a "far" server.

(And, it eliminates the need for a separate "pool" command, which would help some distribution sources (<cough> Red Hat) who use "server" when they mean "pool" in their default configurations.)

If this should be moved to chrony-dev, I can do that.
