Re: [chrony-users] Silent Failure -- Enhancement Request
On 2024-04-19 16:40, Chris Knox wrote:
> Bryah, thanks for the answer. Yes, now that we have the scars, we're
> monitoring chronyd's health carefully. But my question goes a bit

Glad you're back up and running.
Just to make sure, since the details and constraints of your
operational setup have not been mentioned yet: I take it you have seen
the "Installation, configuration, and monitoring" section on
https://chrony-project.org/links.html ?
It contains many pointers to third-party monitoring & alerting
tools. In particular the chrony_exporter for Prometheus, in combination
with Alertmanager, is just plain great and flexible enough for
any conceivable operational process.
In fact based on this thread I filed an issue:
https://github.com/SuperQ/chrony_exporter/issues/75
earlier today and it already resulted in a PR:
https://github.com/SuperQ/chrony_exporter/pull/76
(..just in case anybody else using Prometheus is reading this :)
Fundamentally it's not clear what chrony can or should do when
upstream servers are not available, because it's a bottomless pit
of compounding rules, problems and workarounds, all of which are
very environment- and process-dependent.
So instead of relying on a human to read syslog, it's IMHO probably
more reliable and stress-free to let a machine do the job of reading
the existing statistics, aggregating a metric that distinguishes a
warning from an error caused by wonky network delays, switch reboots
or data center movements, and then acting according to *your* specific
processes (email, SMS, reboot..).
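
To make that idea concrete, here is a rough Python sketch (not
anything chrony or chrony_exporter ships) of turning one line of
`chronyc -c tracking` output into an ok/warning/error state. The
assumptions are mine: that the fourth CSV field is the reference time
in seconds since the epoch, that the last field is the leap status,
and that an unsynchronised daemon reports "Not synchronised" there.
The two thresholds and both function names are made up for
illustration; check the field layout against your chronyc version
before relying on it.

```python
import time

# Hypothetical thresholds; tune them to your own environment and process.
WARN_AFTER_S = 1800.0    # no clock update for 30 minutes: warning
ERROR_AFTER_S = 7200.0   # no clock update for 2 hours: error


def parse_tracking_csv(line):
    """Extract two fields from one line of `chronyc -c tracking` output.

    Assumed layout: field 4 is the reference time (epoch seconds) and
    the last field is the leap status.
    """
    fields = line.strip().split(",")
    return {"ref_time": float(fields[3]), "leap_status": fields[-1]}


def classify(tracking, now=None):
    """Return "ok", "warning" or "error" from the age of the last update."""
    now = time.time() if now is None else now
    age = now - tracking["ref_time"]
    # Assumed leap status string for an unsynchronised clock.
    if tracking["leap_status"] == "Not synchronised" or age >= ERROR_AFTER_S:
        return "error"
    if age >= WARN_AFTER_S:
        return "warning"
    return "ok"
```

A cron job or exporter could feed the result into whatever alerting
you already have. A short blip (one missed poll after a switch reboot)
stays "ok", while a prolonged loss of all sources escalates from
"warning" to "error".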
cheers
Holger