[chrony-users] RE: Chrony is making NTP Server offline on the system boot |
[ Thread Index |
Date Index
| More chrony.tuxfamily.org/chrony-users Archives
]
- To: "chrony-users@xxxxxxxxxxxxxxxxxxxx" <chrony-users@xxxxxxxxxxxxxxxxxxxx>, "chrony-dev-request@xxxxxxxxxxxxxxxxxxxx" <chrony-dev-request@xxxxxxxxxxxxxxxxxxxx>
- Subject: [chrony-users] RE: Chrony is making NTP Server offline on the system boot
- From: "Iqbal Singh Aulakh -X (iaulakh - HCL AMERICA INC at Cisco)" <iaulakh@xxxxxxxxx>
- Date: Sun, 8 Jan 2023 08:35:42 +0000
- Accept-language: en-US
- Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=cisco.com; dmarc=pass action=none header.from=cisco.com; dkim=pass header.d=cisco.com; arc=none
- Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=v2fJurqosDuHC91HOdgF6hJ6N/5BOZGm+7QXQroeD1w=; b=IlOYcvoh+TlbqONBxKONO5IZGlbe9flSCuWd+WaAVWuFNyNi5AfW02oEnIaJTfwE9EjJ15nfRrdZEGIuWmGM1yGzH2JbHH40zhBzpSyhcKRi4Gml3Pm1qeUKacpFXX9lYs2MJ/toqFA3gGnsb6O3CNJika9MJIhLUaiStcrHNkhymPr1QJLDS7X9E671EYih1qwiEOm31Fq9iy+X0Auo+QfpDF/Q0n3tHceguyq+ksks6/pgpFRtyUcEo4gyWFnwMS0bqjEWpca4NzP9vQ1DBNTZBik4RaL2T3KbrIki6MiN6RlJW0kMNuAiwwwD7JERe+oslNOT+mniIetqip5yIg==
- Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=QXDtooy2/XB8sTOKUAdaZLnLRaIrBS/pVzvQKFQjRbfy00J8t8V7YCCqSd5rZ4ELD5EF3h/HFj3ePIp6JBOXWx8160/LCVRuYgKeSkFrbNGQhgla3jEdirPB1mgvhIo6lGqraWJiiq10vOWcRj+zHxH9GTbqMUvTxHiwKZ40/Cpe95Z7iXI46/FKJUjEsbh5eDMLe/LYmazopfBrLcFmFZvLWACCDisoKOu+opWcHMKJs/EFpt9EaP/e9KxQNmF5CzV9G0qzR88JypOXH7KZ0Xcq4yQLg7Zt8QfOSWLTkcKPKcEnJYld1/vvfhdpAnv0u8FFdnzP/rLHfNnMdY1NlA==
- Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=cisco.com;
- Cc: "Kiran Kumar Pamula -X (kpamula - HCL TECHNOLOGIES LIMITED at Cisco)" <kpamula@xxxxxxxxx>
- Dkim-signature: v=1; a=rsa-sha256; c=relaxed/simple; d=cisco.com; i=@cisco.com; l=12027; q=dns/txt; s=iport; t=1673166953; x=1674376553; h=from:to:cc:subject:date:message-id:references: in-reply-to:content-transfer-encoding:mime-version; bh=poUR0nGFKFwWU78oYPepAycr/PQw1ndb1IKLLXaFpqI=; b=AEV8MaBewbuaqn26+bd43HTiqn4J+7R9YyMddkFOWOLRaKvUHZtBssVe XzV2eGp5SHkGZV4KcAmm28rzHm9U6Of/KqFZc0Ytv1vEjbNPA6LPruwBT nigRH8241t0BubdurBorqzFooD9Jz46eLDIEEhGmOepIBiGF8fdd3pZlK w=;
- Dkim-signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cisco.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=v2fJurqosDuHC91HOdgF6hJ6N/5BOZGm+7QXQroeD1w=; b=R6z5gMuNvEJQdYyTpmtk2Kl+tQ//LwJN2lkyzHw2X5+5njwutHCSjQO38FygiKoHIQokDpmBKO+RIfk1VMmm9S2Hr1tm0JRAeIt9NdXp18U/1kEJRSwsr++viXoUE4aFo8HvP+0qinrxJQ5rYkJA/deaQtFr71cQ59Q8LNdV1lI=
- Ironport-data: A9a23:fq8/EKu/dBqxVSTvhalgaDZcfefnVEteMUV32f8akzHdYApBsoF/q tZmKTyEaPjYM2f3eNgiOdiwox8Bu5GHydEwHAM9+3oxHi0agMeUXt7xwmUckM+xwmwvaGo9s q3yv/GZdJhcokf0/0vrav67xZVF/fngqoDUUIYoAQgsA146IMsdoUg7wbVh29Y32YPR7z6l4 LseneWOYDdJ5BYsWo4kw/rrRMRH5amaVJsw5zTSVNgT1LPsvyB94KE3ecldG0DFrrx8RYZWc QpsIIaRpQs19z91Yj+sfy2SnkciGtY+NiDW4pZatjTLbhVq/kQPPqgH2PU0Rh5v0mW2sfdIw f5Ouq7sGBogBYTzobFIO/VYO3kW0axu4rTLJz20ttaeiheAeHr3yPIoB0YzVWEa0r8oWicVq rpJc3ZUM0vra+GemNpXTsFtj9knJcrrFIgeoXpnizreCJ7KRLiZGv2RuI8Gg1/cgOhoRtT6e fEUTQNRNhTiZDJ/MXI0BashybLAan7XKm0E9w39SbAMy2PT1gt3wf3hPcSQYNGTWMhOgm6Tp njB+XW/HgwKOtWayD6UtHW2iYfnliPhX4QDUaCj++RxqFaT3GsVGVsKSEu1qPy4jl/4XMhQQ 3H44QI0pqQ0sUesVNS4BVuzoWWPuVgXXN84//AGBB+l5fPsulamKE89RxVCZ+565dI/fhUV7 wrc9z/2PgBHvLqQQHOb076bqzKuJCQYRVPugwdZFGPpBPG+/OkOYgLzosVLS/Hq04Wkcd3k6 3Xb83hk3uR7Ydsjjv3jpTj6bySQSo8lpzPZCy3NVW6jqwh+foPgPcqj6EPQ6rBLK4Pxori9U JosxpH2AAMmVMHleMmxrAMlR+rBCxGta2e0vLKXN8N9nwlBAlb6FWyq3BlwJV1yLuEPciLzb UnYtGt5vcEMZyLzM/UtMtrtVKzGKJQM8/y4Cpg4ifITPPBMmPOvp0mCmGbJhTm2yRhw+U3BE c7ALJnE4Ykm5VRPlWrqGLh1PU4DzSElzmSbXoHg0xmiytKjiI29F9843K+1Rrlhtsus+VyNm /4Gbpfi40sECoXWPHKImbP/2HhXdxDX87is9ZwOHgNCSyI7cFwc5wj5mO5wI9E9w/kKzo8lP BiVAydl9bY2vlWfQS3iV5ypQOqHsUpXxZ7jARERAA==
- Ironport-hdrordr: A9a23:VFIvzqvBmHK1mlXChkfB0sxw7skCwoMji2hC6mlwRA09TyXGra 6TdaUguiMc1gx8ZJh5o6H9BEGBKUmskaKdkrNhQotKOzOW9FdATbsSoLcKpgeAJ8SQzJ8k6U 4NSdkdNDS0NykGsS+Y2nj2Lz9D+qj9zEnAv463pB0BLXAIV0gj1XYCNu/xKDwQeOAyP+tBKH Pq3Lsgm9PPQwVzUu2LQl0+G8TTrdzCk5zrJTQcAQQ81QWIhTS0rJbnDhmxxH4lIn1y6IZn1V KAvx3y562lvf3+4ATbzXXv45Nfn8ak4sdfBfaLltMeJlzX+0eVjcVaKv2/VQIO0aOSAWUR4Z zxStAbToBOAkbqDyKISN3Wqk7dOXgVmjnfIBSj8AXeSITCNUMH4ox69Ntkmt+z0Tt6gDm6u5 g7h15x/qAnfS/ojWDz4cPFWAptkVfxqX0+kfQLh3gaSocGbqRNxLZvtH+9Pa1wah4S0rpXWd VGHYXZ/rJbYFmaZ3fWsi1mx8GtRG06GlODTlIZssKY3jBKlDQhpnFojvA3jzMF7tYwWpNE7+ PLPuBhk6xPVNYfaeZ4CP0aScW6B2TRSVbHMX6UI17gCKYbUki94KLf8fEw/qWnaZYIxJw9lN DIV05Zr3c7fwb0BciHzPRwg2fwqaWGLEDQI+1llu1EU+fHNcnW2AW4OSITr/c=
- Ironport-phdr: A9a23:UtCyox+oGXT7QP9uWCXoyV9kXcBvk7n3PwtA7J0hhvoOd6m45J3tM QTZ4ukll17GW4jXqpcmw+rbuqztQyoMtJCGtn1RfJlFTRRQj8IQkkQpC9KEDkuuKvnsYmQ6E c1OWUUj8Wu8NB1eGd31YBvZpXjhhQM=
- Thread-index: AdkjOjFlG38OYXvTTomv4DB5Cs768wAAdx3g
- Thread-topic: Chrony is making NTP Server offline on the system boot
I'm still digging through this issue, but i wanted to share what I've found so far looking at the journal log when chrony debug was enabled:
With Chrony debug enabled
In the journal log you can see when systemd starts the NTP client/server, its after this that the NetworkManager dispatcher would run the 20-chrony script:
Oct 28 05:09:46 gsvlfl4plb02va systemd[1]: Starting NTP client/server...
....
Oct 28 05:09:46 gsvlfl4plb02va systemd[1]: Started NTP client/server.
I can see from the dmesg log the eth1 port is up before the NTP client/server is started:
demsg
[Fri Oct 28 05:09:45 2022] vmxnet3 0000:05:00.0 eth1: NIC Link is Up 10000 Mbps
[Fri Oct 28 05:09:45 2022] vmxnet3 0000:0b:00.0: # of Tx queues : 8, # of Rx queues : 8
So at the point 20-chrony would have run the eth1 link was up so 20-chrony should not have taken the NTP source offline
At this point we can see chronyd continually retrying to connect to both NTP sources, which is what is expected but for some reason it marks the NTP sources as offline:
Oct 28 05:09:50 gsvlfl4plb02va chronyd[925]: 2022-10-28T05:09:50Z ntp_io.c:316:(connect_socket) Could not connect NTP socket to 10.37.9.162:123 : Network is unreachable
Oct 28 05:09:50 gsvlfl4plb02va chronyd[925]: 2022-10-28T05:09:50Z ntp_io.c:801:(NIO_SendPacket) No socket to send to 10.37.9.162:123
Oct 28 05:09:51 gsvlfl4plb02va chronyd[925]: 2022-10-28T05:09:51Z ntp_core.c:1176:(transmit_timeout) Transmit timeout for [172.27.1.96:123]
Oct 28 05:09:51 gsvlfl4plb02va chronyd[925]: 2022-10-28T05:09:51Z ntp_io.c:316:(connect_socket) Could not connect NTP socket to 172.27.1.96:123 : Network is unreachable
Oct 28 05:09:51 gsvlfl4plb02va chronyd[925]: 2022-10-28T05:09:51Z ntp_io.c:801:(NIO_SendPacket) No socket to send to 172.27.1.96:123
Oct 28 05:09:52 gsvlfl4plb02va chronyd[925]: 2022-10-28T05:09:52Z cmdmon.c:1330:(read_from_cmd_socket) Received 28 bytes from /var/run/chrony/chronyc.2173.sock fd 6
Oct 28 05:09:52 gsvlfl4plb02va chronyd[925]: 2022-10-28T05:09:52Z ntp_io.c:316:(connect_socket) Could not connect NTP socket to 10.37.9.162:123 : Network is unreachable
Oct 28 05:09:52 gsvlfl4plb02va chronyd[925]: 2022-10-28T05:09:52Z ntp_core.c:2357:(NCR_SetConnectivity) Source 10.37.9.162 offline
Oct 28 05:09:52 gsvlfl4plb02va chronyd[925]: 2022-10-28T05:09:52Z ntp_io.c:316:(connect_socket) Could not connect NTP socket to 172.27.1.96:123 : Network is unreachable
Oct 28 05:09:52 gsvlfl4plb02va chronyd[925]: 2022-10-28T05:09:52Z ntp_core.c:2357:(NCR_SetConnectivity) Source 172.27.1.96 offline
In normal working sites chrony keeps retrying to connect to the source and doesn't mark the source offline. In fact I can see in the logs after the NTP source is marked offline chrony still trys to connect to the source and eventually the source responds:
Oct 28 05:09:59 gsvlfl4plb02va chronyd[925]: 2022-10-28T05:09:59Z ntp_io.c:906:(NIO_SendPacket) Sent 48 bytes to 172.27.1.96:123 from [UNSPEC] fd 7
Oct 28 05:09:59 gsvlfl4plb02va chronyd[925]: 2022-10-28T05:09:59Z ntp_io.c:704:(process_message) Received 48 bytes from 172.27.1.96:123 to 107.76.43.34 fd=7 if=3 tss=1 delay=0.000046627
Oct 28 05:09:59 gsvlfl4plb02va chronyd[925]: 2022-10-28T05:09:59Z ntp_core.c:1803:(receive_packet) NTP packet lvm=44 stratum=1 poll=6 prec=-25 root_delay=0.000000 root_disp=0.001617 refid=414c4e31
Oct 28 05:09:59 gsvlfl4plb02va chronyd[925]: 2022-10-28T05:09:59Z ntp_core.c:1808:(receive_packet) reference=1666933783.133786344 origin=2732877249.633040371 receive=1666933799.173438300 transmit=1666933799.173469351
Oct 28 05:09:59 gsvlfl4plb02va chronyd[925]: 2022-10-28T05:09:59Z ntp_core.c:1811:(receive_packet) offset=-0.046388364 delay=0.030132085 dispersion=0.000060 root_delay=0.030132 root_dispersion=0.001678
Oct 28 05:09:59 gsvlfl4plb02va chronyd[925]: 2022-10-28T05:09:59Z ntp_core.c:1814:(receive_packet) remote_interval=0.000031051 local_interval=0.030163136 response_time=0.000031051 txs=D rxs=K
Oct 28 05:09:59 gsvlfl4plb02va chronyd[925]: 2022-10-28T05:09:59Z ntp_core.c:1819:(receive_packet) test123=111 test567=111 testABCD=1111 kod_rate=0 interleaved=0 presend=0 valid=1 good=1 updated=2
Oct 28 05:09:59 gsvlfl4plb02va chronyd[925]: 2022-10-28T05:09:59Z sources.c:342:(SRC_AccumulateSample) ip=[172.27.1.96] t=1666933799.219842189 ofs=0.046388 del=0.030132 disp=0.001678 str=1
Oct 28 05:09:59 gsvlfl4plb02va chronyd[925]: 2022-10-28T05:09:59Z sourcestats.c:699:(SST_GetSelectionData) n=1 off=0.046388 dist=0.016774 sd=4..000000 first_ago=0.015128 last_ago=0.015128 selok=0
Oct 28 05:09:59 gsvlfl4plb02va chronyd[925]: 2022-10-28T05:09:59Z sources.c:775:(SRC_SelectSource) badstat=2 sel=0 badstat_reach=1 sel_reach=0 size=0 max_reach_ago=0.000000
This may be why we see the sources marked as online when "chronyc activity" was run...but why it still continued to try the sources after they are marked offline i'm not sure
Since I could see the source file and function where the offline was done i went to the chrony code to see what it was doing. Below are snippets from the code:
ntp_core.c (chrony/ntp_core.c at 3.5-stable * mlichvar/chrony * GitHub)
int auto_offline; /* If 1, automatically go offline when requests
cannot be sent */
/* ================================================== /
/ Timeout handler for transmitting to a source. */
static void
transmit_timeout(void *arg)
{
NCR_Instance inst = (NCR_Instance) arg;
....
/* With auto_offline take the source offline if sending failed */
if (!sent && inst->auto_offline) <------ auto_offline must have be set since we enter NCR_SeConnectivity() function
NCR_SetConnectivity(inst, SRC_OFFLINE);
....
}
void
NCR_SetConnectivity(NCR_Instance inst, SRC_Connectivity connectivity)
{
....
case SRC_OFFLINE:
switch (inst->opmode) {
case MD_ONLINE:
LOG(LOGS_INFO, "Source %s offline", s);
take_offline(inst);
break;
case MD_OFFLINE:
break;
case MD_BURST_WAS_ONLINE:
inst->opmode = MD_BURST_WAS_OFFLINE;
LOG(LOGS_INFO, "Source %s offline", s); <----- This is line 2357 seen in the journal log above
break;
case MD_BURST_WAS_OFFLINE:
break;
default:
assert(0);
}
break;
default:
assert(0);
}
It appears somehow the auto_offline is getting set to one even though it is not configured in the chrony.conf. My guess is there's a bug somewhere in the chrony code that's setting the auto_offline to 1. I'm still digging through the code to see what i can find
--
Thanks & Regards,
Iqbal Singh
-----Original Message-----
From: Iqbal Singh Aulakh -X (iaulakh - HCL AMERICA INC at Cisco) <iaulakh@xxxxxxxxx>
Sent: January 8, 2023 12:34 AM
To: chrony-users@xxxxxxxxxxxxxxxxxxxx; chrony-dev-request@xxxxxxxxxxxxxxxxxxxx
Cc: Kiran Kumar Pamula -X (kpamula - HCL TECHNOLOGIES LIMITED at Cisco) <kpamula@xxxxxxxxx>
Subject: [chrony-users] Chrony is making NTP Server offline on the system boot
Hello
We are having issue with chrony and it's making ntp server offline only on system boot.
Need hand and feet support in this regard.
Issue: Chrony Marking NTP servers as offline Upon reboot of the nodes.
We do not see anything in the message log or puppet logs.
Customers' Multiple Data centers/Sites got affected by this issue.
Due to the marking of NTP offline, the clock went out of sync, resulting in a loss of traffic and causing a nationwide outage.
Output from some of the affected nodes.
This is not limited to CentOS 7, CentOS 8 or Alma Linux and happened in multiple environments.
The last reported issue was in CentOS 7
Currently, we captured the data on CentOS; they are the affected nodes in the customer's setup.
Note: In our implementation, we are not managing network interfaces with Network Manager and Looked into the following.
And it did not make any difference.
Another Note: Some nodes are affected in Multiple Data Centers, and the rest are fine.
In affected nodes, it's reproducible upon reboot and where we have not seen this issue, even in the same env on different nodes.
it's not happening over there.
We tried the following link from red hat, but it was also not applicable.
https://access.redhat.com/solutions/3968471
We made changes to our puppet to bring up interfaces only after network interfaces are up but it didn't make any difference.
Here is the output
[qns@pcc01 ~]$ uname -a
Linux pcc01 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[qns@pcc01 ~]$ cat /etc/*-release
CentOS Linux release 7.6.1810 (Core)
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL=https://www.centos.org/
BUG_REPORT_URL=https://bugs.centos.org/
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
CentOS Linux release 7.6.1810 (Core)
CentOS Linux release 7.6.1810 (Core)
[qns@pcc01 ~]$ rpm -qa | sort -i | grep -i '(chrony|ntp)'
chrony-3.2-2.el7.x86_64
fontpackages-filesystem-1.44-8.el7.noarch
ntp-4.2.6p5-28.el7.centos.x86_64
ntpdate-4.2.6p5-28.el7.centos.x86_64
[qns@pcc01 ~]$ ps auxwwwf | grep -i '(chrony|ntp)'
ntp 30958 0.0 0.0 25720 1896 ? Ss 2021 0:50 /usr/sbin/ntpd -u ntp:ntp -g qns 3827 0.0 0.0 112708 1012 pts/0 S+ 19:30 0:00 _ grep --color=auto -i (chrony|ntp)
[qns@pcc01 ~]$ date
Thu Sep 1 19:30:04 UTC 2022
[qns@pps02 ~]$ chronyc sources
210 Number of sources = 2
MS Name/IP address Stratum Poll Reach LastRx Last sample ^? lb01 2 10 0 44d +3581ms[+3581ms] +/- 16.0s ^? lb02 3 10 0 47d +46us[ +46us] +/- 15.9s
[qns@pps02 ~]$ chronyc tracking
Reference ID : AC1AF20C (lb02)
Stratum : 4
Ref time (UTC) : Thu Jun 09 10:20:05 2022 System time : 0.000000153 seconds slow of NTP time Last offset : -0.000002489 seconds RMS offset : 0.187102452 seconds Frequency : 17.639 ppm slow Residual freq : -0.000 ppm Skew : 0.002 ppm Root delay : 0.038741432 seconds Root dispersion : 7.168707848 seconds Update interval : 1031.8 seconds Leap status : Normal
[qns@pps02 ~]$ chronyc sourcestats
210 Number of sources = 2
Name/IP Address NP NR Span Frequency Freq Skew Offset Std Dev
lb01 18 11 92m +6.918 0.022 +30.4s 42us
lb02 64 33 18h +0.000 0.000 +885us 18us
[qns@lb01 log]$ chronyc sources
210 Number of sources = 2
MS Name/IP address Stratum Poll Reach LastRx Last sample
^* NTP-server1.> 1 6 0 50d -25us[ -161us] +/- 17ms
^- NTP-server2.> 2 6 0 50d +685us[ +685us] +/- 34ms
[qns@lb01
log]$ chronyc tracking
Reference ID : 9BAED650 (NTP-server1.com) Stratum : 2 Ref time (UTC) : Wed Jun 08 06:58:03 2022 System time : 0.000000028 seconds slow of NTP time Last offset : -0.000135603 seconds RMS offset : 0.000135603 seconds Frequency : 15.377 ppm slow Residual freq : -31.511 ppm Skew : 0.026 ppm Root delay : 0.030828953 seconds Root dispersion : 142.074813843 seconds Update interval : 2.1 seconds Leap status : Normal
[qns@lb01 log]$ chronyc sourcestats
210 Number of sources = 2
Name/IP Address NP NR Span Frequency Freq Skew Offset Std Dev NTP-server1.> 4 4 6 -31.511 1011.791 -137.6s 149us NTP-server2.> 4 4 6 -179.900 1376.575 -785.5s 152us
Thanks
Iqbal Singh
--
To unsubscribe email chrony-users-request@xxxxxxxxxxxxxxxxxxxx
with "unsubscribe" in the subject.
For help email chrony-users-request@xxxxxxxxxxxxxxxxxxxx
with "help" in the subject.
Trouble? Email listmaster@xxxxxxxxxxxxxxxxxxxx.
--
To unsubscribe email chrony-users-request@xxxxxxxxxxxxxxxxxxxx
with "unsubscribe" in the subject.
For help email chrony-users-request@xxxxxxxxxxxxxxxxxxxx
with "help" in the subject.
Trouble? Email listmaster@xxxxxxxxxxxxxxxxxxxx.