Re: [chrony-dev] crash when overloading refclock_sock |
[ Thread Index |
Date Index
| More chrony.tuxfamily.org/chrony-dev Archives
]
- To: chrony-dev@xxxxxxxxxxxxxxxxxxxx
- Subject: Re: [chrony-dev] crash when overloading refclock_sock
- From: Nuno Gonçalves <nunojpg@xxxxxxxxx>
- Date: Fri, 7 Oct 2016 10:10:10 +0200
- Dkim-signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-transfer-encoding; bh=s1ed+V6lWd0Z8vYeKyFhwmNdc/uH5af+Y65kX+js/Jo=; b=C1XdIJ1E7JEyedY1hBXNxscUefHt2yO2o3tRZMq2i4SjsOYDiSCR/hV9yi9kCM4nb2 07WGRKBqZttj/hYkKcPM5PKsYtC26L/12NX/k0cAJgYL0RX9xZ9b4627F6HXJbsAz0WD CuirxdOFfu8rrPHOQJg7haEJ7DV+q8YoiWRvwssZTqI4xxFbIdpzVClfl9rgcIMbV4lW o8OTR/9x1Zwpr74RxoELUCw8FoJP4ubtVaVXT7QXzhE38xICXFA7+I6uvkrSsRaNOB55 SkaU7f2i55OAhB7l8bV3JoVKqPAR/6rKk2N0e75E/HtjsI1tQDRz3cEOhPNQD3LAWdiz lsDA==
It's a SOCK refclock. My client was also publishing to SHM at the same
time but I've now removed it to reduce confusion.
The samples are being produced artificially so values are always:
"true time" = current time truncated to seconds
"measurement time" = current time truncated to seconds - 123ms
I note that with the last patch, it takes so much longer to crash,
because most of the times the samples are being rejected:
2016-10-07T08:05:55Z refclock.c:520:(valid_sample_time) SOCK refclock
sample not valid age=0.508929 ts=1475827554.877000000
lastts=1475827554.877000000
2016-10-07T08:05:55Z refclock.c:520:(valid_sample_time) SOCK refclock
sample not valid age=0.508935 ts=1475827554.877000000
lastts=1475827554.877000000
2016-10-07T08:05:55Z refclock.c:520:(valid_sample_time) SOCK refclock
sample not valid age=0.512626 ts=1475827554.877000000
lastts=1475827554.877000000
2016-10-07T08:05:55Z refclock.c:520:(valid_sample_time) SOCK refclock
sample not valid age=0.512644 ts=1475827554.877000000
lastts=1475827554.877000000
But eventually the samples start to be accepted, and it crashes again.
Even tough the age continuously goes from 0 to 1s (as the samples are
artificial).
My current configuration is with noselect:
pool 2.debian.pool.ntp.org offline iburst
refclock SOCK /var/run/chrony.sock refid SOCK noselect
refclock SHM 0 poll 5 refid SHM noselect
Test code following:
/* compile with:
* clang++ -Wall -std=c++1z refclock_client.cpp
* g++ -Wall -std=c++1z refclock_client.cpp
*/
#include <chrono>
#include <iostream>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
using namespace std::chrono;
using namespace std::chrono_literals;
const char *SOCK_PATH = "/var/run/chrony.sock";
struct refclock_sock_struct {
struct timeval tv; // Time of the measurement (system time)
double
offset; // Offset between the true time and the system time (in seconds)
int pulse = 0; // Non-zero if the sample is from a PPS signal, i.e. another
// source is needed to obtain seconds
int leap = 0; // 0 - normal, 1 - insert leap second, 2 - delete leap second
int _pad; // Padding, ignored
int magic = 0x534f434b;
refclock_sock_struct(const seconds ref, const microseconds local) {
tv.tv_sec = duration_cast<seconds>(local).count();
tv.tv_usec = duration_cast<microseconds>(local).count() % 1'000'000;
offset = duration<float>(ref - local).count(); // refclock (true time) is
// ahead by this amount
}
};
bool publish_refclock_sock(const seconds ref, const microseconds local) {
const int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
if (fd < 0) {
std::cerr << "refclock_sock: socket error: " << fd << '\n';
return 0;
}
struct sockaddr_un addr;
memset(&addr, 0, sizeof(struct sockaddr_un));
addr.sun_family = AF_UNIX;
strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);
const refclock_sock_struct sample(ref, local);
const int rv = sendto(fd, &sample, sizeof(sample), 0,
(struct sockaddr *)&addr, sizeof(struct sockaddr_un));
close(fd);
return (rv >= 0);
}
int main() {
bool refclock_sock_publish = true;
bool refclock_sock_listening;
for (auto count = 1;; ++count) {
auto local = system_clock::now().time_since_epoch();
const seconds gps = duration_cast<seconds>(local);
local = duration_cast<seconds>(local) -
123ms; // simulate our systemclock is behind gps by
// truncate to last second time and subtracting 123ms.
if (refclock_sock_publish) {
refclock_sock_listening =
publish_refclock_sock(gps, duration_cast<microseconds>(local));
} else {
refclock_sock_listening = false;
}
std::cout << count << " SOCK [" << (int)refclock_sock_publish << ':'
<< (int)refclock_sock_listening << "]\n";
}
}
Thanks,
Nuno
On Fri, Oct 7, 2016 at 9:05 AM, Bill Unruh <unruh@xxxxxxxxxxxxxx> wrote:
> Did you really have 23 samples come at the same time to a nanosec? I have no
> idea how that is evan possible. Or did you artificially swamp the refclock
> shm?
>
>
> William G. Unruh __| Canadian Institute for|____ Tel: +1(604)822-3273
> Physics&Astronomy _|___ Advanced Research _|____ Fax: +1(604)822-5324
> UBC, Vancouver,BC _|_ Program in Cosmology |____ unruh@xxxxxxxxxxxxxx
> Canada V6T 1Z1 ____|____ and Gravity ______|_ www.theory.physics.ubc.ca/
>
>
> On Fri, 7 Oct 2016, Nuno Gonçalves wrote:
>
>> Hi,
>>
>> Still crashes:
>>
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 63 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 0 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 1 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 2 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 3 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 4 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 5 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 6 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 7 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 8 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 9 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 10 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 11 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 12 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 13 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 14 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 15 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 16 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 17 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 18 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 19 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z refclock.c:713:(filter_add_sample) filter sample
>> 20 t=1475822340.876999761 offset=0.123000145 dispersion=0.000000001
>> 2016-10-07T06:39:01Z sources.c:360:(SRC_AccumulateSample) ip=[SOCK]
>> t=1475822340.876999761 ofs=-0.123000 del=0.000000 disp=-nan str=0
>> 2016-10-07T06:39:01Z sourcestats.c:550:(SST_DoNewRegression) off=-nan
>> freq=-nan skew=-nan n=3 bs=0 runs=3 asym=0.000000 arun=0
>> chronyd: sourcestats.c:388: find_best_sample_index: Assertion
>> `best_index >= 0' failed.
>> Aborted (core dumped)
>>
>> Thanks,
>> Nuno
>>
>> On Thu, Oct 6, 2016 at 4:20 PM, Miroslav Lichvar <mlichvar@xxxxxxxxxx>
>> wrote:
>>>
>>> On Wed, Oct 05, 2016 at 10:56:04AM +0200, Nuno Gonçalves wrote:
>>>>
>>>> I was testing a refclock_sock provider and when the rate of messages
>>>> is really high (over 50k/s), I manage to crash chronyd.
>>>>
>>>> Yes, this is not the standard use case, usually I publish 1/s, but
>>>> anyway chrony ideally should resist to this abuse.
>>>>
>>>>
>>>> 2016-10-05T07:58:01Z refclock.c:707:(filter_add_sample) filter sample
>>>> 52 t=1475654280.876999978 offset=0.122999928 dispersion=0.000000001
>>>> 2016-10-05T07:58:01Z refclock.c:707:(filter_add_sample) filter sample
>>>> 53 t=1475654280.876999978 offset=0.122999928 dispersion=0.000000001
>>>> 2016-10-05T07:58:01Z refclock.c:707:(filter_add_sample) filter sample
>>>> 54 t=1475654280.876999978 offset=0.122999928 dispersion=0.000000001
>>>> 2016-10-05T07:58:01Z refclock.c:707:(filter_add_sample) filter sample
>>>> 55 t=1475654280.876999978 offset=0.122999928 dispersion=0.000000001
>>>> 2016-10-05T07:58:01Z sources.c:360:(SRC_AccumulateSample) ip=[SOCK]
>>>> t=1475654280.876999978 ofs=-0.123000 del=0.000000 disp=-nan str=0
>>>> chronyd: sourcestats.c:388: find_best_sample_index: Assertion
>>>> `best_index >= 0' failed.
>>>> Aborted (core dumped)
>>>
>>>
>>> Thanks for the bug report. It looks like the problem is that there are
>>> multiple samples with the same local timestamp, which breaks the
>>> calculation of the dispersion.
>>>
>>> Can you please pull from git and see if it's fixed?
>>>
>>> --
>>> Miroslav Lichvar
>>>
>>> --
>>> To unsubscribe email chrony-dev-request@xxxxxxxxxxxxxxxxxxxx with
>>> "unsubscribe" in the subject.
>>> For help email chrony-dev-request@xxxxxxxxxxxxxxxxxxxx with "help" in the
>>> subject.
>>> Trouble? Email listmaster@xxxxxxxxxxxxxxxxxxxx.
>>>
>>
>> --
>> To unsubscribe email chrony-dev-request@xxxxxxxxxxxxxxxxxxxx with
>> "unsubscribe" in the subject.
>> For help email chrony-dev-request@xxxxxxxxxxxxxxxxxxxx with "help" in the
>> subject.
>> Trouble? Email listmaster@xxxxxxxxxxxxxxxxxxxx.
--
To unsubscribe email chrony-dev-request@xxxxxxxxxxxxxxxxxxxx with "unsubscribe" in the subject.
For help email chrony-dev-request@xxxxxxxxxxxxxxxxxxxx with "help" in the subject.
Trouble? Email listmaster@xxxxxxxxxxxxxxxxxxxx.