Re: [chrony-users] Resume from suspend and default makestep configuration

On 19/05/2020 at 15:11, Pali Rohár wrote:
> On Tuesday 19 May 2020 12:42:28 FUSTE Emmanuel wrote:
>> On 19/05/2020 at 13:30, Pali Rohár wrote:
>>> On Tuesday 19 May 2020 11:10:01 FUSTE Emmanuel wrote:
>>>> On 19/05/2020 at 12:29, Pali Rohár wrote:
>>>>> On Monday 18 May 2020 13:45:04 FUSTE Emmanuel wrote:
>>>>>> On 18/05/2020 at 13:15, Pali Rohár wrote:
>>>>>>> On Monday 18 May 2020 10:45:02 FUSTE Emmanuel wrote:
>>>>>>>> Hello Pali,
>>>>>>>>
>>>>>>>> On 18/05/2020 at 12:37, Pali Rohár wrote:
>>>>>>>>> The main problem is when the system is put into a suspend or
>>>>>>>>> hibernate state.
>>>>>>>>>
>>>>>>>>> In my opinion, resuming from the suspend / hibernate state should be
>>>>>>>>> handled in the same way as (re)starting chronyd. You do not know what
>>>>>>>>> may have happened during sleep.
>>>>>>>> Yes, and if a workaround is needed, it should be done at the system
>>>>>>>> level, not in chrony.
>>>>>>>> A job for systemd.
>>>>>>> Hello! Sorry for a stupid question, but what does systemd have to do
>>>>>>> with chronyd? Why should systemd care about chronyd time synchronization?
>>>>>> Nothing.
>>>>>> But it is up to your "process manager", be it systemd, a pile of
>>>>>> sysvinit scripts or whatever, to restart or notify chrony; it has to do
>>>>>> housekeeping for other things anyway when you suspend/resume.
>>>>> Hm... I remember that in the past it was necessary to blacklist broken
>>>>> daemons, software and kernel modules which did not work correctly across
>>>>> the S3 or hibernate state. It was in some pm-utils scripts...
>>>>>
>>>>> But I thought that those days were over and that software can deal with
>>>>> the fact that a machine may be put into a suspend or hibernate state.
>>>>>
>>>>> So what you are suggesting is to put the chronyd daemon into the list of
>>>>> broken software (which needs to be stopped prior to suspend / resume)?
>>>>>
>>>>> It does not make sense to me, as the immediate step after putting
>>>>> software or a kernel module into such a "blacklist" was to inform the
>>>>> upstream authors of that daemon or kernel module that it is broken /
>>>>> incompatible with the suspend state and should be fixed.
>>>>>
>>>>> That "blacklist" was just workaround for buggy software and not
>>>>> permanent solution.
>>>> No, not chrony, but the machine which changes the RTC behind your back: a buggy BIOS.
>>> Sorry, but I have not caught this line. The blacklist contained a list of
>>> buggy software, daemons and kernel modules which (in the past) had to be
>>> stopped / unloaded before the system went to S3 and started / (re)loaded
>>> after the system resumed. So obviously putting a "buggy BIOS" into the
>>> blacklist not only makes no sense, it also would do nothing. In that
>>> particular case chronyd would have to be put into that blacklist of buggy
>>> software, since by your description it is chronyd which needs to be
>>> stopped / started... But as I said, this was used in the past when there
>>> were buggy software and kernel modules which were not able to handle the
>>> S3 state correctly.
>> I said the machine, not chrony.
>> Please, I'm not a native English speaker, but this conversation is becoming
>> more and more like a trolling one.
>> A blacklist is a black list; it is a generic term, as you point out.
> Sorry for that. Let's just call it a list. If you want to somehow use the
> machine in that list, then you probably need a tuple <software, machine>
> and to teach the surrounding scripts to read that list as tuples and
> restart "software" if "machine" matches the identification of the current
> machine on which it is running.
Yes, and the software in this case is "software that provides time sync".
>
> I'm saying that in the past this was just a list of "buggy" software and
> kernel modules which needed to be restarted during S3. It was not some
> smart structure where you were able to define rules like "if you are
> running on machine ABC then restart software CDE". And this, I guess, is
> what you want to achieve by putting a machine on the list.
>
>>>>>> Exactly as NetworkManager, ifupdown scripts and systemd-networkd
>>>>>> reload/restart some network services when interfaces/tunnels/VPNs are
>>>>>> brought up/down.
>>>>> This is something totally different. All those mentioned "services" are
>>>>> just an independent part of the system which manages network connections.
>>>>>
>>>>> chronyd is there to manage time synchronization.
>>>> It was a figurative comparison for event-driven config changes.
>>>> In the suspend vs. time case, the event is only known to and should be
>>>> managed by your init system, not by your time daemon.
>>>>
>>>>>>>>> And as I pointed out, there are existing problems where UEFI/BIOS
>>>>>>>>> firmware changes the RTC clock without good reason, which results in
>>>>>>>>> a completely wrong system clock.
>>>>>>>>>
>>>>>>>> That could well be identified by a blacklist at the udev/systemd level
>>>>>>>> for applying (or not) the workaround (restart chrony or launch a
>>>>>>>> chronyc command at resume).
>>>>>>> Could you describe in detail what you mean by a blacklist? Which udev
>>>>>>> blacklist do you mean and what should be put into that blacklist? I have
>>>>>>> not caught this part.
>>>>>> Faulty systems could be identified by DMI/ACPI strings and a quirk applied.
>>>>> And what is the faulty system?
>>>> Quoting yourself:
>>>>
>>>> "as I pointed there are existing problems that UEFI/BIOS firmware
>>>> changes RTC clock without good reason"
>>> Ok. The main problem is that there is no way to identify such broken
>>> firmware. So the definition is now nice and clear but basically useless,
>>> as it does not say anything about how to find or identify such faulty systems.
>> Yes, that is the generic problem with faulty hw/devices/firmware: they are
>> faulty, but not on purpose.
>> The kernel is full of these lists. And they are built by hand from
>> user/developer feedback, but you know that in the BT world too, don't you?
> Yes, but in most cases you can identify faulty devices, e.g. by seeing
> that they do something unexpected or react wrongly in some situation. Or
> you can find reports about the same issue from other people.
>
> But this is something different. If a machine's fault is DST related, it
> happens only twice a year (if you are lucky). And it is hard to identify
> or catch this problem in action. Plus, timing issues and race conditions
> which happen at a specific time are really hard to trigger or reproduce.
>
> I have already searched and have not found anybody who has started
> creating a list of "problematic machines" with this particular issue.
>
>>>>> I think this is something general and not related to a particular machine.
>>>>> I guess that under specific conditions it may happen on any system.
>>>>>
>>>>>> See for example /lib/udev/hwdb.d/60-sensor.hwdb for some laptop sensors.
>>>>>> We could add an attribute to the RTC if it matches some vendor/BIOS
>>>>>> version/model etc., to put in the hwdb (the blacklist).
>>>>>> A udev rule would assign this attribute to the RTC if you are running on
>>>>>> a known buggy system.
>>>>>> A script could then do anything you want at suspend/resume time in
>>>>>> /lib/systemd/system-sleep if your RTC has the offending attribute (see
>>>>>> the systemd-sleep man page).
>>>>>> Or better, a unit run at resume time could do anything too.
>>>>>> The hwdb abstraction is not needed if it is a local hack; otherwise it
>>>>>> should be properly defined with the hwdb/udev/systemd developers.
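For illustration only, a minimal sketch of such a system-sleep hook. The RTC_QUIRK_STEP property is hypothetical (it would have to be assigned by a hwdb entry and udev rule as described above), and it assumes chronyd is running and reachable via chronyc; nothing here is an existing, shipped mechanism:

    #!/bin/sh
    # /lib/systemd/system-sleep/chrony-rtc-quirk  (sketch, not a shipped script)
    # systemd-sleep calls hooks with "pre"/"post" plus the sleep operation.
    case "$1" in
        post)
            # Hypothetical property a udev rule would set on known-buggy systems,
            # e.g. from a hwdb match on the DMI vendor/product strings.
            if udevadm info -q property -n /dev/rtc0 | grep -q '^RTC_QUIRK_STEP=1'; then
                # Ask chronyd for fresh measurements, then step the clock if needed.
                chronyc burst 4/4
                sleep 10
                chronyc makestep
            fi
            ;;
    esac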
>>>>> This database is for describing hardware differences or issues.
>>>>>
>>>>> But the above problem with time synchronization is general and hardware
>>>>> independent. You can simulate the same issue on your machine.
>>>>>
>>>>> Just put your computer into hibernation. Then boot some Linux
>>>>> distribution from a live USB and change the RTC time. Turn the live USB
>>>>> system off and boot your hibernated system. You should then be in the
>>>>> same situation as I described.
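For example, from the live USB system the RTC can be shifted with something like the following (the date is arbitrary, chosen only to create an offset):

    # Run from the live USB system while the main OS is hibernated.
    hwclock --set --date '2020-05-19 00:00:00'   # write an arbitrary time to the RTC
    hwclock --show                               # verify what the RTC now reports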
>>>> Yes, but this is like shooting yourself in the foot.
>>> This is just a test case, so you can check, "simulate" and reproduce
>>> this issue even without a "faulty machine".
>> OK
>>> Moreover, Windows systems used to store the RTC in local time and Linux
>>> systems in UTC. I do not know if this still applies, but basically
>>> multi-OS machines are affected by the same issue.
>>>
>>>> If you want to be robust in this case and all others, then by default
>>>> you must restart ANY time sync daemon in the resume callback of your
>>>> init system, be it ntpd or chrony, systemd or sysvinit or upstart or
>>>> anything else. But it is problematic, as Miroslav pointed out, because you
>>>> potentially start to trust any anonymous time source more than your own RTC.
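With systemd, such a resume callback could look roughly like the following unit. The unit name is made up for illustration, and whether restarting chronyd (rather than only notifying it) is desirable is exactly what is being debated here:

    # /etc/systemd/system/chronyd-resume.service  (sketch)
    [Unit]
    Description=Restart chronyd after resume
    After=suspend.target hibernate.target hybrid-sleep.target

    [Service]
    Type=oneshot
    ExecStart=/usr/bin/systemctl try-restart chronyd.service

    [Install]
    WantedBy=suspend.target hibernate.target hybrid-sleep.target

It would be enabled once with "systemctl enable chronyd-resume.service" and then run automatically after each resume.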
>>> What is problematic here? Your RTC may also be shifted, as I pointed out,
>>> so it deserves the same trust as any other anonymous source.
>>>
>>> Also, what is the difference between trusting those "anonymous time
>>> sources" at chronyd startup time and at the time when resuming your
>>> machine from suspend / hibernate?
>>>
>>> For me it does not make sense to say that an "anonymous time source" is
>>> fully trusted when starting chronyd at computer startup time, but the same
>>> "anonymous time source" is untrusted when the computer resumes from
>>> hibernation.
>> It is a tradeoff, and it is already bad. Ntpd startup scripts used to not
>> call ntpdate and to bail out in case of too big a discrepancy (2 s or 3 s,
>> from memory). That is considered too user-unfriendly without proper UI
>> interaction.
>> But doing that at boot time only is better than doing it at boot time AND
>> at resume time.
> And why is it better? I still do not see it. In both cases you are
> starting the machine from an "unknown" RTC state, and in both cases you do
> not know what the correct UTC time is. You do not even know how long you
> have been asleep (or powered off).
>
> Also, when resuming from hibernation you may have been completely powered
> off, and the memory of the system may have been modified. Plus a multi-OS
> scenario may apply, e.g. an ordinary user just "booted" Windows, then
> turned it off and resumed Linux from hibernation. I guess we would agree
> that an ordinary user does not use any virtualisation as you describe
> below.
>
>> And better than trusting any source at any time.
>>>> The current makestep default is a sane default for the majority of sane
>>>> machines with standard use cases.
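For reference, chrony's makestep directive takes a threshold (in seconds) and a limit on the number of clock updates it applies to. A commonly shipped default and a more aggressive variant sometimes suggested for suspend-prone machines look like this (exact values are illustrative):

    # Step the clock if the offset is larger than 1 second, but only during
    # the first 3 clock updates after chronyd starts (typical distro default).
    makestep 1 3

    # Allow stepping on any update, with no limit on the number of updates.
    #makestep 1 -1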
>>> So a multi-OS scenario is not standard (anymore)?
>> It never has been. It works, with some limitations and tradeoffs.
> By multi-OS I mean the classic dual-boot scenario, where only one OS is
> running at a time. But different OSes interpret the RTC clock in different
> ways, some as local time (and in different timezones at that) and some as
> UTC.
Yes, me too.
> In the past I often saw the problem that Windows stored the system time in
> the local timezone to the RTC, then the computer was rebooted to Linux,
> which read the system time from the RTC as UTC and saw an incorrect time.
> Installing an NTP daemon fixed this problem. And then after a reboot the
> Windows time was shifted, and after a few seconds/minutes it synchronized
> itself again against the Windows time server.
A better workaround: just instruct Linux that the RTC is in the local
timezone and not UTC, and it would have worked.
In those days, WTS was just a blind ntpdate equivalent (a one-step sync from
time to time).
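For completeness, on a systemd-based system that is typically done as shown below; this is only an illustration, since keeping the RTC in local time has its own well-known drawbacks:

    # Tell the system that the hardware clock is kept in local time, not UTC.
    timedatectl set-local-rtc 1

    # Verify the setting ("RTC in local TZ: yes" in the output).
    timedatectl status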
>
>> The RTC cannot be R/W shared, by its very nature: it is real time.
>> You cannot save its state and restore it later, unless you have special
>> HW/firmware that "virtualizes" it and is able to maintain per-OS state.
>> And the different historical direct/indirect usages of the RTC on PCs
>> complicate things to a dead end.
>> I would never entrust this task to a PC BIOS...
>> Time-wise, the only working multi-OS scenario is under a hypervisor,
>> because it can arbitrate access to the hardware and is the only one
>> messing with the real(time) hardware.
> Yes, in the case that multiple OSes are running, the only option is to
> provide a virtual RTC to the systems via a hypervisor and let the
> hypervisor ensure atomicity.
>
>> In this case, you synchronise your hypervisor with external sources and
>> provide a para-virtualised system clock or a virtual PTP clock to your
>> guests for system clock sync. If you have to run NTP, your hypervisor
>> must be your source, with whatever conf you want: it is a trustable source.
>> A virtual RTC is provided to your guest if it needs one.
>> That is the only standard/working multi-OS scenario from a timekeeping
>> point of view in the PC world.
>>
>> Emmanuel.
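As a concrete example of that setup, a KVM guest can use the paravirtualised PTP clock exposed by the host instead of (or in addition to) external NTP servers; a minimal sketch, assuming the ptp_kvm driver is available and registers as /dev/ptp0:

    # On the guest: load the KVM PTP clock driver (often auto-loaded).
    modprobe ptp_kvm

    # In the guest's chrony.conf: use the host's clock as the reference.
    refclock PHC /dev/ptp0 poll 2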

