Re: [chrony-users] Resume from suspend and default makestep configuration



On Tuesday 19 May 2020 13:40:18 FUSTE Emmanuel wrote:
> Le 19/05/2020 à 15:11, Pali Rohár a écrit :
> > On Tuesday 19 May 2020 12:42:28 FUSTE Emmanuel wrote:
> >> Le 19/05/2020 à 13:30, Pali Rohár a écrit :
> >>> On Tuesday 19 May 2020 11:10:01 FUSTE Emmanuel wrote:
> >>>> Le 19/05/2020 à 12:29, Pali Rohár a écrit :
> >>>>> On Monday 18 May 2020 13:45:04 FUSTE Emmanuel wrote:
> >>>>>> Le 18/05/2020 à 13:15, Pali Rohár a écrit :
> >>>>>>> On Monday 18 May 2020 10:45:02 FUSTE Emmanuel wrote:
> >>>>>>>> Hello Pali,
> >>>>>>>>
> >>>>>>>> Le 18/05/2020 à 12:37, Pali Rohár a écrit :
> >>>>>>>>> The main problem is when the system is put into the suspend or
> >>>>>>>>> hibernate state.
> >>>>>>>>>
> >>>>>>>>> In my opinion, resuming from the suspend / hibernate state should be
> >>>>>>>>> handled in the same way as (re)starting chronyd. You do not know
> >>>>>>>>> what may have happened during sleep.
> >>>>>>>> Yes, and if a workaround is needed, it should be done at the system
> >>>>>>>> level, not in chrony.
> >>>>>>>> A job for systemd.
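As a rough sketch of such a systemd-level hook (the unit name is hypothetical, and it assumes a running chronyd reachable via chronyc), resume could trigger a one-shot unit:

```ini
# /etc/systemd/system/chrony-resume.service -- hypothetical unit name
[Unit]
Description=Step the system clock with chrony after resume
After=suspend.target hibernate.target hybrid-sleep.target

[Service]
Type=oneshot
# Ask the running chronyd to correct the clock by stepping immediately
ExecStart=/usr/bin/chronyc makestep

[Install]
WantedBy=suspend.target hibernate.target hybrid-sleep.target
```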
> >>>>>>> Hello! Sorry for a stupid question, but what does systemd have to do
> >>>>>>> with chronyd? Why should systemd care about chronyd time synchronization?
> >>>>>> Nothing.
> >>>>>> But it is up to your "process manager", be it systemd, a sysvinit pile
> >>>>>> of scripts or whatever, to restart or notify chrony; it has to do
> >>>>>> housekeeping anyway for other things when you suspend/resume.
> >>>>> Hm... I remember that in the past it was necessary to blacklist broken
> >>>>> daemons, software and kernel modules which did not work correctly during
> >>>>> the S3 or hibernate state. It was in some pm-utils scripts...
> >>>>>
> >>>>> But I thought those days were over and software could deal with the
> >>>>> fact that a machine may be put into suspend or hibernate state.
> >>>>>
> >>>>> So what you are suggesting is to put the chronyd daemon on the list of
> >>>>> broken software (which needs to be stopped prior to suspend / resume)?
> >>>>>
> >>>>> It does not make sense to me, as the immediate step after putting
> >>>>> software or a kernel module on such a "blacklist" was to inform the
> >>>>> upstream authors of that daemon or kernel module that it is broken /
> >>>>> incompatible with the suspend state and should be fixed.
> >>>>>
> >>>>> That "blacklist" was just workaround for buggy software and not
> >>>>> permanent solution.
> >>>> No, not chrony, but the machine which changes the RTC behind your back: a buggy BIOS.
> >>> Sorry, but I did not catch this line. The blacklist contained buggy
> >>> software, daemons and kernel modules which had to be (in the past)
> >>> stopped / unloaded before the system went to S3 and started / (re)loaded
> >>> after the system resumed. So obviously putting a "buggy BIOS" on the
> >>> blacklist not only makes no sense, it would also do nothing. In that
> >>> particular case chronyd would have to be put on that blacklist of buggy
> >>> software, as it is, as you described, chronyd which needs to be stopped /
> >>> started... But as I said, this was used in the past, when there were
> >>> buggy software and kernel modules that were not able to correctly handle
> >>> the S3 state.
> >> I said the machine, not chrony.
> >> Please, I'm not a native English speaker, but this conversation is
> >> becoming more and more like a trolling one.
> >> A blacklist is a black list; it is a generic term, as you point out.
> > Sorry for that. Let's just call it a list. If you want to somehow include
> > the machine in that list, then you probably need a tuple
> > <software, machine> and to teach the surrounding scripts to read that list
> > as tuples and restart "software" if "machine" matches the identification
> > of the current machine it is running on.
> Yes, and the software in this case is "software that provides time sync".
> >
> > I'm saying that in the past this was just a list of "buggy" software and
> > kernel modules which needed to be restarted around S3. It was not some
> > smart structure where you were able to define rules like "if you are
> > running on machine ABC then restart software CDE". And this is, I guess,
> > what you want to achieve by putting the machine on the list.
> >
> >>>>>> Exactly as networkmanager, ifupdown scripts, systemd-networkd
> >>>>>> reload/restart some network services when interfaces/tunnels/vpn are
> >>>>>> upped/downed.
> >>>>> This is something totally different. All those mentioned "services" are
> >>>>> just independent parts of the system which manage network connections.
> >>>>>
> >>>>> chronyd is there to manage time synchronization.
> >>>> It was a figurative comparison for an event-driven config change.
> >>>> In the suspend vs. time case, the event is only known to, and should be
> >>>> managed by, your init system, not your time daemon.
> >>>>
> >>>>>>>>> And as I pointed out, there are existing problems where UEFI/BIOS
> >>>>>>>>> firmware changes the RTC clock without good reason, which results in
> >>>>>>>>> a completely wrong system clock.
> >>>>>>>>>
> >>>>>>>> This could well be identified by a blacklist at the udev/systemd
> >>>>>>>> level, used to decide whether to apply the workaround (restart chrony
> >>>>>>>> or launch a chronyc command at resume).
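A minimal sketch of such a resume workaround as a systemd-sleep hook (the script path and the choice of `chronyc makestep` are illustrative, not a recommendation):

```sh
#!/bin/sh
# /lib/systemd/system-sleep/chrony-quirk -- hypothetical hook script.
# systemd-sleep calls each hook twice, with "pre" or "post" as the
# first argument and the sleep action as the second.
case "$1" in
    post)
        # After resume: ask chronyd to step the clock instead of slewing
        chronyc makestep
        ;;
esac
```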
> >>>>>>> Could you describe in detail what you mean by blacklist? Which udev
> >>>>>>> blacklist do you mean, and what should be put into it? I did not
> >>>>>>> catch this part.
> >>>>>> Faulty systems could be identified by DMI/ACPI strings and a quirk applied.
> >>>>> And what is the faulty system?
> >>>> Citing yourself :
> >>>>
> >>>> "as I pointed there are existing problems that UEFI/BIOS firmware
> >>>> changes RTC clock without good reason"
> >>> Ok. The main problem is that there is no way to identify such broken
> >>> firmware. So the definition is nice and clear now, but basically useless,
> >>> as it does not say anything about how to find or identify such faulty systems.
> >> Yes, that is the generic problem of faulty hw/devices/firmware: they are
> >> faulty, but not on purpose.
> >> The kernel is full of these lists. And they are built by hand from
> >> users'/developers' feedback, but you know that from the BT world too, don't you?
> > Yes, but in most cases you can identify faulty devices, e.g. by seeing
> > that they do something unexpected or react wrongly in some situation, or
> > by finding reports about the same issue from other people.
> >
> > But this is something different. If the machine's fault is DST related, it
> > happens only twice a year (if you are lucky), and it is hard to identify
> > or catch this problem in action. Plus, timing issues and race conditions
> > which happen at specific times are really hard to trigger or reproduce.
> >
> > I have already searched and have not found anybody who has started
> > creating a list of "problematic machines" with this particular issue.
> >
> >>>>> I think this is something general and not related to particular machine.
> >>>>> I guess under specific conditions it may happen on any system.
> >>>>>
> >>>>>> See for example /lib/udev/hwdb.d/60-sensor.hwdb for some laptop sensors.
> >>>>>> We could add an attribute to the RTC if it matches some vendor/BIOS
> >>>>>> version/model etc... to put in the hwdb (the blacklist).
> >>>>>> A udev rule will assign this attribute to the RTC if you are running on
> >>>>>> a known buggy system.
> >>>>>> A script could do anything you want at suspend/resume time in
> >>>>>> /lib/systemd/system-sleep if your RTC has the offending attribute (see
> >>>>>> the systemd-sleep man page).
> >>>>>> Or better, a unit run at resume time could do anything too.
> >>>>>> The hwdb abstraction is not needed if it is a local hack; otherwise it
> >>>>>> should be properly defined with the hwdb/udev/systemd developers.
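A rough sketch of that hwdb approach, modeled on 60-sensor.hwdb (the match pattern, the property name and the vendor/model strings are all made up for illustration):

```
# /etc/udev/hwdb.d/61-rtc-quirk.hwdb -- hypothetical local entry
# Match the DMI modalias of a machine known to reset the RTC on resume
rtc:dmi:*svnExampleVendor*:pnExampleModel*
 ID_RTC_RESET_ON_RESUME=1

# /etc/udev/rules.d/61-rtc-quirk.rules -- look up the entry for RTC devices
SUBSYSTEM=="rtc", IMPORT{builtin}="hwdb 'rtc:$attr{[dmi/id]modalias}'"
```

A system-sleep script or resume unit could then check for `ID_RTC_RESET_ON_RESUME` (e.g. via `udevadm info`) before applying the workaround.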
> >>>>> This database is for describing hardware differences or issues.
> >>>>>
> >>>>> But the above problem with time synchronization is general and hardware
> >>>>> independent. You can simulate the same issue on your machine.
> >>>>>
> >>>>> Just put your computer into hibernation. Then boot some Linux
> >>>>> distribution from a live USB and change the RTC time. Shut down the
> >>>>> live system and boot your hibernated system. You should then be in the
> >>>>> same situation as I described.
> >>>> Yes, but this is like shooting yourself in the foot.
> >>> This is just a test case, so you can check, "simulate" and reproduce
> >>> this issue even without a "faulty machine".
> >> OK
> >>> Moreover, Windows systems used to store the RTC in local time and Linux
> >>> systems in UTC. I do not know if this still applies, but basically
> >>> multi-OS machines are affected by the same issue.
> >>>
> >>>> If you want to be robust in this case and all others, then by default
> >>>> you must restart ANY time sync daemon in the resume callback of your
> >>>> init system, be it ntpd or chrony, systemd or sysvinit or upstart or
> >>>> anything else. But it is problematic, as Miroslav points out, as you
> >>>> potentially start to trust any anonymous time source more than your own RTC.
> >>> What is problematic here? Your RTC may also be shifted, as I pointed
> >>> out, so it deserves the same trust as any other anonymous source.
> >>>
> >>> Also, what is the difference between trusting those "anonymous time
> >>> sources" at chronyd startup time and at the time your machine resumes
> >>> from suspend / hibernate?
> >>>
> >>> For me it does not make sense to say that an "anonymous time source" is
> >>> fully trusted when chronyd starts at computer startup time, but the same
> >>> "anonymous time source" is untrusted when the computer resumes from
> >>> hibernation.
> >> It is a tradeoff, and it is already bad. Ntpd startup scripts used not to
> >> call ntpdate and to bail out in case of too big a discrepancy (2 s or 3 s
> >> from memory). That is considered too user-unfriendly without proper UI
> >> interaction.
> >> But doing that at boot time only is better than at boot time AND at
> >> resume time.
> > And why is it better? I still do not see it. In both cases you are
> > starting the machine from an "unknown" RTC state, and in both cases you do
> > not know the correct UTC time. You do not even know how long you have been
> > in the sleep (or powered-off) state.
> >
> > Also, when resuming from hibernation you may have been completely powered
> > off, and the memory of the system may have been modified. Plus the
> > multi-OS scenario may apply, e.g. an ordinary user just "booted" Windows,
> > then turned it off and resumed Linux from hibernation. I guess we would
> > agree that an ordinary user does not use any virtualisation as you
> > described below.
> >
> >> And it is better than trusting any source at any time.
> >>>> The current makestep value is a sane default for the majority of sane
> >>>> machines with a standard use case.
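For context, the default in question is the makestep directive; distribution configurations commonly ship something along these lines:

```
# /etc/chrony/chrony.conf
# Step the system clock if the adjustment is larger than 1 second,
# but only during the first 3 clock updates after chronyd starts
makestep 1 3
```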
> >>> So multi-OS scenario is not standard (anymore)?
> >> It never has been. It works, with some limitations and tradeoffs.
> > By multi-OS I mean the classic dual-boot scenario, where only one OS is
> > running at a time. But different OSes interpret the RTC clock in different
> > ways, some as local time (and possibly different timezones) and some as UTC.
> Yes me too.
> > In the past I often saw the problem that Windows stored the system time
> > in the RTC as local time, then the computer was rebooted into Linux,
> > which reads the system time from the RTC as UTC and therefore saw an
> > incorrect time. Installing an NTP daemon fixed this problem. And then,
> > after a reboot, the Windows time was shifted, and after a few
> > seconds/minutes it was synchronized again against a Windows time server.
> A better workaround: just instruct Linux that the RTC is in the local
> timezone and not UTC, and it would have worked.
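On most distributions that switch is the third line of /etc/adjtime (UTC vs. LOCAL); systemd-based systems can set it with `timedatectl set-local-rtc 1`. An /etc/adjtime telling the kernel the RTC holds local time looks like:

```
0.0 0 0.0
0
LOCAL
```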

I remember that this setup did not work in one case: when the Linux system
was booted before the Windows system after a DST change. The time was
shifted twice, once by Linux and once by Windows, as Windows did not know
that the shift had already been applied...

> At that time, WTS was just a blind ntpdate equivalent, for information
> (a one-step sync from time to time).
> >
> >> The RTC cannot be shared read/write by its very nature. It is real time.
> >> You cannot save its state and restore it later, unless you have special
> >> HW/firmware that "virtualizes" it and is able to maintain per-OS state.
> >> And the different historical direct/indirect usages of the RTC on the PC
> >> complicate things to a dead end.
> >> I would never entrust this task to a PC BIOS...
> >> Time-wise, the only working multi-OS scenario is under a hypervisor,
> >> because it can arbitrate access to the hardware and is the only one
> >> messing with the real(time) hardware.
> > Yes, in the case where multiple OSes are running, the only option is to
> > provide a virtual RTC to the systems via the hypervisor and let the
> > hypervisor ensure atomicity.
> >
> >> In this case, you synchronise your hypervisor with external sources and
> >> provide a para-virtualised system clock or a virtual PTP clock to your
> >> guests for system clock sync. If you have to run ntp, your hypervisor
> >> must be your source, with any conf you want: it is a trustable source.
> >> A virtual RTC is provided to your guest if it needs one.
> >> That is the only standard/working multi-OS scenario from a timekeeping
> >> point of view in the PC world.
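For instance, on a KVM guest with the ptp_kvm kernel module loaded, chrony can use the hypervisor's clock as such a trusted source (the device path may vary):

```
# Guest /etc/chrony/chrony.conf
# The hypervisor's virtual PTP clock, exposed as /dev/ptp0 by ptp_kvm,
# serves as the reference instead of external NTP servers
refclock PHC /dev/ptp0 poll 2
```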
> >>
> >> Emmanuel.

-- 
Pali Rohár
pali.rohar@xxxxxxxxx
