Re: [eigen] Benchmarking


2010/9/7 Eamon Nerbonne <eamon.nerbonne@xxxxxxxxx>:
> Are you sure clockspeed variability and hyperthreaded contention
> interference will be eliminated by using the process-CPU-time measurement?

This is at least how I understand it, but I'm no expert so I'll
gratefully accept numbers/links proving me wrong!
Benoit
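
One way to get such numbers is to read both clocks around the same
workload. The following is only a minimal sketch, assuming a POSIX system;
dummy_work() is a hypothetical stand-in for the transform product and
time_both() is an illustrative helper, neither comes from this thread or
from Eigen. Note that CLOCK_PROCESS_CPUTIME_ID reports CPU time in
nanoseconds, not cycles, so it can exclude time spent descheduled but would
still be expected to vary with clock-frequency changes.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>   // POSIX clock_gettime, clockid_t, timespec

// Read a POSIX clock and return its value in nanoseconds.
static std::int64_t now_ns(clockid_t id) {
    timespec ts{};
    int rc = clock_gettime(id, &ts);
    assert(rc == 0);
    (void)rc;  // silence unused warning when NDEBUG is defined
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

struct Elapsed {
    std::int64_t wall_ns;  // elapsed CLOCK_MONOTONIC time
    std::int64_t cpu_ns;   // elapsed CLOCK_PROCESS_CPUTIME_ID time
};

// Hypothetical workload standing in for the code under test.
// The volatile accumulator keeps the compiler from removing the loop.
static double dummy_work(int iterations) {
    volatile double acc = 0.0;
    for (int i = 0; i < iterations; ++i) {
        acc = acc + i * 0.5;
    }
    return acc;
}

// Measure the same workload with both clocks. The CPU-time reads sit
// inside the wall-clock reads, so wall_ns also covers the cost of the
// two inner clock_gettime calls.
static Elapsed time_both(int iterations) {
    const std::int64_t w0 = now_ns(CLOCK_MONOTONIC);
    const std::int64_t c0 = now_ns(CLOCK_PROCESS_CPUTIME_ID);
    dummy_work(iterations);
    const std::int64_t c1 = now_ns(CLOCK_PROCESS_CPUTIME_ID);
    const std::int64_t w1 = now_ns(CLOCK_MONOTONIC);
    return Elapsed{w1 - w0, c1 - c0};
}
```

Calling time_both() many times and comparing the two fields per run should
show whether the jumps reported in the original post appear in CPU time as
well as in wall-clock time. On older glibc versions, linking with -lrt may
be needed for clock_gettime.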

> It certainly sounds like that function measures time, not ticks, and even
> if it measured ticks, with hyperthreading it's not so clear what that means.
> Even for scheduling I wouldn't count on that being precise to
> sub-microsecond level without some good testing - I don't have a clue at
> which point during a context switch the clock is stopped, so to speak.
>
> In any case, using longer loops is just easier to get right.
>
> --eamon@xxxxxxxxxxxx - Tel#:+31-6-15142163
>
>
> On Tue, Sep 7, 2010 at 13:16, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>>
>> 2010/9/7 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>> > 2010/9/7 Eamon Nerbonne <eamon.nerbonne@xxxxxxxxx>:
>> >> Do you have experience timing things at that level (nanoseconds, that
>> >> is)?
>> >>
>> >> If you're timing things at the microsecond level, you'll get
>> >> interference
>> >> from cache effects
>> >
>> > Ah true, the cost of a single RAM access is non-negligible compared to
>> > 1 microsecond... making it forever irrelevant to benchmark at that
>> > level! At least as long as RAM is involved.
>> >
>> >> and possibly the scheduler, though you tried to prevent
>> >> that (does that prevent I/O kernel time too?).  It is odd that you
>> >> consistently see lower performance for several loop iterations,
>> >> however,
>> >> since that's not normal cache behavior.  Another factor you might be
>> >> running
>> >> into: Power-saving cpu speed reduction.  If your clock speed is
>> >> throttled,
>> >> it may well take a while before the heuristics decide load is high
>> >> enough to
>> >> unthrottle - or maybe your CPU is hyperthreaded and sharing a core with
>> >> another expensive task initially.
>> >
>> > All of that should be taken care of by using
>> > clock_gettime(CLOCK_PROCESS_CPUTIME_ID).
>> >
>> >>  And of course, depending on the details,
>> >> you might be running into other weirdness too such as denormalized
>> >> floating
>> >> points and NaN/Inf values.
>> >
>> > Right --- but that isn't specific to timing on a small scale. Can ruin
>> > a day-long benchmark, too.
>> >
>> >> Generally, I make my loops long enough to reach the millisecond range,
>> >> and
>> >> then re-run them several times; even then you see some possibly
>> >> scheduler-related variability.
>> >
>> > Yes, being in the millisecond range is needed to get something
>> > 'statistically significant' wrt RAM accesses.
>>
>> For the record: yes running in the millisecond range is needed wrt RAM
>> accesses, but no I don't think that 'scheduler variability' is a
>> potential problem as that should be completely taken care of by
>> clock_gettime(CLOCK_PROCESS_CPUTIME_ID).
>>
>> Benoit
>>
>> >
>> > My other 'trick' is to just use a good profiler that uses the CPU's
>> > performance counters. It lets you benchmark any code without having
>> > to modify it... On recent Linux kernels, use 'perf'.
>> >
>> > Benoit
>> >
>> >
>> >>
>> >> --eamon@xxxxxxxxxxxx - Tel#:+31-6-15142163
>> >>
>> >>
>> >> On Tue, Sep 7, 2010 at 08:15, Daniel Stonier <d.stonier@xxxxxxxxx>
>> >> wrote:
>> >>>
>> >>> Hi lads,
>> >>>
>> >>> I've been trying to benchmark eigen2 and eigen3's geometry modules
>> >>> recently just to get an idea of the speed we can run various
>> >>> structures at, but I'm having a hard time getting consistent results
>> >>> and thought you might be able to lend some advice.
>> >>>
>> >>> Typically, I do things in the following order on a linux platform with
>> >>> rt timers (ie  clock_gettime(CLOCK_MONOTONIC,...))
>> >>>
>> >>> ###########################################
>> >>> set the process as a real time priority posix process
>> >>> select transform type
>> >>> begin_loop
>> >>>  - fill transform with random data
>> >>>  - timestamp
>> >>>  - do a transform product
>> >>>  - timestamp again
>> >>>  - push time diff onto a queue
>> >>> repeat
>> >>> do some statistics
>> >>> ###########################################
>> >>>
>> >>> The times I have coming out are extremely inconsistent though:
>> >>>
>> >>> - if repeating only 100 times, the product might come out with times
>> >>> of ~840-846ns one run, then sometimes 300-310ns on another run.
>> >>> - if repeating 10000 times, it will run at ~840ns for a long time,
>> >>> then jump down and run at 300-310ns for the remainder.
>> >>> - running other tests in the loop as well (taking separate timestamps
>> >>> and using multiple queues) can cause the calculation time to be very
>> >>> different.
>> >>>  - e.g. this test alone produces results of ~600ns, mingled with
>> >>> other tests it is usually ~840ns.
>> >>>
>> >>> Some troubleshooting:
>> >>>
>> >>> - it is not effects from multi-core as the same problems happen when
>> >>> using taskset to lock it onto a single core.
>> >>> - it shouldn't be from the scheduler either because it is an elevated
>> >>> posix real time process.
>> >>>
>> >>> I'm baffled. Would really love to know more about how my computer
>> >>> processes in such a humanly erratic fashion and what's a good way of
>> >>> testing that.
>> >>>
>> >>> Cheers,
>> >>> Daniel Stonier.
>> >>>
>> >>> --
>> >>> Phone : +82-10-5400-3296 (010-5400-3296)
>> >>> Home: http://snorriheim.dnsdojo.com/
>> >>> Yujin Robot: http://www.yujinrobot.com/
>> >>> Embedded Control Libraries:
>> >>> http://snorriheim.dnsdojo.com/redmine/wiki/ecl
>> >>>
>> >>>
>> >>
>> >>
>> >
>>
>>
>
>
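
The advice given in the thread (time a batch long enough to reach the
millisecond range, then repeat it several times) can be sketched as
follows. This is a minimal sketch assuming a POSIX system; time_batches()
and median() are hypothetical helper names, not Eigen or thread code, and
reporting the median is a choice made here, not something the posters
prescribed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <time.h>   // POSIX clock_gettime
#include <vector>

// Process CPU time in nanoseconds.
static std::int64_t cpu_now_ns() {
    timespec ts{};
    int rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    assert(rc == 0);
    (void)rc;  // silence unused warning when NDEBUG is defined
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Run `repeats` batches of `batch` calls to f, taking one timestamp pair
// per batch instead of per call. Returns nanoseconds per call per batch.
template <typename F>
std::vector<double> time_batches(F&& f, int batch, int repeats) {
    assert(batch > 0 && repeats > 0);
    std::vector<double> per_call;
    per_call.reserve(static_cast<std::size_t>(repeats));
    for (int r = 0; r < repeats; ++r) {
        const std::int64_t t0 = cpu_now_ns();
        for (int i = 0; i < batch; ++i) {
            f();
        }
        const std::int64_t t1 = cpu_now_ns();
        per_call.push_back(static_cast<double>(t1 - t0) / batch);
    }
    return per_call;
}

// Median of the samples; takes a copy so the caller's order is preserved.
static double median(std::vector<double> v) {
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();
    return (n % 2 == 1) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}
```

With a call costing a few hundred nanoseconds, a batch of roughly 10000
calls puts each measurement in the millisecond range. The callable passed
as f should have an observable side effect (for example, writing to a
volatile variable) so the compiler cannot optimize the work away.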


