Re: [eigen] Benchmarking



2010/9/7 Daniel Stonier <d.stonier@xxxxxxxxx>:
> Thanks for the advice lads, I didn't know about the
> CLOCK_PROCESS_CPUTIME_ID either. I'll do some testing with it
> tomorrow.
>
> Also, when making longer loops, to what extent do you have to worry
> about making sure you run the loop differently each time? i.e. is the
> compiler intelligent enough to know that you're sending it around the
> same treadmill in a lot of situations?

Yes, this is often a problem. So if you want to repeat an operation
many times, it's better to put it in a separate function and prevent the
compiler from inlining it (you can use EIGEN_DONT_INLINE).

Benoit

>
> Regards,
> Daniel Stonier.
>
> On 7 September 2010 22:09, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>> 2010/9/7 Eamon Nerbonne <eamon.nerbonne@xxxxxxxxx>:
>>> Are you sure clockspeed variability and hyperthreaded contention
>>> interference will be eliminated by using the process-CPU-time measurement?
>>
>> This is at least how I understand it, but I'm no expert, so I'll
>> gratefully accept numbers/links proving me wrong!
>> Benoit
>>
>>> It certainly sounds like that function measures time not ticks and even if
>>> it measured ticks, with hyperthreading, it's not so clear what that means.
>>> Even for scheduling I wouldn't count on that being precise to
>>> sub-microsecond level without some good testing - I don't have a clue at
>>> which point during a context switch the clock is stopped, so to speak.
>>>
>>> In any case, using longer loops is just easier to get right.
>>>
>>> --eamon@xxxxxxxxxxxx - Tel#:+31-6-15142163
>>>
>>>
>>> On Tue, Sep 7, 2010 at 13:16, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>>>>
>>>> 2010/9/7 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>>> > 2010/9/7 Eamon Nerbonne <eamon.nerbonne@xxxxxxxxx>:
>>>> >> Do you have experience timing things at that level (nanoseconds, that
>>>> >> is)?
>>>> >>
>>>> >> If you're timing things at the microsecond level, you'll get
>>>> >> interference
>>>> >> from cache effects
>>>> >
>>>> > Ah true, the cost of a single RAM access is non-negligible compared to
>>>> > 1 microsecond... making it forever irrelevant to benchmark at that
>>>> > level! At least as long as RAM is involved.
>>>> >
>>>> >> and possibly the scheduler, though you tried to prevent
>>>> >> that (does that prevent I/O kernel time too?). It is odd that you
>>>> >> consistently see lower performance for several loop iterations,
>>>> >> however,
>>>> >> since that's not normal cache behavior.  Another factor you might be
>>>> >> running
>>>> >> into: Power-saving cpu speed reduction.  If your clock speed is
>>>> >> throttled,
>>>> >> it may well take a while before the heuristics decide load is high
>>>> >> enough to
>>>> >> unthrottle - or maybe your CPU is hyperthreaded and sharing a core with
>>>> >> another expensive task initially.
>>>> >
>>>> > All of that should be taken care of by using
>>>> > clock_gettime(CLOCK_PROCESS_CPUTIME_ID).
>>>> >
>>>> >>  And of course, depending on the details,
>>>> >> you might be running into other weirdness too such as denormalized
>>>> >> floating
>>>> >> points and NaN/Inf values.
>>>> >
>>>> > Right --- but that isn't specific to timing on a small scale. Can ruin
>>>> > a day-long benchmark, too.
>>>> >
>>>> >> Generally, I make my loops long enough to reach the millisecond range,
>>>> >> and
>>>> >> then re-run them several times; even then you see some possibly
>>>> >> scheduler-related variability.
>>>> >
>>>> > Yes, being in the millisecond range is needed to get something
>>>> > 'statistically significant' wrt RAM accesses.
>>>>
>>>> For the record: yes running in the millisecond range is needed wrt RAM
>>>> accesses, but no I don't think that 'scheduler variability' is a
>>>> potential problem as that should be completely taken care of by
>>>> clock_gettime(CLOCK_PROCESS_CPUTIME_ID).
>>>>
>>>> Benoit
>>>>
>>>> >
>>>> > My other 'trick' is to just use a good profiler that uses the CPU's
>>>> > performance counters. It allows you to benchmark any code without
>>>> > having to modify it... On recent Linux kernels, use 'perf'.
>>>> >
>>>> > Benoit
>>>> >
>>>> >
>>>> >>
>>>> >> --eamon@xxxxxxxxxxxx - Tel#:+31-6-15142163
>>>> >>
>>>> >>
>>>> >> On Tue, Sep 7, 2010 at 08:15, Daniel Stonier <d.stonier@xxxxxxxxx>
>>>> >> wrote:
>>>> >>>
>>>> >>> Hi lads,
>>>> >>>
>>>> >>> I've been trying to benchmark eigen2 and eigen3's geometry modules
>>>> >>> recently just to get an idea of the speed we can run various
>>>> >>> structures at, but I'm having a hard time getting consistent results
>>>> >>> and thought you might be able to lend some advice.
>>>> >>>
>>>> >>> Typically, I do things in the following order on a Linux platform
>>>> >>> with rt timers (i.e. clock_gettime(CLOCK_MONOTONIC, ...)):
>>>> >>>
>>>> >>> ###########################################
>>>> >>> set the process as a real time priority posix process
>>>> >>> select transform type
>>>> >>> begin_loop
>>>> >>>  - fill transform with random data
>>>> >>>  - timestamp
>>>> >>>  - do a transform product
>>>> >>>  - timestamp again
>>>> >>>  - push time diff onto a queue
>>>> >>> repeat
>>>> >>> do some statistics
>>>> >>> ###########################################
>>>> >>>
>>>> >>> The times I have coming out are extremely inconsistent though:
>>>> >>>
>>>> >>> - if repeating only 100 times, the product might come out with times
>>>> >>> of ~840-846ns one run, then sometimes 300-310ns on another run.
>>>> >>> - if repeating 10000 times, it will run at ~840ns for a long time,
>>>> >>> then jump down and run at 300-310ns for the remainder.
>>>> >>> - running other tests in the loop as well (taking separate timestamps
>>>> >>> and using multiple queues) can cause the calculation time to be very
>>>> >>> different.
>>>> >>>  - e.g. this test alone produces results of ~600ns, mingled with
>>>> >>> other tests it is usually ~840ns.
>>>> >>>
>>>> >>> Some troubleshooting:
>>>> >>>
>>>> >>> - it is not an effect of multiple cores, as the same problems happen
>>>> >>> when using taskset to lock the process onto a single core.
>>>> >>> - it shouldn't be from the scheduler either because it is an elevated
>>>> >>> posix real time process.
>>>> >>>
>>>> >>> I'm baffled. Would really love to know more about how my computer
>>>> >>> processes in such a humanly erratic fashion and what's a good way of
>>>> >>> testing that.
>>>> >>>
>>>> >>> Cheers,
>>>> >>> Daniel Stonier.
>>>> >>>
>>>> >>> --
>>>> >>> Phone : +82-10-5400-3296 (010-5400-3296)
>>>> >>> Home: http://snorriheim.dnsdojo.com/
>>>> >>> Yujin Robot: http://www.yujinrobot.com/
>>>> >>> Embedded Control Libraries:
>>>> >>> http://snorriheim.dnsdojo.com/redmine/wiki/ecl


