Re: [eigen] Benchmarking



Are you sure that clock-speed variability and hyperthreading contention will be eliminated by using the process-CPU-time measurement?  That function appears to measure time, not ticks, and even if it measured ticks, it's not clear what a tick means under hyperthreading. Even for scheduling, I wouldn't count on it being precise to the sub-microsecond level without some good testing: I don't have a clue at which point during a context switch the clock is stopped, so to speak.

In any case, using longer loops is just easier to get right.
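
A minimal sketch of the longer-loop approach, assuming a Linux/POSIX system. The helper names process_cpu_seconds and best_ns_per_call are hypothetical and not from this thread, and taking the minimum over repeats is one common choice rather than anything the posters prescribed. The idea is to time a whole batch of calls at once, so the two timestamps bracket milliseconds of work instead of a few hundred nanoseconds, and then repeat the batch several times.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <time.h>  // POSIX: clock_gettime, CLOCK_PROCESS_CPUTIME_ID

// CPU time consumed so far by this process, in seconds.
inline double process_cpu_seconds() {
    timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return static_cast<double>(ts.tv_sec) + 1e-9 * static_cast<double>(ts.tv_nsec);
}

// Hypothetical helper: runs `op` `batch` times between two timestamps,
// repeats that measurement `repeats` times, and returns the smallest
// per-call time observed, in nanoseconds.
template <typename Op>
double best_ns_per_call(Op op, long batch, int repeats) {
    assert(batch > 0 && repeats > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        const double t0 = process_cpu_seconds();
        for (long i = 0; i < batch; ++i) {
            op();
        }
        const double t1 = process_cpu_seconds();
        best = std::min(best, (t1 - t0) * 1e9 / static_cast<double>(batch));
    }
    return best;
}
```

To use it, pass the operation as a callable, for example best_ns_per_call([&] { result = a * b; }, 1000000, 5), choosing the batch size so that one batch lands in the millisecond range. The result has to be used afterwards (or written to a volatile), otherwise the compiler may remove the work being timed.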

--eamon@xxxxxxxxxxxx - Tel#:+31-6-15142163


On Tue, Sep 7, 2010 at 13:16, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
2010/9/7 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
> 2010/9/7 Eamon Nerbonne <eamon.nerbonne@xxxxxxxxx>:
>> Do you have experience timing things at that level (nanoseconds, that is)?
>>
>> If you're timing things at the microsecond level, you'll get interference
>> from cache effects
>
> Ah true, the cost of a single RAM access is non-negligible compared to
> 1 microsecond... making it forever irrelevant to benchmark at that
> level! At least as long as RAM is involved.
>
>> and possibly the scheduler, though you tried to prevent
>> that (does that prevent I/O kernel time too?).  It is odd that you
>> consistently see lower performance for several loop iterations, however,
>> since that's not normal cache behavior.  Another factor you might be running
>> into: Power-saving cpu speed reduction.  If your clock speed is throttled,
>> it may well take a while before the heuristics decide load is high enough to
>> unthrottle - or maybe your CPU is hyperthreaded and sharing a core with
>> another expensive task initially.
>
> All of that should be taken care of by using
> clock_gettime(CLOCK_PROCESS_CPUTIME_ID).
>
>>  And of course, depending on the details,
>> you might be running into other weirdness too such as denormalized floating
>> points and NaN/Inf values.
>
> Right --- but that isn't specific to timing on a small scale. Can ruin
> a day-long benchmark, too.
>
>> Generally, I make my loops long enough to reach the millisecond range, and
>> then re-run them several times; even then you see some possibly
>> scheduler-related variability.
>
> Yes, being in the millisecond range is needed to get something
> 'statistically significant' wrt RAM accesses.

For the record: yes, running in the millisecond range is needed wrt RAM
accesses, but no, I don't think that 'scheduler variability' is a
potential problem, as that should be completely taken care of by
clock_gettime(CLOCK_PROCESS_CPUTIME_ID).

Benoit
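
A small sketch, assuming Linux/POSIX, of what the two clocks mentioned in this thread do and do not count. The names seconds_on, ClockDeltas and measure_across_sleep are hypothetical. The process sleeps between two pairs of timestamps: CLOCK_MONOTONIC advances by the full sleep, while CLOCK_PROCESS_CPUTIME_ID barely moves, because time spent off the CPU is not charged to the process. Note that this only illustrates the scheduling point. Both clocks report seconds, not cycles, so the sketch says nothing about clock-speed throttling or hyperthreading.

```cpp
#include <cassert>
#include <time.h>  // POSIX: clock_gettime, nanosleep, clockid_t

// Current reading of the given POSIX clock, in seconds.
inline double seconds_on(clockid_t clock) {
    timespec ts;
    const int rc = clock_gettime(clock, &ts);
    assert(rc == 0);
    (void)rc;  // silence unused-variable warnings when NDEBUG is defined
    return static_cast<double>(ts.tv_sec) + 1e-9 * static_cast<double>(ts.tv_nsec);
}

struct ClockDeltas {
    double wall_s;  // elapsed on CLOCK_MONOTONIC
    double cpu_s;   // elapsed on CLOCK_PROCESS_CPUTIME_ID
};

// Hypothetical helper: sleeps for `sleep_ms` milliseconds and reports how
// far each clock advanced. The sleep can end early if a signal arrives;
// this sketch does not retry in that case.
inline ClockDeltas measure_across_sleep(long sleep_ms) {
    assert(sleep_ms >= 0);
    const double wall0 = seconds_on(CLOCK_MONOTONIC);
    const double cpu0 = seconds_on(CLOCK_PROCESS_CPUTIME_ID);

    timespec request;
    request.tv_sec = sleep_ms / 1000;
    request.tv_nsec = (sleep_ms % 1000) * 1000000L;
    nanosleep(&request, nullptr);

    ClockDeltas d;
    d.cpu_s = seconds_on(CLOCK_PROCESS_CPUTIME_ID) - cpu0;
    d.wall_s = seconds_on(CLOCK_MONOTONIC) - wall0;
    return d;
}
```

Calling measure_across_sleep(50) and printing both fields shows the gap between the clocks directly. On older glibc versions, clock_gettime may require linking with -lrt.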

>
> My other 'trick' is to just use a good profiler that uses the CPU's
> performance counters. It lets you benchmark any code without having to
> modify it... On recent Linux kernels, use 'perf'.
>
> Benoit
>
>
>>
>> --eamon@xxxxxxxxxxxx - Tel#:+31-6-15142163
>>
>>
>> On Tue, Sep 7, 2010 at 08:15, Daniel Stonier <d.stonier@xxxxxxxxx> wrote:
>>>
>>> Hi lads,
>>>
>>> I've been trying to benchmark eigen2 and eigen3's geometry modules
>>> recently just to get an idea of the speed we can run various
>>> structures at, but I'm having a hard time getting consistent results
>>> and thought you might be able to lend some advice.
>>>
>>> Typically, I do things in the following order on a linux platform with
>>> rt timers (ie  clock_gettime(CLOCK_MONOTONIC,...))
>>>
>>> ###########################################
>>> set the process as a real time priority posix process
>>> select transform type
>>> begin_loop
>>>  - fill transform with random data
>>>  - timestamp
>>>  - do a transform product
>>>  - timestamp again
>>>  - push time diff onto a queue
>>> repeat
>>> do some statistics
>>> ###########################################
>>>
>>> The times I have coming out are extremely inconsistent though:
>>>
>>> - if repeating only 100 times, the product might come out with times
>>> of ~840-846ns one run, then sometimes 300-310ns on another run..
>>> - if repeating 10000 times, it will run at ~840ns for a long time,
>>> then jump down and run at 300-310ns for the remainder.
>>> - running other tests in the loop as well (taking separate timestamps
>>> and using multiple queues) can cause the calculation time to be very
>>> different.
>>>  - e.g. this test alone produces results of ~600ns, mingled with
>>> other tests it is usually ~840ns.
>>>
>>> Some troubleshooting:
>>>
>>> - it is not effects from multi-core as the same problems happen when
>>> using taskset to lock it onto a single core.
>>> - it shouldn't be from the scheduler either because it is an elevated
>>> posix real time process.
>>>
>>> I'm baffled. Would really love to know more about how my computer
>>> processes in such a humanly erratic fashion and what's a good way of
>>> testing that.
>>>
>>> Cheers,
>>> Daniel Stonier.
>>>
>>> --
>>> Phone : +82-10-5400-3296 (010-5400-3296)
>>> Home: http://snorriheim.dnsdojo.com/
>>> Yujin Robot: http://www.yujinrobot.com/
>>> Embedded Control Libraries: http://snorriheim.dnsdojo.com/redmine/wiki/ecl
>>>
>>>
>>
>>
>




