- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] Benchmarking
- From: Daniel Stonier <d.stonier@xxxxxxxxx>
- Date: Tue, 7 Sep 2010 23:04:41 +0900
Thanks for the advice, lads. I didn't know about
CLOCK_PROCESS_CPUTIME_ID either; I'll do some testing with it
tomorrow.
Also, when making longer loops, to what extent do you have to worry
about running the loop differently each time? I.e. is the compiler
intelligent enough to notice that you're sending it around the same
treadmill in a lot of situations, and optimise the work away?
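
If it is, I guess something like the following would keep the
optimiser honest: vary the inputs every iteration and force each
result to be observed through an empty asm barrier. (An untested
sketch, GCC/Clang only; Affine3f is the eigen3 name and just stands
in for whatever type is actually being timed.)

#include <Eigen/Geometry>

// Empty asm barrier: tells GCC/Clang the value is "used", so the
// product below cannot be eliminated as dead code.
template <typename T>
inline void doNotOptimize(T const& value) {
    asm volatile("" : : "r,m"(value) : "memory");
}

void bench(int iterations) {
    Eigen::Affine3f t1, t2;
    for (int i = 0; i < iterations; ++i) {
        t1.matrix().setRandom();       // fresh data every pass, so the
        t2.matrix().setRandom();       // iterations are not identical
        Eigen::Affine3f t3 = t1 * t2;  // the operation under test
        doNotOptimize(t3);             // force the result to be observed
    }
}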
Regards,
Daniel Stonier.
On 7 September 2010 22:09, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> 2010/9/7 Eamon Nerbonne <eamon.nerbonne@xxxxxxxxx>:
>> Are you sure clockspeed variability and hyperthreaded contention
>> interference will be eliminated by using the process-CPU-time measurement?
>
> This is at least how I understand it, but I'm no expert, so I'll
> gratefully accept numbers/links proving me wrong!
> Benoit
>
>> It certainly sounds like that function measures time, not ticks, and even
>> if it measured ticks, it's not so clear what that means with
>> hyperthreading. Even for scheduling I wouldn't count on it being precise
>> to the sub-microsecond level without some good testing; I don't have a
>> clue at which point during a context switch the clock is stopped, so to speak.
>>
>> In any case, using longer loops is just easier to get right.
>>
>> --eamon@xxxxxxxxxxxx - Tel#:+31-6-15142163
>>
>>
>> On Tue, Sep 7, 2010 at 13:16, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>>>
>>> 2010/9/7 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>> > 2010/9/7 Eamon Nerbonne <eamon.nerbonne@xxxxxxxxx>:
>>> >> Do you have experience timing things at that level (nanoseconds, that
>>> >> is)?
>>> >>
>>> >> If you're timing things at the microsecond level, you'll get
>>> >> interference
>>> >> from cache effects
>>> >
>>> > Ah, true, the cost of a single RAM access is non-negligible compared to
>>> > 1 microsecond... making it essentially meaningless to benchmark at that
>>> > level, at least as long as RAM is involved.
>>> >
>>> >> and possibly the scheduler, though you tried to prevent
>>> >> that (does that prevent I/O kernel time too?). It is odd that you
>>> >> consistently see lower performance for several loop iterations,
>>> >> however, since that's not normal cache behavior. Another factor you
>>> >> might be running into: power-saving CPU speed reduction. If your clock
>>> >> speed is throttled, it may well take a while before the heuristics
>>> >> decide load is high enough to unthrottle; or maybe your CPU is
>>> >> hyperthreaded and initially sharing a core with another expensive task.
>>> >
>>> > All of that should be taken care of by using
>>> > clock_gettime(CLOCK_PROCESS_CPUTIME_ID).
>>> >
>>> >> And of course, depending on the details,
>>> >> you might be running into other weirdness too, such as denormalized
>>> >> floating-point numbers and NaN/Inf values.
>>> >
>>> > Right, but that isn't specific to timing on a small scale; it can ruin
>>> > a day-long benchmark, too.
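>>> >
>>> > (For float benchmarks on x86 with SSE there is at least a simple guard
>>> > against the denormal case; a generic snippet, nothing Eigen-specific:)
>>> >
>>> > #include <xmmintrin.h>   // SSE intrinsics (GCC/ICC/MSVC on x86)
>>> >
>>> > int main() {
>>> >     // Flush denormal results to zero, so a stray tiny value cannot
>>> >     // silently make every multiply an order of magnitude slower.
>>> >     _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
>>> >     // ... run the benchmark here ...
>>> >     return 0;
>>> > }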
>>> >
>>> >> Generally, I make my loops long enough to reach the millisecond range,
>>> >> and
>>> >> then re-run them several times; even then you see some possibly
>>> >> scheduler-related variability.
>>> >
>>> > Yes, being in the millisecond range is needed to get something
>>> > 'statistically significant' wrt RAM accesses.
>>>
>>> For the record: yes, running in the millisecond range is needed wrt RAM
>>> accesses, but no, I don't think that 'scheduler variability' is a
>>> potential problem, as that should be completely taken care of by
>>> clock_gettime(CLOCK_PROCESS_CPUTIME_ID).
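>>>
>>> Concretely, the pattern is just the following (POSIX; on older glibc you
>>> need to link with -lrt, and the loop body here is only a stand-in):
>>>
>>> #include <time.h>    // clock_gettime, CLOCK_PROCESS_CPUTIME_ID
>>> #include <stdio.h>
>>>
>>> int main() {
>>>     timespec start, stop;
>>>     // This clock counts CPU time charged to this process, so time
>>>     // spent scheduled out does not inflate the measurement.
>>>     clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
>>>     /* ... code under test ... */
>>>     clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &stop);
>>>     double ns = (stop.tv_sec - start.tv_sec) * 1e9
>>>               + (stop.tv_nsec - start.tv_nsec);
>>>     printf("%.0f ns\n", ns);
>>>     return 0;
>>> }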
>>>
>>> Benoit
>>>
>>> >
>>> > My other 'trick' is to just use a good profiler that uses the CPU's
>>> > performance counters. That lets you benchmark any code without having
>>> > to modify it... On recent Linux kernels, use 'perf'.
>>> >
>>> > Benoit
>>> >
>>> >
>>> >>
>>> >> --eamon@xxxxxxxxxxxx - Tel#:+31-6-15142163
>>> >>
>>> >>
>>> >> On Tue, Sep 7, 2010 at 08:15, Daniel Stonier <d.stonier@xxxxxxxxx>
>>> >> wrote:
>>> >>>
>>> >>> Hi lads,
>>> >>>
>>> >>> I've been trying to benchmark the eigen2 and eigen3 geometry modules
>>> >>> recently, just to get an idea of how fast we can run various
>>> >>> structures, but I'm having a hard time getting consistent results
>>> >>> and thought you might be able to lend some advice.
>>> >>>
>>> >>> Typically, I do things in the following order, on a Linux platform
>>> >>> with rt timers (i.e. clock_gettime(CLOCK_MONOTONIC, ...)):
>>> >>>
>>> >>> ###########################################
>>> >>> set the process as a real time priority posix process
>>> >>> select transform type
>>> >>> begin_loop
>>> >>> - fill transform with random data
>>> >>> - timestamp
>>> >>> - do a transform product
>>> >>> - timestamp again
>>> >>> - push time diff onto a queue
>>> >>> repeat
>>> >>> do some statistics
>>> >>> ###########################################
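>>> >>>
>>> >>> In code, the skeleton is roughly the following (a sketch, not my
>>> >>> actual benchmark; Affine3f just stands in for whichever transform
>>> >>> type is selected, and error handling is omitted):
>>> >>>
>>> >>> #include <time.h>    // clock_gettime, CLOCK_MONOTONIC
>>> >>> #include <sched.h>   // sched_setscheduler, SCHED_FIFO
>>> >>> #include <vector>
>>> >>> #include <Eigen/Geometry>
>>> >>>
>>> >>> int main() {
>>> >>>     // Real-time priority (needs root or CAP_SYS_NICE).
>>> >>>     sched_param sp;
>>> >>>     sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
>>> >>>     sched_setscheduler(0, SCHED_FIFO, &sp);
>>> >>>
>>> >>>     std::vector<double> diffs;
>>> >>>     for (int i = 0; i < 10000; ++i) {
>>> >>>         Eigen::Affine3f t1, t2;
>>> >>>         t1.matrix().setRandom();        // fill with random data
>>> >>>         t2.matrix().setRandom();
>>> >>>         timespec start, stop;
>>> >>>         clock_gettime(CLOCK_MONOTONIC, &start);
>>> >>>         Eigen::Affine3f t3 = t1 * t2;   // the transform product
>>> >>>         clock_gettime(CLOCK_MONOTONIC, &stop);
>>> >>>         diffs.push_back((stop.tv_sec - start.tv_sec) * 1e9
>>> >>>                         + (stop.tv_nsec - start.tv_nsec));
>>> >>>     }
>>> >>>     // ... statistics over 'diffs' ...
>>> >>>     return 0;
>>> >>> }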
>>> >>>
>>> >>> The times I have coming out are extremely inconsistent though:
>>> >>>
>>> >>> - if repeating only 100 times, the product might come out with times
>>> >>> of ~840-846ns one run, then sometimes 300-310ns on another run.
>>> >>> - if repeating 10000 times, it will run at ~840ns for a long time,
>>> >>> then jump down and run at 300-310ns for the remainder.
>>> >>> - running other tests in the loop as well (taking separate timestamps
>>> >>> and using multiple queues) can cause the calculation time to be very
>>> >>> different.
>>> >>> - e.g. this test alone produces results of ~600ns; when mingled with
>>> >>> other tests it is usually ~840ns.
>>> >>>
>>> >>> Some troubleshooting:
>>> >>>
>>> >>> - it is not a multi-core effect, as the same problems occur when
>>> >>> using taskset to lock the process onto a single core.
>>> >>> - it shouldn't be the scheduler either, because the process runs at
>>> >>> an elevated POSIX real-time priority.
>>> >>>
>>> >>> I'm baffled. I'd really love to know why my computer behaves in such
>>> >>> a humanly erratic fashion, and what a good way of testing it would be.
>>> >>>
>>> >>> Cheers,
>>> >>> Daniel Stonier.
>>> >>>
>>> >>> --
>>> >>> Phone : +82-10-5400-3296 (010-5400-3296)
>>> >>> Home: http://snorriheim.dnsdojo.com/
>>> >>> Yujin Robot: http://www.yujinrobot.com/
>>> >>> Embedded Control Libraries:
>>> >>> http://snorriheim.dnsdojo.com/redmine/wiki/ecl
>>> >>>
>>> >>>
>>> >>
>>> >>
>>> >
>>>
>>>
>>
>>
>
>
>
--
Phone : +82-10-5400-3296 (010-5400-3296)
Home: http://snorriheim.dnsdojo.com/
Yujin Robot: http://www.yujinrobot.com/
Embedded Control Libraries: http://snorriheim.dnsdojo.com/redmine/wiki/ecl