Re: [eigen] a branch for SMP (openmp) experimentations




hi,

nice results!

however, in order to estimate the efficiency with respect to the number of threads, you should run it with OMP_NUM_THREADS=1 and compare against that baseline. Don't use the CPU time reported in the multithreaded case: it sums time across all threads, so it is meaningless for this purpose. Then I expect a ratio much lower than 99%!

basically: if single-threaded, use the "cpu" time; otherwise, use the "real" time.

gael

On Fri, Feb 26, 2010 at 9:17 PM, Aron Ahmadia <aja2111@xxxxxxxxxxxx> wrote:
Some nice bench results coming off the X5550 @ 2.67GHz

(single-precision)
[aron@kw2050]~/sandbox/eigen-smp/bench% g++ bench_gemm.cpp -DNDEBUG
-DHAVE_BLAS -I.. -O2 -fopenmp -lrt -lblas -o ./bench  && /usr/bin/time
-p ./bench
blas  cpu   0.133795s   2.00632 GFLOPS  (14.0114s)
blas  real  0.133813s   2.00605 GFLOPS  (13.3878s)
eigen cpu   0.0191605s          14.0098 GFLOPS  (1.92616s)
eigen real  0.0024013s          111.787 GFLOPS  (0.241387s)
real 13.79
user 16.08
sys 0.13

For whatever reason, the BLAS isn't built multi-threaded, but its
performance is pretty terrible even single-threaded.  If these numbers
are to be believed, Gael's multi-threaded multiply scales with 99.7%
efficiency on the X5550, averaging 2.6/4 SIMD fused multiply-add
operations per cycle in single precision.

(double-precision)
[aron@kw2050]~/sandbox/eigen-smp/bench% g++ bench_gemm.cpp -DNDEBUG
-DHAVE_BLAS -I.. -O2 -fopenmp -lrt -lblas -o ./bench  && /usr/bin/time
-p ./bench
Warning, your parallel product is crap! <I need to fix this>

blas  cpu   0.13462s    1.99402 GFLOPS  (14.0937s)
blas  real  0.134625s   1.99395 GFLOPS  (13.4901s)
eigen cpu   0.0363907s          7.37649 GFLOPS  (3.70925s)
eigen real  0.00455555s         58.925 GFLOPS   (0.465924s)
real 14.11
user 17.95
sys 0.11

Again, near-perfect scaling, and eigen is averaging 1.4/2 SIMD fused
multiply-add operations per cycle in double precision.

I'll look at this more later this week, and I'd like to more carefully
verify these numbers since they're pretty astonishing to me.  Gael,
I'm happy to give you an honorary A+ in my Parallel Computing
Paradigms course if these are legit.

A

On Fri, Feb 26, 2010 at 4:26 PM, Gael Guennebaud
> Thank you for the link too :)
>
> And to entertain everybody following our adventures, here are the mandatory
> pictures:
>
> * single core: http://dl.dropbox.com/u/260133/matrix_matrix.pdf
> * quad cores: http://dl.dropbox.com/u/260133/matrix_matrix-smp..pdf
>
> gael
>
>
> On Fri, Feb 26, 2010 at 1:02 PM, Aron Ahmadia <aja2111@xxxxxxxxxxxx> wrote:
>>
>> Those are some good notes, thanks Frank.
>>
>> It's easy to get confused there because he's assuming a distributed
>> memory layout, but still, that might be a useful technique to try and
>> apply.
>>
>> A
>>
>> On Fri, Feb 26, 2010 at 2:57 PM, FMDSPAM <fmdspam@xxxxxxxxx> wrote:
>> > Am 26.02.2010 11:28, schrieb Aron Ahmadia:
>> >
>> > <snip>
>> >
>> > Okay, this might be a bit tricky, so forgive me if I'm
>> > over-complicating things, can we introduce another subdivision?:
>> >
>> >
>> >
>> > Forgive my shameless plug: I came across a short discussion on that
>> > topic some time ago here.
>> > Most of what he is discussing, and what you are doing, is beyond my
>> > skills, but possibly it helps.
>> >
>> > Frank.
>> >
>> >
>>
>>
>
>




