Re: [eigen] a branch for SMP (openmp) experimentations



Ahh, this makes much more sense, I was just trying to figure out what
I was doing wrong...

Unfortunately, this machine is occupied for the next 3 days, so I
can't get reliable numbers out of it until then :(

A

On Sat, Feb 27, 2010 at 5:48 PM, Gael Guennebaud
<gael.guennebaud@xxxxxxxxx> wrote:
>
> hi,
>
> nice results!
>
> however, in order to estimate the efficiency wrt the number of threads, you
> should run it with OMP_NUM_THREADS=1 as the baseline, and not use the CPU
> time returned in the multithreaded case, which is meaningless there. Then I
> expect a ratio much lower than 99%!
>
> basically, if mono threaded then use the "cpu" time, otherwise use the
> "real" time.
>
> gael
>
> On Fri, Feb 26, 2010 at 9:17 PM, Aron Ahmadia <aja2111@xxxxxxxxxxxx> wrote:
>>
>> Some nice bench results coming off the X5550 @ 2.67GHz
>>
>> (single-precision)
>> [aron@kw2050]~/sandbox/eigen-smp/bench% g++ bench_gemm.cpp -DNDEBUG
>> -DHAVE_BLAS -I.. -O2 -fopenmp -lrt -lblas -o ./bench  && /usr/bin/time
>> -p ./bench
>> blas  cpu   0.133795s   2.00632 GFLOPS  (14.0114s)
>> blas  real  0.133813s   2.00605 GFLOPS  (13.3878s)
>> eigen cpu   0.0191605s          14.0098 GFLOPS  (1.92616s)
>> eigen real  0.0024013s          111.787 GFLOPS  (0.241387s)
>> real 13.79
>> user 16.08
>> sys 0.13
>>
>> For whatever reason, the BLAS isn't built multi-threaded, but its
>> performance is pretty terrible even single-threaded.  If these numbers
>> are to be believed, Gael's multi-threaded multiply scales with 99.7%
>> efficiency on the X5550, averaging 2.6/4 SIMD fused multiply-add
>> operations per cycle in single precision.
>>
>> (double-precision)
>> [aron@kw2050]~/sandbox/eigen-smp/bench% g++ bench_gemm.cpp -DNDEBUG
>> -DHAVE_BLAS -I.. -O2 -fopenmp -lrt -lblas -o ./bench  && /usr/bin/time
>> -p ./bench
>> Warning, your parallel product is crap! <I need to fix this>
>>
>> blas  cpu   0.13462s    1.99402 GFLOPS  (14.0937s)
>> blas  real  0.134625s   1.99395 GFLOPS  (13.4901s)
>> eigen cpu   0.0363907s          7.37649 GFLOPS  (3.70925s)
>> eigen real  0.00455555s         58.925 GFLOPS   (0.465924s)
>> real 14.11
>> user 17.95
>> sys 0.11
>>
>> Again, near-perfect scaling, and eigen is averaging 1.4/2 SIMD fused
>> multiply-add operations per cycle in double precision.
>>
>> I'll look at this more later this week, and I'd like to more carefully
>> verify these numbers since they're pretty astonishing to me.  Gael,
>> I'm happy to give you an honorary A+ in my Parallel Computing
>> Paradigms course if these are legit.
>>
>> A
>>
>> On Fri, Feb 26, 2010 at 4:26 PM, Gael Guennebaud
>> <gael.guennebaud@xxxxxxxxx> wrote:
>> >
>> > Thank you for the link too :)
>> >
>> > And to entertain everybody following our adventures, here are the
>> > mandatory
>> > pictures:
>> >
>> > * single core: http://dl.dropbox.com/u/260133/matrix_matrix.pdf
>> > * quad cores: http://dl.dropbox.com/u/260133/matrix_matrix-smp.pdf
>> >
>> > gael
>> >
>> >
>> > On Fri, Feb 26, 2010 at 1:02 PM, Aron Ahmadia <aja2111@xxxxxxxxxxxx>
>> > wrote:
>> >>
>> >> Those are some good notes, thanks Frank.
>> >>
>> >> It's easy to get confused there because he's assuming a distributed
>> >> memory layout, but still, that might be a useful technique to try and
>> >> apply.
>> >>
>> >> A
>> >>
>> >> On Fri, Feb 26, 2010 at 2:57 PM, FMDSPAM <fmdspam@xxxxxxxxx> wrote:
>> >> > On 26.02.2010 11:28, Aron Ahmadia wrote:
>> >> >
>> >> > <snip>
>> >> >
>> >> > Okay, this might be a bit tricky, so forgive me if I'm
>> >> > over-complicating things, can we introduce another subdivision?:
>> >> >
>> >> >
>> >> >
>> >> > Forgive my shameless plug: I came across a short discussion on that
>> >> > topic a while ago, here.
>> >> > Most of what he is discussing, and what you are doing, is beyond my
>> >> > skills, but perhaps it helps.
>> >> >
>> >> > Frank.
>> >> >
>> >> >
>> >>
>> >>
>> >
>> >
>>
>>
>
>


