Re: [eigen] a branch for SMP (openmp) experimentations
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] a branch for SMP (openmp) experimentations
- From: Aron Ahmadia <aja2111@xxxxxxxxxxxx>
- Date: Fri, 26 Feb 2010 23:17:54 +0300
Some nice bench results coming off the X5550 @ 2.67GHz
(single-precision)
[aron@kw2050]~/sandbox/eigen-smp/bench% g++ bench_gemm.cpp -DNDEBUG -DHAVE_BLAS -I.. -O2 -fopenmp -lrt -lblas -o ./bench && /usr/bin/time -p ./bench
blas cpu 0.133795s 2.00632 GFLOPS (14.0114s)
blas real 0.133813s 2.00605 GFLOPS (13.3878s)
eigen cpu 0.0191605s 14.0098 GFLOPS (1.92616s)
eigen real 0.0024013s 111.787 GFLOPS (0.241387s)
real 13.79
user 16.08
sys 0.13
For whatever reason, the BLAS isn't built multi-threaded, but its
performance is pretty terrible even single-threaded. If these numbers
are to be believed, Gael's multi-threaded multiply scales with 99.7%
efficiency on the X5550, averaging 2.6/4 SIMD fused multiply-add
operations per cycle in single precision.
(double-precision)
[aron@kw2050]~/sandbox/eigen-smp/bench% g++ bench_gemm.cpp -DNDEBUG -DHAVE_BLAS -I.. -O2 -fopenmp -lrt -lblas -o ./bench && /usr/bin/time -p ./bench
Warning, your parallel product is crap! <I need to fix this>
blas cpu 0.13462s 1.99402 GFLOPS (14.0937s)
blas real 0.134625s 1.99395 GFLOPS (13.4901s)
eigen cpu 0.0363907s 7.37649 GFLOPS (3.70925s)
eigen real 0.00455555s 58.925 GFLOPS (0.465924s)
real 14.11
user 17.95
sys 0.11
Again, near-perfect scaling, and eigen is averaging 1.4/2 SIMD fused
multiply-add operations per cycle in double precision.
I'll look at this more later this week, and I'd like to more carefully
verify these numbers since they're pretty astonishing to me. Gael,
I'm happy to give you an honorary A+ in my Parallel Computing
Paradigms course if these are legit.
A
On Fri, Feb 26, 2010 at 4:26 PM, Gael Guennebaud
<gael.guennebaud@xxxxxxxxx> wrote:
>
> Thank you for the link too :)
>
> And to entertain everybody following our adventures, here are the mandatory
> pictures:
>
> * single core: http://dl.dropbox.com/u/260133/matrix_matrix.pdf
> * quad cores: http://dl.dropbox.com/u/260133/matrix_matrix-smp.pdf
>
> gael
>
>
> On Fri, Feb 26, 2010 at 1:02 PM, Aron Ahmadia <aja2111@xxxxxxxxxxxx> wrote:
>>
>> Those are some good notes, thanks Frank.
>>
>> It's easy to get confused there because he's assuming a distributed
>> memory layout, but still, that might be a useful technique to try and
>> apply.
>>
>> A
>>
>> On Fri, Feb 26, 2010 at 2:57 PM, FMDSPAM <fmdspam@xxxxxxxxx> wrote:
>> > On 26.02.2010 11:28, Aron Ahmadia wrote:
>> >
>> > <snip>
>> >
>> > Okay, this might be a bit tricky, so forgive me if I'm
>> > over-complicating things, can we introduce another subdivision?:
>> >
>> >
>> >
>> > Forgive me my shameless plug. Some time ago I found a short discussion
>> > of that topic here.
>> > Most of what he is discussing, and what you are doing, is beyond my
>> > skills, but possibly it helps.
>> >
>> > Frank.
>> >
>> >
>>
>>
>
>