Re: [eigen] Performance gap between gcc and msvc ?
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] Performance gap between gcc and msvc ?
- From: Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
- Date: Fri, 18 Jun 2010 15:09:04 -0400
This smells like code tuned for larger CPU caches than your Core i5 has. Indeed:
- you are using very large matrices, so it's crucial that blocks fit
in the CPU caches.
- Core i5s are mass-market CPUs, presumably with not-too-big caches.
Try finding out the size of your caches (e.g. cat /proc/cpuinfo on
Linux) and playing with Eigen's cache size settings (see recent thread
here).
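
To make the "blocks must fit in the cache" point concrete, here is a rough
sketch, assuming a Linux system that exposes cache sizes under
/sys/devices/system/cpu/cpu0/cache/ (index 2 is usually L2, but that
numbering is an assumption and varies between machines). The helper names
parseCacheSize, readCacheSize and maxSquareBlock are made up for this
illustration and are not part of Eigen. It only estimates how large a square
block of doubles fits in a given cache; it does not touch Eigen's settings.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <fstream>
#include <string>

// Convert a sysfs-style cache size string such as "32K" or "8M" to bytes.
// Returns 0 if the string does not start with a number.
std::size_t parseCacheSize(const std::string& text) {
    std::size_t pos = 0;
    unsigned long value = 0;
    try {
        value = std::stoul(text, &pos);
    } catch (...) {
        return 0;
    }
    std::size_t bytes = value;
    if (pos < text.size()) {
        const char unit = text[pos];
        if (unit == 'K' || unit == 'k') bytes *= 1024;
        else if (unit == 'M' || unit == 'm') bytes *= 1024 * 1024;
    }
    return bytes;
}

// Read the size of one cache level of cpu0 from sysfs (Linux only).
// Which index maps to which level is machine dependent (assumption).
// Returns 0 if the file is missing or unreadable.
std::size_t readCacheSize(int index) {
    std::ifstream file("/sys/devices/system/cpu/cpu0/cache/index" +
                       std::to_string(index) + "/size");
    std::string text;
    if (!(file >> text)) return 0;
    return parseCacheSize(text);
}

// Largest n such that an n x n block of doubles fits in cacheBytes.
std::size_t maxSquareBlock(std::size_t cacheBytes) {
    const std::size_t doubles = cacheBytes / sizeof(double);
    std::size_t n =
        static_cast<std::size_t>(std::sqrt(static_cast<double>(doubles)));
    // Correct for floating-point rounding in either direction.
    while ((n + 1) * (n + 1) <= doubles) ++n;
    while (n > 0 && n * n > doubles) --n;
    assert(n * n * sizeof(double) <= cacheBytes);
    return n;
}
```

Calling maxSquareBlock(readCacheSize(2)) from your own main() gives a first
guess at the block dimension to experiment with. For scale: a 2048x2048
matrix of doubles takes 32 MB, far more than any cache level on a desktop
CPU, so the blocking has to do all the work.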
Benoit
2010/6/18 <vincent.lejeune@xxxxxxxxxx>:
> In fact, after some investigation, I found out that the Core i5 performs
> very badly with gcc-produced code (some benchmarks on phoronix.com
> show this).
> It is actually slower than an 8-year-old Pentium 4, which does the
> computation in 4s.
>
> I've tried several optimisation flags (graphite, loop
> parallelisation...) with no success. I always get a computation of about
> 10s; at best I get 9.39s.
> I might try with icc later...
>
> On Fri, 18 Jun 2010 11:22:59 +0200, Gael Guennebaud
> <gael.guennebaud@xxxxxxxxx> wrote:
>> On Fri, Jun 18, 2010 at 10:07 AM, Hauke Heibel
>> <hauke.heibel@xxxxxxxxxxxxxx> wrote:
>>> On Fri, Jun 18, 2010 at 9:32 AM, <vincent.lejeune@xxxxxxxxxx> wrote:
>>>> I've done some performance comparisons between Windows and Linux,
>>>> using the blocked QR function.
>>>> I was using a Core i5 with 3 GB of memory, and I ran the decomposition
>>>> on a 2048x2048 random matrix of doubles on two operating systems:
>>>> - The first one is openSUSE 11.3 RC1 64-bit, shipped with gcc 4.5. I
>>>> got the computation done in 10s in release mode (that is, with -O3).
>>>> - The second one is Windows 7 64-bit, using Visual C++ 2010 Express. It
>>>> ships with the 32-bit version of the compiler, and I've heard that some
>>>> features like OpenMP are disabled. However, the computation was done
>>>> in 6s in release mode...
>>>
>>>> I've got something like a 40% performance drop for gcc in comparison to
>>>> VC++ 2010. I've heard that gcc-generated code was marginally slower than
>>>> MSVC's in some cases, but 40% is not negligible in my opinion.
>>>
>>> Typically it is vice versa, i.e. normally GCC produces faster code. In
>>> particular, 32-bit builds with MSVC are rather bad since the register
>>> handling of MSVC's 32-bit compiler is far from optimal. So you really
>>> seem to be missing some important flag for GCC. Could it be that you
>>> still have debug symbols enabled?
>>
>> I'm also surprised by your results because with Eigen we always found
>> that GCC outperformed MSVC.
>>
>> On my computer (Core2, 2.66GHz, 64-bit system, gcc 4.4), and with many
>> compilations in the background, I get the following timings (block
>> size = 128):
>>
>> 2048^2, float : 1.1 sec
>> 2048^2, double : 2.5 sec
>>
>> the compilation command:
>>
>> g++-4.4 -I.. -O3 -lrt -DNDEBUG bench_qr.cpp && ./a.out
>>
>> the test program:
>>
>> #include <Eigen/QR>
>> #include "BenchTimer.h"
>> #include <iostream>
>>
>> using namespace Eigen;
>>
>> int main()
>> {
>>   typedef MatrixXd Mat;
>>   int s = 2048;
>>   Mat m = Mat::Random(s,s);
>>   BenchTimer t;
>>   HouseholderQR<Mat> qr(m);
>>   // 4 tries of 1 repetition each; the timer keeps the best try
>>   BENCH(t, 4, 1, qr.compute(m));
>>   std::cout << t.value() << "s\n";
>> }
>>
>> With -O3 it is a bit slower. Actually, with Eigen the recommended
>> flags are simply -O2 -DNDEBUG for a 64-bit system; add -msse2 on
>> a 32-bit system to enable SSE optimizations.
>>
>> Regarding OpenMP, with gcc you can enable it with -fopenmp; however,
>> here there seems to be no gain because the blocks are too small...
>>
>> gael
>>
>>>> On another note, I ran the QR decomposition of a 2048x2048 random
>>>> matrix under Scilab on Windows, because Scilab ships with a
>>>> (binary-only) MKL on Windows. The computation is done in 2s.
>>>
>>>> I think that the difference may be explained by MSVC disabling OpenMP
>>>> in the Express version of the compiler, as the Core i5 has 4 logical
>>>> cores (2 physical + 2 hyperthreaded, I think), hence a performance
>>>> improvement. I would like to know whether Eigen uses OpenMP for matrix
>>>> products, simultaneously with vectorisation.
>>>
>>> OpenMP is always (!) disabled by default on MSVC, as is vectorization
>>> (but only for 32-bit builds). For 64-bit builds vectorization always
>>> takes place. OpenMP can be enabled under "Properties -> C/C++ ->
>>> Language -> Open MP Support". SSE is enabled via "Properties -> C/C++ ->
>>> Code Generation -> Enable Enhanced Instruction Set". Regarding the
>>> simultaneous usage, it is possible to use OpenMP and SSE together.
>>>
>>> Regards,
>>> Hauke