Re: [eigen] Performance gap between gcc and msvc ?



This smells like code tuned for larger CPU caches than your Core i5 has. Indeed:
 - you are using very large matrices, so it's crucial that blocks fit
in the CPU caches.
 - Core i5s are mass-market CPUs with presumably not-too-big caches.

Try finding out the size of your caches (e.g. cat /proc/cpuinfo on
Linux) and playing with Eigen's cache size settings (see recent thread
here).
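
As a rough illustration of the "find out your cache sizes" step, here is a minimal sketch that asks the C library for the L1 data cache size and estimates how large a square block of doubles could fit. The helper names (l1d_bytes, max_block_edge) are hypothetical, the _SC_LEVEL1_DCACHE_SIZE query is a glibc extension and may report 0 on some systems, and the "three blocks must fit at once" rule is only an assumed rule of thumb, not Eigen's actual blocking logic.

```cpp
#include <cassert>
#include <cstddef>
#include <unistd.h>  // sysconf; the _SC_LEVEL* names are a glibc extension

// L1 data cache size in bytes, or 0 when the platform does not report it.
inline std::size_t l1d_bytes() {
#ifdef _SC_LEVEL1_DCACHE_SIZE
    long v = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    return v > 0 ? static_cast<std::size_t>(v) : 0;
#else
    return 0;  // not available on this C library
#endif
}

// Largest edge n such that `blocks` square n x n blocks of doubles
// fit together in `bytes` (assumed rule of thumb, see lead-in).
inline std::size_t max_block_edge(std::size_t bytes, std::size_t blocks = 3) {
    assert(blocks > 0);
    const std::size_t doubles_per_block = bytes / (blocks * sizeof(double));
    std::size_t n = 0;
    while ((n + 1) * (n + 1) <= doubles_per_block) ++n;  // integer square root
    return n;
}
```

Calling max_block_edge(l1d_bytes()) gives a first guess for a block edge to try; the timings remain the only real judge.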

Benoit

2010/6/18  <vincent.lejeune@xxxxxxxxxx>:
>
> In fact, after some investigation, I found out that the Core i5 performs
> very badly with gcc-produced code (there are some benchmarks on phoronix.com
> that show this).
> It is actually slower than an 8-year-old Pentium 4, which does the
> computation in 4 s.
>
> I've tried several optimisation flags (graphite, loop
> parallelisation...) with no success. I always get a 10 s computation; at
> best I get 9.39 s.
> I might try with icc later...
>
> On Fri, 18 Jun 2010 11:22:59 +0200, Gael Guennebaud
> <gael.guennebaud@xxxxxxxxx> wrote:
>> On Fri, Jun 18, 2010 at 10:07 AM, Hauke Heibel
>> <hauke.heibel@xxxxxxxxxxxxxx> wrote:
>>> On Fri, Jun 18, 2010 at 9:32 AM,  <vincent.lejeune@xxxxxxxxxx> wrote:
>>>> I've done some performance comparisons between Windows and Linux
>>>> using the blocked QR function.
>>>> I was using a Core i5 with 3 GB of memory, and I ran the decomposition on
>>>> a 2048x2048 random double matrix on two operating systems:
>>>> - The first is openSUSE 11.3 RC1 64-bit, shipped with gcc 4.5. I
>>>> got the computation done in 10 s in release mode (that is, with -O3).
>>>> - The second is Windows 7 64-bit, using Visual C++ 2010 Express. It
>>>> ships with the 32-bit version of the compiler, and I've heard that some
>>>> features like OpenMP are disabled. However, the computation was done in
>>>> 6 s in release mode...
>>>
>>>> I've got something like a 40% performance drop for gcc in comparison to
>>>> VC++ 2010. I've heard that gcc-generated code is marginally slower than
>>>> MSVC's in some cases, but 40% is not negligible in my
>>>> opinion.
>>>
>>> Typically it is vice versa, i.e. normally GCC produces faster code. In
>>> particular, 32-bit builds with MSVC are rather bad since the register
>>> handling of MSVC's 32-bit compiler is far from optimal. So you really
>>> seem to be missing some important flag for GCC. Could it be that you
>>> still have debug symbols enabled?
>>
>> I'm also surprised by your results because with Eigen we always found
>> that GCC outperformed MSVC.
>>
>> On my computer (Core2, 2.66GHz, 64bits system, gcc 4.4), and with many
>> compilations in the background, I get the following timings (block
>> size = 128):
>>
>> 2048^2, float : 1.1 sec
>> 2048^2, double : 2.5 sec
>>
>> the compilation command:
>>
>> g++-4.4 -I.. -O3 -lrt -DNDEBUG bench_qr.cpp && ./a.out
>>
>> the test program:
>>
>> #include <Eigen/QR>
>> #include "BenchTimer.h"
>> #include <iostream>
>>
>> using namespace Eigen;
>> int main()
>> {
>>   typedef MatrixXd Mat;
>>   int s = 2048;
>>   Mat m = Mat::Random(s,s);
>>   BenchTimer t;
>>   HouseholderQR<Mat> qr(m);
>>   BENCH(t, 4, 1, qr.compute(m));
>>   std::cout << t.value() << "s\n";
>> }
>>
>> With -O3 it is a bit slower. Actually, with Eigen the recommended
>> flags are simply -O2 -DNDEBUG for a 64bits system, and add -msse2 for
>> a 32 bits system to enable SSE optimizations.
>>
>> Regarding openmp, with gcc you can enable it with -fopenmp, however
>> here it seems there is no gain because the blocks are too small...
>>
>> gael
>>
>>>
>>>> On another note, I ran the QR decomposition for a 2048x2048 random matrix
>>>> under Scilab on Windows, because Scilab ships with a (binary-only) MKL on
>>>> Windows. The computation is done in 2 s.
>>>
>>>> I think that the difference may be explained by MSVC disabling OpenMP on
>>>> the Express version of the compiler, as the Core i5 has 4 logical cores
>>>> (2 physical + 2 hyperthreaded, I think), hence a performance improvement. I
>>>> would like to know whether Eigen uses OpenMP for the matrix product,
>>>> simultaneously with vectorisation.
>>>
>>> OpenMP is always (!) disabled by default on MSVC, as is vectorization
>>> (only under 32-bit builds). For 64-bit builds vectorization always takes
>>> place. OpenMP can be enabled under "Properties -> C/C++ -> Language ->
>>> Open MP Support". SSE is enabled via "Properties -> C/C++ -> Code
>>> Generation -> Enable Enhanced Instruction Set". Regarding the
>>> simultaneous usage, it is possible to use OpenMP and SSE.
>>>
>>> Regards,
>>> Hauke
>>>
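
Since much of the quoted discussion turns on which flags were actually in effect (-DNDEBUG, -msse2, -fopenmp, or the matching MSVC project settings), a small check compiled with the same flags as the benchmark may help rule out a missing option. This is only a sketch: BuildConfig, build_config and describe are hypothetical names, not part of Eigen; the preprocessor macros tested are the standard ones that GCC and MSVC define for these features.

```cpp
#include <string>

// Which of the options discussed in this thread the compiler actually saw.
struct BuildConfig {
    bool ndebug;  // -DNDEBUG (assertions disabled)
    bool sse2;    // -msse2, /arch:SSE2, or any x86-64 target
    bool openmp;  // -fopenmp or /openmp
};

inline BuildConfig build_config() {
    BuildConfig c{};  // all fields start out false
#ifdef NDEBUG
    c.ndebug = true;
#endif
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    c.sse2 = true;
#endif
#ifdef _OPENMP
    c.openmp = true;
#endif
    return c;
}

// One-line summary suitable for printing next to the benchmark timing.
inline std::string describe(const BuildConfig& c) {
    auto on_off = [](bool b) { return std::string(b ? "on" : "off"); };
    return "NDEBUG=" + on_off(c.ndebug) + " SSE2=" + on_off(c.sse2) +
           " OpenMP=" + on_off(c.openmp);
}
```

Printing describe(build_config()) from the benchmark's main() shows at a glance whether, say, the 32-bit MSVC build really had SSE2 enabled.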


