Re: [eigen] Performance gap between gcc and msvc ?



In fact, after some investigation, I found out that the Core i5 performs
very badly with gcc-produced code (some benchmarks on phoronix.com show
this).
It is actually slower than an 8-year-old Pentium 4, which does the
computation in 4 s.



I've tried several optimisation flags (graphite, loop
parallelisation...) with no success. The computation always takes about
10 s; at best I get 9.39 s.
I might try with icc later...



On Fri, 18 Jun 2010 11:22:59 +0200, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:

> On Fri, Jun 18, 2010 at 10:07 AM, Hauke Heibel <hauke.heibel@xxxxxxxxxxxxxx> wrote:

>> On Fri, Jun 18, 2010 at 9:32 AM,  <vincent.lejeune@xxxxxxxxxx> wrote:

>>> I've done a performance comparison between Windows and Linux, using
>>> the blocked QR function.
>>> I was using a Core i5 with 3 GB of memory, and I ran the decomposition
>>> on a 2048x2048 random double matrix on two operating systems:
>>> - The first one is openSUSE 11.3 RC1 64-bit, shipped with gcc 4.5. I
>>> got the computation done in 10 s in release mode (that is, with -O3).
>>> - The second one is Windows 7 64-bit, using Visual C++ 2010 Express. It
>>> ships with the 32-bit version of the compiler, and I've heard that some
>>> features like OpenMP are disabled. However, the computation was done in
>>> 6 s in release mode...

>>

>>> I've got something like a 40% performance drop for gcc in comparison to
>>> VC++ 2010. I've heard that gcc-generated code is marginally slower than
>>> MSVC's in some cases, but 40% is not negligible in my
>>> opinion.

>>

>> Typically it is vice versa, i.e. normally GCC produces faster code. In

>> particular 32bit builds with MSVC are rather bad since the register

>> handling of MSVC's 32bit compiler is far from optimal. So you really

>> seem to be missing some important flag for GCC. Could it be that you

>> still have debug symbols enabled?

> 

> I'm also surprised by your results because with Eigen we always found

> that GCC outperformed MSVC.

> 

> On my computer (Core2, 2.66GHz, 64bits system, gcc 4.4), and with many

> compilations in the background, I get the following timings (block

> size = 128):

> 

> 2048^2, float : 1.1 sec

> 2048^2, double : 2.5 sec

> 

> the compilation command:

> 

> g++-4.4 -I.. -O3 -lrt -DNDEBUG bench_qr.cpp && ./a.out

> 

> the test program:

> 

> #include <Eigen/QR>
> #include "BenchTimer.h"
> #include <iostream>
>
> using namespace Eigen;
> int main()
> {
>   typedef MatrixXd Mat;
>   int s = 2048;
>   Mat m = Mat::Random(s,s);
>   BenchTimer t;
>   HouseholderQR<Mat> qr(m);
>   BENCH(t, 4, 1, qr.compute(m));
>   std::cout << t.value() << "s\n";
> }

> 

> With -O3 it is a bit slower. Actually, with Eigen the recommended

> flags are simply -O2 -DNDEBUG for a 64bits system, and add -msse2 for

> a 32 bits system to enable SSE optimizations.

> 

> Regarding openmp, with gcc you can enable it with -fopenmp, however

> here it seems there is no gain because the blocks are too small...

> 

> gael

> 

>>

>>> On another note, I ran a QR decomposition of a 2048x2048 random matrix
>>> under Scilab on Windows, because Scilab ships with a (binary-only) MKL
>>> on Windows. The computation is done in 2 s.
>>
>>> I think that the difference may be explained by MSVC disabling OpenMP in
>>> the Express version of the compiler, as the Core i5 has 4 logical cores
>>> (2 physical + 2 hyperthreaded, I think), hence a performance improvement.
>>> I would like to know if Eigen uses OpenMP for the matrix product,
>>> simultaneously with vectorisation.

>>

>> OpenMP is always (!) disabled by default on MSVC, as is vectorization
>> (only in 32-bit builds). For 64-bit builds vectorization always takes
>> place. OpenMP can be enabled under "Properties -> C/C++ -> Language ->
>> Open MP Support". SSE is enabled via "Properties -> C/C++ -> Code
>> Generation -> Enable Enhanced Instruction Set". Regarding the
>> simultaneous usage, it is possible to use OpenMP and SSE.

>>

>> Regards,

>> Hauke



