Re: [eigen] Performance gap between gcc and msvc ?
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] Performance gap between gcc and msvc ?
- From: Gael Guennebaud <gael.guennebaud@xxxxxxxxx>
- Date: Fri, 18 Jun 2010 23:31:01 +0200
There is an easy way to check that: go to eigen/bench and run:
g++-4.4 -O2 -DNDEBUG bench_gemm.cpp -I .. -lrt -o gemm && ./gemm
Here, on a Core 2 @ 2 GHz, I get:
Matrix size = 2048
blocking size = 512 x 256
eigen cpu 1.2136s 14.1561 GFLOPS (6.10134s)
eigen real 1.22459s 14.0291 GFLOPS (6.19005s)
On recent CPUs, the peak performance is cpu_freq * 8, so in my case 16
GFLOPS. The ratio of observed performance to theoretical peak gives a good
idea of how well the matrix product performs.
If you 'hg pull -u', then you can tweak the blocking sizes as follows:
./gemm c<cache size>
where you replace <cache size> with whatever works best for your CPU.
In my case it is:
./gemm c1048576
(1048576 bytes == 1 MB.)
Actually, I'd really like to know what happens on a Core i5 and/or i7
(which are very similar) because they have a quite different cache
layout: a small L2 cache per core and one large shared L3, while older
CPUs have only two levels of cache, with a large, shared L2.
cheers,
gael
On Fri, Jun 18, 2010 at 9:09 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> This smells like code tuned for larger CPU caches than your Core i5 has. Indeed:
> - you are using very large matrices, so it's crucial that blocks fit
> in the CPU caches.
> - Core i5s are mass-market CPUs, presumably with fairly small caches.
>
> Try finding out the size of your caches (e.g. cat /proc/cpuinfo on
> Linux) and playing with Eigen's cache size settings (see the recent thread
> here).
>
> Benoit
>
> 2010/6/18 <vincent.lejeune@xxxxxxxxxx>:
>> In fact, after some investigation, I found out that the Core i5 performs
>> very badly with gcc-produced code (there are some benchmarks on phoronix.com
>> that show this).
>> It is actually slower than an 8-year-old Pentium 4, which does the
>> computation in 4 s.
>>
>> I've tried several optimisation flags (graphite, loop
>> parallelisation...) with no success. I always get a 10 s computation; at
>> best I get 9.39 s.
>> I might try with icc later...
>>
>> On Fri, 18 Jun 2010 11:22:59 +0200, Gael Guennebaud
>> <gael.guennebaud@xxxxxxxxx> wrote:
>>> On Fri, Jun 18, 2010 at 10:07 AM, Hauke Heibel
>>> <hauke.heibel@xxxxxxxxxxxxxx> wrote:
>>>> On Fri, Jun 18, 2010 at 9:32 AM, <vincent.lejeune@xxxxxxxxxx> wrote:
>>>>> I've done some performance comparisons between Windows and Linux using
>>>>> the blocked QR function.
>>>>> I was using a Core i5 with 3 GB of memory, and I ran the decomposition on
>>>>> a 2048x2048 random double matrix on two operating systems:
>>>>> - The first one is openSUSE 11.3 RC1 64-bit, shipped with gcc 4.5. I
>>>>> got the computation done in 10 s in release mode (that is, with -O3).
>>>>> - The second one is Windows 7 64-bit, using Visual C++ 2010 Express. It
>>>>> ships with the 32-bit version of the compiler, and I've heard that some
>>>>> features like OpenMP are disabled. However, the computation was done in
>>>>> 6 s in release mode...
>>>>
>>>>> I've got something like a 40% performance drop for gcc in comparison to
>>>>> VC++ 2010. I've heard that gcc-generated code is marginally slower than
>>>>> MSVC's in some cases, but 40% is not negligible in my
>>>>> opinion.
>>>>
>>>> Typically it is vice versa, i.e. normally GCC produces faster code. In
>>>> particular, 32-bit builds with MSVC are rather bad since the register
>>>> handling of MSVC's 32-bit compiler is far from optimal. So you really
>>>> seem to be missing some important flag for GCC. Could it be that you
>>>> still have debug symbols enabled?
>>>
>>> I'm also surprised by your results because with Eigen we always found
>>> that GCC outperformed MSVC.
>>>
>>> On my computer (Core2, 2.66 GHz, 64-bit system, gcc 4.4), and with many
>>> compilations in the background, I get the following timings (block
>>> size = 128):
>>>
>>> 2048^2, float : 1.1 sec
>>> 2048^2, double : 2.5 sec
>>>
>>> the compilation command:
>>>
>>> g++-4.4 -I.. -O3 -lrt -DNDEBUG bench_qr.cpp && ./a.out
>>>
>>> the test program:
>>>
>>> #include <Eigen/QR>
>>> #include "BenchTimer.h"
>>> #include <iostream>
>>>
>>> using namespace Eigen;
>>>
>>> int main()
>>> {
>>>   typedef MatrixXd Mat;
>>>   int s = 2048;                  // matrix dimension
>>>   Mat m = Mat::Random(s,s);      // random s x s matrix of doubles
>>>   BenchTimer t;
>>>   HouseholderQR<Mat> qr(m);      // constructing from m already decomposes once
>>>   BENCH(t, 4, 1, qr.compute(m)); // time the decomposition: 4 tries, 1 repetition each
>>>   std::cout << t.value() << "s\n";
>>> }
>>>
>>> With -O3 it is a bit slower. Actually, with Eigen the recommended
>>> flags are simply -O2 -DNDEBUG for a 64-bit system, plus -msse2 for
>>> a 32-bit system to enable SSE optimizations.
>>>
>>> Regarding OpenMP, with gcc you can enable it with -fopenmp; however,
>>> here there seems to be no gain because the blocks are too small...
>>>
>>> gael
>>>
>>>>
>>>>> On another note, I ran the QR decomposition of a 2048x2048 random matrix
>>>>> under Scilab on Windows, because Scilab ships with a (binary-only) MKL on
>>>>> Windows. The computation is done in 2 s.
>>>>
>>>>> I think that the difference may be explained by MSVC disabling OpenMP on
>>>>> the Express version of the compiler, as the Core i5 has 4 logical cores
>>>>> (2 physical + 2 hyperthreaded, I think), hence a performance improvement. I
>>>>> would like to know if Eigen uses OpenMP for matrix products,
>>>>> simultaneously with vectorisation.
>>>>
>>>> OpenMP is always (!) disabled by default on MSVC, as is vectorization
>>>> (only under 32-bit builds). For 64-bit builds vectorization always takes
>>>> place. OpenMP can be enabled under "Properties -> C/C++ -> Language ->
>>>> Open MP Support". SSE is enabled via "Properties -> C/C++ -> Code
>>>> Generation -> Enable Enhanced Instruction Set". Regarding the
>>>> simultaneous usage, it is possible to use OpenMP and SSE.
>>>>
>>>> Regards,
>>>> Hauke