Re: [eigen] Re: SGEMM benchmark result against ATLAS



I doubt your CPU runs at 1.66 GHz. It is certainly faster than that,
since this (very old) benchmark actually reports the number of
multiply-adds (madd) per second, so you can multiply your result by 2
to get the actual number of floating point operations per second.

I also recommend the newer benchmark:

g++ -O2 -DNDEBUG -I.. -L /usr/lib64/atlas/ bench_gemm.cpp -o bench_gemm \
    -lrt -lf77blas -DHAVE_BLAS && ./bench_gemm

which reports more information about cache and block sizes and allows
you to tweak the cache parameters.

./bench_gemm h

to get help.

In your case the L3 cache size is really what matters.

gael.


On Tue, Aug 24, 2010 at 5:58 AM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> Ah! Last spam I promise.
>
> For the record, this 6 MB is actually L3 cache, not L2 cache.
>
> My cpu has only 256 kB of L2 cache (per core) and 6 MB of L3 cache (shared
> among all cores).
>
> By contrast, Gael, all of your Core 2's cache is L2; there is no L3
> (information from Wikipedia).
>
> So, I'm wondering if the problem is that my cache is not as fast as Eigen
> expects.
>
> Benoit
>
> 2010/8/23 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
>>
>> Confirming that Eigen apparently did correctly detect my cache parameters:
>>
>>   std::cout << "l1: " << Eigen::l1CacheSize() << std::endl;
>>   std::cout << "l2: " << Eigen::l2CacheSize() << std::endl;
>>
>> l1: 32768
>> l2: 6291456
>>
>> Indeed, I have 6 MB of L2 cache.
>>
>> Benoit
>>
>> 2010/8/23 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
>>>
>>> Hi,
>>>
>>> Hearing from Keir that he saw untuned ATLAS outperform us by a 30%
>>> margin, which would be very unusual, I ran our benchBlasGemm a bit. By the
>>> way, I updated it to make it compile, which involved removing the
>>> eigen_..._normal path which didn't look useful (?), hope it's OK. Also, it
>>> was missing an extern "C" around the cblas #include.
>>>
>>> So I installed the most optimized ATLAS package that I could on Fedora,
>>> built with SSE3.
>>>
>>> I compiled our benchmark with:
>>>
>>> cd eigen/bench/
>>> g++ -O3 -msse3 -I.. -L /usr/lib64/atlas/ benchBlasGemm.cpp -o benchBlasGemm \
>>>     -lrt -lcblas
>>>
>>> And ran it on some 4096x4096 matrices:
>>>
>>> [bjacob@cahouette bench]$ ./benchBlasGemm 4096
>>> 4096 x 4096 x 4096
>>> cblas: 8.73982 (7.862 GFlops/s)
>>> eigen : 8.9491 (7.678 GFlops/s)
>>> [bjacob@cahouette bench]$ ./benchBlasGemm 4096
>>> 4096 x 4096 x 4096
>>> cblas: 8.51913 (8.066 GFlops/s)
>>> eigen : 8.42922 (8.152 GFlops/s)
>>>
>>> So _my_ results show Eigen3 and ATLAS running at roughly the same speed,
>>> albeit with considerable variability.
>>>
>>> This is still perplexing for 2 reasons:
>>>  - we used to beat ATLAS by a wide margin.
>>>  - the roughly 8 GFlops here are not too good. My CPU is a Core i7 at
>>> 1.66 GHz. So x4 (because of float) and x2 (pipelining of addps and mulps) we
>>> should aim at 13.33 GFlops. So we are running here at only 60% of the
>>> theoretical maximum; I think we used to do much better than that.
>>>
>>> So let me ask Gael and Keir:
>>> * Keir: what do you get on this benchmark? How did you get this result
>>> where ATLAS outperformed us by 30%?
>>> * Gael: suppose I want to get deeper into this, where do I start?
>>>
>>> Cheers,
>>> Benoit
>>
>
>


