On Tue, Aug 24, 2010 at 5:58 AM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> Ah! Last spam I promise.
>
> This 6 MB is actually an L3 cache, not an L2 cache, for the record.
>
> My cpu has only 256 kB of L2 cache (per core) and 6 MB of L3 cache (shared
> among all cores).
>
> By contrast, Gael, all of your Core 2's cache is L2, there's no L3.
> (Information from Wikipedia).
>
> So, I'm wondering if the problem is that my cache is not as fast as Eigen
> expects.
>
> Benoit
>
> 2010/8/23 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
>>
>> Confirming that Eigen apparently did correctly detect my cache parameters:
>>
>> std::cout << "l1: " << Eigen::l1CacheSize() << std::endl;
>> std::cout << "l2: " << Eigen::l2CacheSize() << std::endl;
>>
>> l1: 32768
>> l2: 6291456
>>
>> Indeed, I have 6 MB of L2 cache.
>>
>> Benoit
>>
>> 2010/8/23 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
>>>
>>> Hi,
>>>
>>> Hearing from Keir that he saw untuned ATLAS outperform us by a 30%
>>> margin, which would be very unusual, I ran our benchBlasGemm a bit. By the
>>> way, I updated it to make it compile, which involved removing the
>>> eigen_..._normal path which didn't look useful (?), hope it's OK. Also, it
>>> was missing an extern "C" around the cblas #include.
>>>
>>> So I installed the most optimized ATLAS package that I could on Fedora,
>>> built with SSE3.
>>>
>>> I compiled our benchmark with:
>>>
>>> cd eigen/bench/
>>> g++ -O3 -msse3 -I.. -L /usr/lib64/atlas/ benchBlasGemm.cpp -o benchBlasGemm -lrt -lcblas
>>>
>>> And ran it on some 4096x4096 matrices:
>>>
>>> [bjacob@cahouette bench]$ ./benchBlasGemm 4096
>>> 4096 x 4096 x 4096
>>> cblas: 8.73982 (7.862 GFlops/s)
>>> eigen : 8.9491 (7.678 GFlops/s)
>>> [bjacob@cahouette bench]$ ./benchBlasGemm 4096
>>> 4096 x 4096 x 4096
>>> cblas: 8.51913 (8.066 GFlops/s)
>>> eigen : 8.42922 (8.152 GFlops/s)
>>>
>>> So _my_ results show Eigen3 and ATLAS running at roughly the same speed,
>>> albeit with great variability.
>>>
>>> This is still perplexing for 2 reasons:
>>> - we used to beat ATLAS by a wide margin.
>>> - the roughly 8 GFlops here are not too good. My CPU is a Core i7 at
>>> 1.66 GHz. Multiplying by 4 (four floats per SSE register) and by 2 (addps
>>> and mulps pipelined together), we should aim at 13.33 GFlops. So we are
>>> running here at only 60% of the theoretical maximum; I think we used to do
>>> much better than that.
>>>
>>> So let me ask Gael and Keir:
>>> * Keir: what do you get on this benchmark? How did you get this result
>>> where ATLAS outperformed us by 30%?
>>> * Gael: suppose I want to get deeper into this, where do I start?
>>>
>>> Cheers,
>>> Benoit
>>
>
>
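The peak-throughput estimate in the original message (clock rate x 4 floats per SSE register x 2 for an addps and a mulps per cycle) can be written out as a small sketch. The function names `peak_gflops` and `efficiency` are hypothetical helpers introduced here for illustration; they are not part of Eigen or of benchBlasGemm. The model assumes a single core and ignores turbo frequencies and memory effects.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: theoretical single-core peak in GFlops, assuming the
// core issues `ops_per_cycle` packed operations per cycle (2 in the thread:
// one addps and one mulps), each on `simd_width` floats (4 for SSE).
double peak_gflops(double clock_ghz, int simd_width, int ops_per_cycle) {
    assert(clock_ghz > 0.0 && simd_width > 0 && ops_per_cycle > 0);
    return clock_ghz * simd_width * ops_per_cycle;
}

// Hypothetical helper: fraction of the theoretical peak actually reached.
double efficiency(double measured_gflops, double peak) {
    assert(peak > 0.0);
    return measured_gflops / peak;
}
```

With the thread's numbers, `peak_gflops(1.66, 4, 2)` gives 13.28 (the 13.33 quoted above corresponds to a clock of 1.666 GHz), and `efficiency(8.0, 13.28)` is about 0.60, matching the 60% figure.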