Re: [eigen] Re: SGEMM benchmark result against ATLAS

2010/8/24 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>

2010/8/24 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>

2010/8/24 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>

I doubt your CPU runs at 1.66 GHz.

It is actually 1.60 GHz plus "turbo boost" as Rohit points out.

It is certainly faster than that
since this (very old) benchmark actually returns the number of madd
per second, so you can multiply your result by 2 to get the actual
number of floating point operations.
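
To make that factor of two concrete, here is a minimal sketch, assuming a square n x n x n product. The helper names madds_per_second and flops_per_second are hypothetical and are not part of Eigen or of the benchmark. Fed the first 4096 run from the original message (8.73982 s), it gives about 7.86e9 multiply-adds per second, which matches the "7.862 GFlops/s" that benchBlasGemm printed, so the true rate was roughly 15.7 GFLOPS.

```cpp
#include <cassert>

// Multiply-adds per second for an n x n x n matrix product.
// An n x n x n product performs n^3 multiply-adds.
double madds_per_second(double n, double seconds) {
  assert(seconds > 0.0);
  return n * n * n / seconds;
}

// Each multiply-add is one multiplication plus one addition,
// so the floating point operation rate is twice the madd rate.
double flops_per_second(double n, double seconds) {
  return 2.0 * madds_per_second(n, seconds);
}
```

Calling flops_per_second(4096, 8.73982) returns a value close to 1.57e10.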

Ah! I see.

I also recommend the newer bench:

g++ -O2 -DNDEBUG -I.. -L /usr/lib64/atlas/ bench_gemm.cpp -o
bench_gemm -lrt -lf77blas -DHAVE_BLAS && ./bench_gemm

[bjacob@cahouette bench]$ g++ -O2 -DNDEBUG -I.. -L /usr/lib64/atlas/ bench_gemm.cpp -o bench_gemm -lrt -lf77blas -DHAVE_BLAS && ./bench_gemm

L1 cache size     = 32 KB
L2/L3 cache size = 6144 KB
Register blocking = 8 x 4
Matrix sizes = 2048x2048 * 2048x2048
blocking size (mc x kc) = 1536 x 256
blas cpu         1.08927s      15.7719 GFLOPS (2.24222s)
blas real        1.09075s      15.7505 GFLOPS (2.24748s)
eigen cpu         1.04802s      16.3926 GFLOPS (2.19929s)
eigen real        1.04958s      16.3684 GFLOPS (2.2025s)
[bjacob@cahouette bench]$
[bjacob@cahouette bench]$ g++ -O2 -DNDEBUG -I.. -L /usr/lib64/atlas/ bench_gemm.cpp -o bench_gemm -lrt -lf77blas -DHAVE_BLAS && ./bench_gemm
L1 cache size     = 32 KB
L2/L3 cache size = 6144 KB
Register blocking = 8 x 4
Matrix sizes = 2048x2048 * 2048x2048
blocking size (mc x kc) = 1536 x 256
blas cpu         1.04808s      16.3918 GFLOPS (2.18119s)
blas real        1.04965s      16.3672 GFLOPS (2.18427s)
eigen cpu         1.09635s      15.6701 GFLOPS (2.19431s)
eigen real        1.09773s      15.6504 GFLOPS (2.19706s)
[bjacob@cahouette bench]$ g++ -O2 -DNDEBUG -I.. -L /usr/lib64/atlas/ bench_gemm.cpp -o bench_gemm -lrt -lf77blas -DHAVE_BLAS && ./bench_gemm
L1 cache size     = 32 KB
L2/L3 cache size = 6144 KB
Register blocking = 8 x 4
Matrix sizes = 2048x2048 * 2048x2048
blocking size (mc x kc) = 1536 x 256
blas cpu         1.07202s      16.0258 GFLOPS (2.16692s)
blas real        1.07366s      16.0012 GFLOPS (2.17015s)
eigen cpu         1.10482s      15.5499 GFLOPS (2.21542s)
eigen real        1.10639s      15.5279 GFLOPS (2.21853s)

As Rohit mentions, I have to disable turbo boost if I want to measure efficiency.

OK, so the only way I found to disable turbo boost was to set scaling_max_freq to something other than the baseline maximum. So I set it to 933000, the minimum value, i.e. 933 MHz.

I got:

[bjacob@cahouette bench]$ g++ -O2 -DNDEBUG -I.. -L /usr/lib64/atlas/ bench_gemm.cpp -o bench_gemm -lrt -lf77blas -DHAVE_BLAS && ./bench_gemm
L1 cache size     = 32 KB
L2/L3 cache size = 6144 KB
Register blocking = 8 x 4
Matrix sizes = 2048x2048 * 2048x2048
blocking size (mc x kc) = 1536 x 256
blas cpu         3.16852s      5.42204 GFLOPS (6.34579s)
blas real        3.17619s      5.40895 GFLOPS (6.36257s)
eigen cpu         3.19321s      5.38012 GFLOPS (6.40601s)
eigen real        3.20105s      5.36695 GFLOPS (6.43292s)

[bjacob@cahouette bench]$ g++ -O2 -DNDEBUG -I.. -L /usr/lib64/atlas/ bench_gemm.cpp -o bench_gemm -lrt -lf77blas -DHAVE_BLAS && ./bench_gemm
L1 cache size     = 32 KB
L2/L3 cache size = 6144 KB
Register blocking = 8 x 4
Matrix sizes = 2048x2048 * 2048x2048
blocking size (mc x kc) = 1536 x 256
blas cpu         3.14275s      5.4665 GFLOPS (6.32277s)
blas real        3.16762s      5.42359 GFLOPS (6.36551s)
eigen cpu         3.14266s      5.46666 GFLOPS (6.40601s)
eigen real        3.15602s      5.44353 GFLOPS (6.4333s)

So let's compute: 5.46 / (0.933 * 8) = 0.73, so I am running at 73% of my CPU's theoretical maximum if we consider it can do 8 flops per cycle.
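
The same arithmetic as a small sketch, assuming the peak is simply clock frequency times flops per cycle. The function names peak_gflops and efficiency are hypothetical, and the figure of 8 flops per cycle is the assumption stated above, not something the code measures.

```cpp
#include <cassert>

// Theoretical peak in GFLOPS: clock in GHz times flops issued per cycle.
double peak_gflops(double clock_ghz, double flops_per_cycle) {
  assert(clock_ghz > 0.0 && flops_per_cycle > 0.0);
  return clock_ghz * flops_per_cycle;
}

// Fraction of the theoretical peak reached by a measured GFLOPS figure.
double efficiency(double measured_gflops, double clock_ghz,
                  double flops_per_cycle) {
  return measured_gflops / peak_gflops(clock_ghz, flops_per_cycle);
}
```

With the numbers above, efficiency(5.46, 0.933, 8.0) comes out at about 0.73.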

Isn't that less than what you had previously measured?

Basically, my fear is that being very efficient on a Core i7 would require another level of blocking.
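
For readers unfamiliar with the term, here is a rough sketch of what an extra level of blocking could look like, assuming row-major square matrices. This is not Eigen's kernel: it does no packing and no register blocking, and the parameters mc, kc and nc are hypothetical tuning knobs. The outer loop over column panels of width nc is the additional level, on top of the mc x kc blocking that bench_gemm reports.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// C += A * B for row-major n x n matrices, using three levels of blocking.
void blocked_gemm(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t n,
                  std::size_t mc, std::size_t kc, std::size_t nc) {
  assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
  assert(mc > 0 && kc > 0 && nc > 0);
  // Extra level: column panels of B and C, nc columns wide.
  for (std::size_t j0 = 0; j0 < n; j0 += nc) {
    const std::size_t j1 = std::min(j0 + nc, n);
    // Depth blocks of kc along the shared dimension.
    for (std::size_t k0 = 0; k0 < n; k0 += kc) {
      const std::size_t k1 = std::min(k0 + kc, n);
      // Row blocks of A and C, mc rows tall.
      for (std::size_t i0 = 0; i0 < n; i0 += mc) {
        const std::size_t i1 = std::min(i0 + mc, n);
        // Plain triple loop over the current block.
        for (std::size_t i = i0; i < i1; ++i) {
          for (std::size_t k = k0; k < k1; ++k) {
            const float a = A[i * n + k];
            for (std::size_t j = j0; j < j1; ++j) {
              C[i * n + j] += a * B[k * n + j];
            }
          }
        }
      }
    }
  }
}
```

The block sizes only change the order in which the products are accumulated, so any positive values of mc, kc and nc compute the same product up to floating point rounding. Which values suit a given cache hierarchy would have to be found by measurement.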

Benoit

Benoit

Benoit

which reports more information about cache and block sizes and allows you to
tweak the cache parameters.

./bench_gemm h

to get help.

In your case the L3 cache size is really what matters.

gael.

On Tue, Aug 24, 2010 at 5:58 AM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> Ah! Last spam I promise.
>
> This 6 MB is actually an L3 cache, not an L2 cache, for the record.
>
> My cpu has only 256 kB of L2 cache (per core) and 6 MB of L3 cache (shared
> among all cores).
>
> By contrast, Gael, all of your Core 2's cache is L2, there's no L3.
> (Information from wikipedia).
>
> So, I'm wondering if the problem is that my cache is not as fast as Eigen
> expects.
>
> Benoit
>
> 2010/8/23 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
>>
>> Confirming that Eigen apparently did correctly detect my cache parameters:
>>
>> std::cout << "l1: " << Eigen::l1CacheSize() << std::endl;
>> std::cout << "l2: " << Eigen::l2CacheSize() << std::endl;
>>
>> l1: 32768
>> l2: 6291456
>>
>> Indeed, I have 6 MB of L2 cache.
>>
>> Benoit
>>
>> 2010/8/23 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
>>>
>>> Hi,
>>>
>>> Hearing from Keir that he saw untuned ATLAS outperform us by a 30%
>>> margin, which would be very unusual, I ran our benchBlasGemm a bit. By the
>>> way, I updated it to make it compile, which involved removing the
>>> eigen_..._normal path which didn't look useful (?), hope it's OK. Also, it
>>> was missing an extern "C" around the cblas #include.
>>>
>>> So I installed the most optimized ATLAS package that I could on Fedora,
>>> built with SSE3.
>>>
>>> I compiled our benchmark with:
>>>
>>> cd eigen/bench/
>>> g++ -O3 -msse3 -I.. -L /usr/lib64/atlas/ benchBlasGemm.cpp -o
>>> benchBlasGemm -lrt -lcblas
>>>
>>> And ran it on some 4096x4096 matrices:
>>>
>>> [bjacob@cahouette bench]$ ./benchBlasGemm 4096
>>> 4096 x 4096 x 4096
>>> cblas: 8.73982 (7.862 GFlops/s)
>>> eigen : 8.9491 (7.678 GFlops/s)
>>> [bjacob@cahouette bench]$ ./benchBlasGemm 4096
>>> 4096 x 4096 x 4096
>>> cblas: 8.51913 (8.066 GFlops/s)
>>> eigen : 8.42922 (8.152 GFlops/s)
>>>
>>> So _my_ results show Eigen3 and ATLAS running at the same speed roughly,
>>> albeit with great variability.
>>>
>>> This is still perplexing for 2 reasons:
>>> - we used to beat ATLAS by a wide margin.
>>> - the roughly 8 GFlops here are not too good. My CPU is a Core i7 at
>>> 1.66 GHz. So x4 (because of float) and x2 (pipelining of addps and mulps) we
>>> should aim at 13.33 GFlops. So we are running here at only 60% of the
>>> theoretical maximum; I think we used to do much better than that.
>>>
>>> So let me ask Gael and Keir:
>>> * Keir: what do you get on this benchmark? How did you get this result
>>> where ATLAS outperformed us by 30%?
>>> * Gael: suppose I want to get deeper into this, where do I start?
>>>
>>> Cheers,
>>> Benoit
>>
>
>