Re: [eigen] Re: SGEMM benchmark result against ATLAS
2010/8/24 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
2010/8/24 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>
I doubt your CPU runs at 1.66 GHz.
It is actually 1.60 GHz plus "turbo boost" as Rohit points out.
It is certainly faster than that
since this (very old) benchmark actually returns the number of madd
per second, so you can multiply your result by 2 to get the actual
number of floating point operations.
Ah! I see.
I also recommend the newer bench :
g++ -O2 -DNDEBUG -I.. -L /usr/lib64/atlas/ bench_gemm.cpp -o
bench_gemm -lrt -lf77blas -DHAVE_BLAS && ./bench_gemm
[bjacob@cahouette bench]$ g++ -O2 -DNDEBUG -I.. -L /usr/lib64/atlas/ bench_gemm.cpp -o bench_gemm -lrt -lf77blas -DHAVE_BLAS && ./bench_gemm
L1 cache size = 32 KB
L2/L3 cache size = 6144 KB
Register blocking = 8 x 4
Matrix sizes = 2048x2048 * 2048x2048
blocking size (mc x kc) = 1536 x 256
blas cpu 1.08927s 15.7719 GFLOPS (2.24222s)
blas real 1.09075s 15.7505 GFLOPS (2.24748s)
eigen cpu 1.04802s 16.3926 GFLOPS (2.19929s)
eigen real 1.04958s 16.3684 GFLOPS (2.2025s)
[bjacob@cahouette bench]$
[bjacob@cahouette bench]$ g++ -O2 -DNDEBUG -I.. -L /usr/lib64/atlas/ bench_gemm.cpp -o bench_gemm -lrt -lf77blas -DHAVE_BLAS && ./bench_gemm
L1 cache size = 32 KB
L2/L3 cache size = 6144 KB
Register blocking = 8 x 4
Matrix sizes = 2048x2048 * 2048x2048
blocking size (mc x kc) = 1536 x 256
blas cpu 1.04808s 16.3918 GFLOPS (2.18119s)
blas real 1.04965s 16.3672 GFLOPS (2.18427s)
eigen cpu 1.09635s 15.6701 GFLOPS (2.19431s)
eigen real 1.09773s 15.6504 GFLOPS (2.19706s)
[bjacob@cahouette bench]$ g++ -O2 -DNDEBUG -I.. -L /usr/lib64/atlas/ bench_gemm.cpp -o bench_gemm -lrt -lf77blas -DHAVE_BLAS && ./bench_gemm
L1 cache size = 32 KB
L2/L3 cache size = 6144 KB
Register blocking = 8 x 4
Matrix sizes = 2048x2048 * 2048x2048
blocking size (mc x kc) = 1536 x 256
blas cpu 1.07202s 16.0258 GFLOPS (2.16692s)
blas real 1.07366s 16.0012 GFLOPS (2.17015s)
eigen cpu 1.10482s 15.5499 GFLOPS (2.21542s)
eigen real 1.10639s 15.5279 GFLOPS (2.21853s)
As Rohit mentions, I have to disable turbo boost if I want to measure efficiency.
OK, so my only way to disable turbo boost was to set scaling_max_frequency to something other than the baseline maximum. So I set it to 933000, i.e. 933 MHz, which is the minimum value.
I got:
[bjacob@cahouette bench]$ g++ -O2 -DNDEBUG -I.. -L /usr/lib64/atlas/ bench_gemm.cpp -o bench_gemm -lrt -lf77blas -DHAVE_BLAS && ./bench_gemm
L1 cache size = 32 KB
L2/L3 cache size = 6144 KB
Register blocking = 8 x 4
Matrix sizes = 2048x2048 * 2048x2048
blocking size (mc x kc) = 1536 x 256
blas cpu 3.16852s 5.42204 GFLOPS (6.34579s)
blas real 3.17619s 5.40895 GFLOPS (6.36257s)
eigen cpu 3.19321s 5.38012 GFLOPS (6.40601s)
eigen real 3.20105s 5.36695 GFLOPS (6.43292s)
[bjacob@cahouette bench]$ g++ -O2 -DNDEBUG -I.. -L /usr/lib64/atlas/ bench_gemm.cpp -o bench_gemm -lrt -lf77blas -DHAVE_BLAS && ./bench_gemm
L1 cache size = 32 KB
L2/L3 cache size = 6144 KB
Register blocking = 8 x 4
Matrix sizes = 2048x2048 * 2048x2048
blocking size (mc x kc) = 1536 x 256
blas cpu 3.14275s 5.4665 GFLOPS (6.32277s)
blas real 3.16762s 5.42359 GFLOPS (6.36551s)
eigen cpu 3.14266s 5.46666 GFLOPS (6.40601s)
eigen real 3.15602s 5.44353 GFLOPS (6.4333s)
So let's compute: 5.46 / (0.933 * 8) = 0.73, so I am running at 73% of my CPU's theoretical maximum if we consider it can do 8 flops per cycle.
Isn't that less than what you had previously measured?
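For anyone who wants to redo this arithmetic on their own numbers, here is a minimal sketch. The function names are hypothetical (they are not part of Eigen or bench_gemm), the 8 flops per cycle is the same assumption as above, and the flop count assumes the usual 2*n^3 for a square n x n product, which is consistent with the GFLOPS figures printed by the runs above.

```cpp
#include <cassert>

// Flops in an n x n times n x n product: n^3 multiply-adds,
// each counted as 2 floating point operations.
double gemm_flops(double n) {
    return 2.0 * n * n * n;
}

// Achieved rate in GFLOPS for one product that took `seconds`.
double achieved_gflops(double n, double seconds) {
    assert(seconds > 0.0);
    return gemm_flops(n) / seconds / 1e9;
}

// Theoretical peak in GFLOPS for a single core.
// `flops_per_cycle` is an assumption about the CPU (8 in this thread).
double peak_gflops(double clock_ghz, double flops_per_cycle) {
    return clock_ghz * flops_per_cycle;
}

// Fraction of the theoretical peak actually reached (1.0 == 100%).
double efficiency(double gflops, double clock_ghz, double flops_per_cycle) {
    assert(clock_ghz > 0.0 && flops_per_cycle > 0.0);
    return gflops / peak_gflops(clock_ghz, flops_per_cycle);
}
```

With the last "eigen cpu" line above, efficiency(achieved_gflops(2048, 3.14266), 0.933, 8) comes out at roughly 0.73. The result is only as good as the clock value passed in, which is why the frequency had to be pinned first.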
Benoit
which reports more information about cache/block size and allows you to
tweak the cache parameters.
./bench_gemm h
to get help.
In your case the L3 cache size is really what matters.
gael.
On Tue, Aug 24, 2010 at 5:58 AM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> Ah! Last spam I promise.
>
> This 6 MB is actually a L3 cache, not L2 cache, for the record.
>
> My cpu has only 256 kB of L2 cache (per core) and 6 MB of L3 cache (shared
> among all cores).
>
> By contrast, Gael, all of your Core 2's cache is L2, there's no L3.
> (Information from Wikipedia).
>
> So, I'm wondering if the problem is that my cache is not as fast as Eigen
> expects.
>
> Benoit
>
> 2010/8/23 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
>>
>> Confirming that Eigen apparently did correctly detect my cache parameters:
>>
>> std::cout << "l1: " << Eigen::l1CacheSize() << std::endl;
>> std::cout << "l2: " << Eigen::l2CacheSize() << std::endl;
>>
>> l1: 32768
>> l2: 6291456
>>
>> Indeed, I have 6 MB of L2 cache.
>>
>> Benoit
>>
>> 2010/8/23 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
>>>
>>> Hi,
>>>
>>> Hearing from Keir that he saw untuned ATLAS outperform us by a 30%
>>> margin, which would be very unusual, I ran our benchBlasGemm a bit. By the
>>> way, I updated it to make it compile, which involved removing the
>>> eigen_..._normal path which didn't look useful (?), hope it's OK. Also, it
>>> was missing an extern "C" around the cblas #include.
>>>
>>> So I installed the most optimized ATLAS package that I could on Fedora,
>>> built with SSE3.
>>>
>>> I compiled our benchmark with:
>>>
>>> cd eigen/bench/
>>> g++ -O3 -msse3 -I.. -L /usr/lib64/atlas/ benchBlasGemm.cpp -o
>>> benchBlasGemm -lrt -lcblas
>>>
>>> And ran it on some 4096x4096 matrices:
>>>
>>> [bjacob@cahouette bench]$ ./benchBlasGemm 4096
>>> 4096 x 4096 x 4096
>>> cblas: 8.73982 (7.862 GFlops/s)
>>> eigen : 8.9491 (7.678 GFlops/s)
>>> [bjacob@cahouette bench]$ ./benchBlasGemm 4096
>>> 4096 x 4096 x 4096
>>> cblas: 8.51913 (8.066 GFlops/s)
>>> eigen : 8.42922 (8.152 GFlops/s)
>>>
>>> So _my_ results show Eigen3 and ATLAS running at the same speed roughly,
>>> albeit with a great variability.
>>>
>>> This is still perplexing for 2 reasons:
>>> - we used to beat ATLAS by a wide margin.
>>> - the roughly 8 GFlops here are not too good. My CPU is a Core i7 at
>>> 1.66 GHz. So x4 (because of float) and x2 (pipelining of addps and mulps) we
>>> should aim at 13.33 GFlops. So we are running here at only 60% of the
>>> theoretical maximum; I think we used to do much better than that.
>>>
>>> So let me ask Gael and Keir:
>>> * Keir: what do you get on this benchmark? How did you get this result
>>> where ATLAS outperformed us by 30%?
>>> * Gael: suppose I want to get deeper into this, where do I start?
>>>
>>> Cheers,
>>> Benoit
>>
>
>