[eigen] Re: SGEMM benchmark result against ATLAS



2010/8/24 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>> 4. Could you also please time dgemm?
>
> Will try when I find time..!

Sorry it took me so long! Very busy time at Mozilla.

So I benchmarked DGEMM with the following command line:

[bjacob@cahouette ~]$ cd eigen/bench/
[bjacob@cahouette bench]$ g++ -O2 -DNDEBUG -I.. -L /usr/lib64/atlas/ \
    bench_gemm.cpp -o bench_gemm -lrt -lf77blas -DHAVE_BLAS \
    -DSCALAR=double && ./bench_gemm
L1 cache size     = 32 KB
L2/L3 cache size  = 6144 KB
Register blocking = 4 x 4
Matrix sizes = 2048x2048 * 2048x2048
blocking size (mc x kc) = 768 x 256
blas  cpu         5.64829s      3.04161 GFLOPS  (11.3189s)
blas  real        5.65678s      3.03704 GFLOPS  (11.3361s)
eigen cpu         6.52074s      2.63465 GFLOPS  (13.0519s)
eigen real        6.53064s      2.63066 GFLOPS  (13.0717s)
[bjacob@cahouette bench]$ ./bench_gemm
L1 cache size     = 32 KB
L2/L3 cache size  = 6144 KB
Register blocking = 4 x 4
Matrix sizes = 2048x2048 * 2048x2048
blocking size (mc x kc) = 768 x 256
blas  cpu         5.583s        3.07717 GFLOPS  (11.2337s)
blas  real        5.59178s      3.07235 GFLOPS  (11.2508s)
eigen cpu         6.48265s      2.65013 GFLOPS  (13.0104s)
eigen real        6.50728s      2.6401 GFLOPS   (13.0516s)


This shows that, unlike with floats, with doubles we actually run a
bit slower than ATLAS on my system. Comparing the best CPU times
(6.48s for Eigen vs 5.58s for ATLAS), we take about 16% longer.
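
The figures above can be re-derived from the raw timings. The sketch below assumes the benchmark counts 2*n^3 flops for an n x n product (this is not stated in the output, but it reproduces the reported GFLOPS), and the function names are made up for illustration; they are not part of bench_gemm.

```python
def gemm_gflops(n, seconds):
    """GFLOPS for one n x n by n x n product, assuming 2*n^3 flops
    (one multiply and one add per inner-product term)."""
    return 2.0 * n**3 / seconds / 1e9


def extra_time_pct(t_ref, t_other):
    """How much longer t_other is than t_ref, in percent."""
    return 100.0 * (t_other - t_ref) / t_ref


if __name__ == "__main__":
    best_blas_cpu = 5.583     # seconds, best "blas cpu" line above
    best_eigen_cpu = 6.48265  # seconds, best "eigen cpu" line above
    print(gemm_gflops(2048, best_blas_cpu))
    print(gemm_gflops(2048, best_eigen_cpu))
    print(extra_time_pct(best_blas_cpu, best_eigen_cpu))
```

Note that the 16% is extra run time relative to ATLAS; expressed as throughput, the gap between 2.65 and 3.08 GFLOPS is slightly smaller.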

In terms of efficiency however, we do quite OK. The above results were
again with my CPU frequency locked at 933 MHz. With doubles my CPU
should do 4 flops per cycle (2 scalars per packet, times 2 instructions,
mulpd and addpd, per cycle). So the maximum speed is 4 * 0.933 ==
3.73 GFLOPS.

So ATLAS's figure of 3.077 GFLOPS means 82% efficiency and Eigen3's
figure of 2.65 GFLOPS means 71% efficiency. Not bad when you consider
that doubles are harder to run fast than floats: since the matrix
blocks are smaller, there is more overhead.
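
The peak and efficiency arithmetic can be written out as a short sketch. The helper names are hypothetical, and the model (one packed multiply plus one packed add issued per cycle) is simply the one described in this mail, not a measured property of the CPU.

```python
def peak_gflops(scalars_per_packet, instructions_per_cycle, ghz):
    """Theoretical peak: flops per cycle times clock rate in GHz."""
    return scalars_per_packet * instructions_per_cycle * ghz


def efficiency_pct(measured_gflops, peak):
    """Measured throughput as a percentage of the theoretical peak."""
    return 100.0 * measured_gflops / peak


if __name__ == "__main__":
    # Double precision with SSE: 2 doubles per packet, mulpd + addpd per cycle,
    # clock locked at 0.933 GHz as in the runs above.
    peak = peak_gflops(2, 2, 0.933)
    print(peak)
    print(efficiency_pct(3.077, peak))  # ATLAS
    print(efficiency_pct(2.65, peak))   # Eigen3
```

Passing 4 scalars per packet instead of 2 gives the corresponding single-precision peak under the same model.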

To summarize the efficiencies I recorded on my Core i7:
  Eigen / float  : 73%
  Eigen / double : 71%
  ATLAS / float  : 73%
  ATLAS / double : 82%

So the only "anomaly" is that ATLAS is amazingly efficient with doubles (!)

Benoit

>
> Benoit
>
>> Thanks
>> Franco
>>
>> On Tue, Aug 24, 2010 at 11:07 AM, Keir Mierle <mierle@xxxxxxxxx> wrote:
>>>
>>> A question for Benoit: Is this running the threaded versions of Eigen and ATLAS?
>>> Keir
>>>
>>> On Tue, Aug 24, 2010 at 10:52 AM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
>>> wrote:
>>>>
>>>> I too have atlas 3.8.3, and am using gcc 4.4 on linux x86-64. So I
>>>> can't really conclude anything, sorry.
>>>> Benoit
>>>>
>>>> 2010/8/24 Francesco Callari <fgcallari@xxxxxxxxx>:
>>>> > Hmmm, I think this is the info I can share:
>>>> > ATLAS build configuration.
>>>> > ====================
>>>> > ATLAS v3.8.3
>>>> > GCC 4.<redacted>
>>>> > GLIBC 2.<redacted>
>>>> > Configuration flags: 64-bit build using the chosen gcc for everything
>>>> > compiler.
>>>> > cc=${TOP}/bin/gcc
>>>> > f77=${TOP}/bin/gfortran
>>>> > mhz=<redacted>
>>>> >
>>>> > ./configure \
>>>> >     -C xc ${cc} -C gc ${cc} -C ic ${cc} -C dm ${cc} -C sm ${cc} \
>>>> >     -C dk ${cc} -C sk ${cc} \
>>>> >     -C if ${f77} \
>>>> >     -b 64 \
>>>> >     -D c -DPentiumCPS=${mhz}
>>>> >
>>>> >
>>>> >
>>>> > On Tue, Aug 24, 2010 at 10:39 AM, Franco Callari <fgc@xxxxxxxxxx>
>>>> > wrote:
>>>> >>
>>>> >>
>>>> >> ---------- Forwarded message ----------
>>>> >> From: Keir Mierle <mierle@xxxxxxxxx>
>>>> >> Date: Tue, Aug 24, 2010 at 1:19 AM
>>>> >> Subject: Fwd: SGEMM benchmark result against ATLAS
>>>> >>
>>>> >>
>>>> >> Hey, care to forward any info about how you configured ATLAS?
>>>> >>
>>>> >> ---------- Forwarded message ----------
>>>> >> From: Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
>>>> >> Date: Mon, Aug 23, 2010 at 8:45 PM
>>>> >> Subject: SGEMM benchmark result against ATLAS
>>>> >> To: eigen <eigen@xxxxxxxxxxxxxxxxxxx>
>>>> >> Cc: Keir Mierle <mierle@xxxxxxxxx>, Gael Guennebaud
>>>> >> <gael.guennebaud@xxxxxxxxx>
>>>> >>
>>>> >>
>>>> >> Hi,
>>>> >>
>>>> >> Hearing from Keir that he saw untuned ATLAS outperform us by a 30%
>>>> >> margin, which would be very unusual, I ran our benchBlasGemm a bit.
>>>> >> By the way, I updated it to make it compile, which involved removing
>>>> >> the eigen_..._normal path which didn't look useful (?), hope it's OK.
>>>> >> Also, it was missing an extern "C" around the cblas #include.
>>>> >>
>>>> >> So I installed the most optimized ATLAS package that I could on
>>>> >> Fedora, built with SSE3.
>>>> >>
>>>> >> I compiled our benchmark with:
>>>> >>
>>>> >> cd eigen/bench/
>>>> >> g++ -O3 -msse3 -I.. -L /usr/lib64/atlas/ benchBlasGemm.cpp  -o
>>>> >> benchBlasGemm -lrt -lcblas
>>>> >>
>>>> >> And ran it on some 4096x4096 matrices:
>>>> >>
>>>> >> [bjacob@cahouette bench]$ ./benchBlasGemm 4096
>>>> >> 4096 x 4096 x 4096
>>>> >> cblas: 8.73982 (7.862 GFlops/s)
>>>> >> eigen : 8.9491 (7.678 GFlops/s)
>>>> >> [bjacob@cahouette bench]$ ./benchBlasGemm 4096
>>>> >> 4096 x 4096 x 4096
>>>> >> cblas: 8.51913 (8.066 GFlops/s)
>>>> >> eigen : 8.42922 (8.152 GFlops/s)
>>>> >>
>>>> >> So _my_ results show Eigen3 and ATLAS running at roughly the same
>>>> >> speed, albeit with great variability.
>>>> >>
>>>> >> This is still perplexing for 2 reasons:
>>>> >>  - we used to beat ATLAS by a wide margin.
>>>> >>  - the roughly 8 GFlops here are not too good. My CPU is a Core i7 at
>>>> >>    1.66 GHz. So x4 (because of float) and x2 (pipelining of addps and
>>>> >>    mulps) we should aim at 13.33 GFlops. So we are running here at
>>>> >>    only 60% of the theoretical maximum; I think we used to do much
>>>> >>    better than that.
>>>> >>
>>>> >> So let me ask Gael and Keir:
>>>> >> * Keir: what do you get on this benchmark? How did you get this result
>>>> >> where ATLAS outperformed us by 30%?
>>>> >> * Gael: suppose I want to get deeper into this, where do I start?
>>>> >>
>>>> >> Cheers,
>>>> >> Benoit
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Francesco Callari <fgc@xxxxxxxxxx>
>>>> >>
>>>> >>             EC67 BEBE 62AC 8415 7591  2B12 A6CD D5EE D8CB D0ED
>>>> >>
>>>> >> Violence is the last refuge of the incompetent  (I. Asimov)
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Franco Callari <fgcallari@xxxxxxxxx>
>>>> >
>>>> >             EC67 BEBE 62AC 8415 7591  2B12 A6CD D5EE D8CB D0ED
>>>> >
>>>> > I am not bound to win, but I am bound to be true. I am not bound to
>>>> > succeed,
>>>> > but I am bound to live by the light that I have. (Abraham Lincoln)
>>>> >
>>>
>>
>>
>>
>>
>


