[eigen] Re: SGEMM benchmark result against ATLAS

2010/8/23 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>

Confirming that Eigen apparently did correctly detect my cache parameters:

std::cout << "l1: " << Eigen::l1CacheSize() << std::endl;
std::cout << "l2: " << Eigen::l2CacheSize() << std::endl;

l1: 32768
l2: 6291456

Indeed, I have 6 MB of L2 cache.

Benoit

2010/8/23 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>

Hi,

Hearing from Keir that he saw untuned ATLAS outperform us by a 30% margin, which would be very unusual, I ran our benchBlasGemm a bit. By the way, I updated it to make it compile, which involved removing the eigen_..._normal path, which didn't look useful (?); I hope that's OK. Also, it was missing an extern "C" around the cblas #include.

So I installed the most optimized ATLAS package that I could on Fedora, built with SSE3.

I compiled our benchmark with:

cd eigen/bench/
g++ -O3 -msse3 -I.. -L /usr/lib64/atlas/ benchBlasGemm.cpp -o benchBlasGemm -lrt -lcblas

And ran it on some 4096x4096 matrices:

[bjacob@cahouette bench]$ ./benchBlasGemm 4096
4096 x 4096 x 4096
cblas: 8.73982 (7.862 GFlops/s)
eigen : 8.9491 (7.678 GFlops/s)
[bjacob@cahouette bench]$ ./benchBlasGemm 4096
4096 x 4096 x 4096
cblas: 8.51913 (8.066 GFlops/s)
eigen : 8.42922 (8.152 GFlops/s)

So _my_ results show Eigen3 and ATLAS running at roughly the same speed, albeit with considerable run-to-run variability.

This is still perplexing for 2 reasons:
- we used to beat ATLAS by a wide margin.
- the roughly 8 GFlops here is not too good. My CPU is a Core i7 at 1.66 GHz. With x4 (4 floats per SSE register) and x2 (pipelining of addps and mulps), we should aim at 13.33 GFlops. So we are running here at only 60% of the theoretical maximum; I think we used to do much better than that.

So let me ask Gael and Keir:
* Keir: what do you get on this benchmark? How did you get this result where ATLAS outperformed us by 30%?
* Gael: suppose I want to get deeper into this, where do I start?

Cheers,
Benoit