Hearing from Keir that he saw untuned ATLAS outperform us by a 30% margin, which would be very unusual, I ran our benchBlasGemm a bit. By the way, I updated it to make it compile, which involved removing the eigen_...._normal path, which didn't look useful (?); hope that's OK. It was also missing an extern "C" around the cblas #include.
So I installed the most optimized ATLAS package that I could on Fedora, built with SSE3.
[bjacob@cahouette bench]$ ./benchBlasGemm 4096
4096 x 4096 x 4096
cblas: 8.73982 (7.862 GFlops/s)
eigen : 8.9491 (7.678 GFlops/s)
[bjacob@cahouette bench]$ ./benchBlasGemm 4096
4096 x 4096 x 4096
cblas: 8.51913 (8.066 GFlops/s)
eigen : 8.42922 (8.152 GFlops/s)
So _my_ results show Eigen3 and ATLAS running at roughly the same speed, albeit with a lot of run-to-run variability.
This is still perplexing for 2 reasons:
- we used to beat ATLAS by a wide margin.
- the roughly 8 GFlops here are not too good. My CPU is a Core i7 at 1.66 GHz. So x4 (because of float) and x2 (pipelining of addps and mulps) we should aim at 13.33 GFlops. So we are running here at only 60% of the theoretical maximum; I think we used to do much better than that.
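For anyone who wants to redo the arithmetic on their own machine, here is a small sketch of the estimate above. The function names are made up for illustration, and it assumes a single core with one addps and one mulps issued per cycle, as in the estimate. Note that taking 1.66 GHz literally gives 13.28 GFlops; the 13.33 figure corresponds to a clock of 1.666... GHz.

```python
def peak_gflops(clock_ghz, simd_lanes, ops_per_cycle):
    """Theoretical single-core peak: clock x SIMD lanes x SIMD ops issued per cycle."""
    return clock_ghz * simd_lanes * ops_per_cycle


def efficiency(measured_gflops, peak):
    """Fraction of the theoretical peak actually reached."""
    return measured_gflops / peak


# Values taken from this message: 1.66 GHz clock, 4 floats per SSE register,
# one addps plus one mulps per cycle (assumed, as in the estimate above).
peak = peak_gflops(1.66, 4, 2)

# Measured figures from the second benchmark run.
for label, measured in [("cblas", 8.066), ("eigen", 8.152)]:
    ratio = efficiency(measured, peak)
    print(f"{label}: {measured:.3f} of {peak:.2f} GFlops/s = {ratio:.0%}")
```

Swap in your own clock speed and measured GFlops to see how far you are from the theoretical maximum.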
So let me ask Gael and Keir:
* Keir: what do you get on this benchmark? How did you get this result where ATLAS outperformed us by 30%?
* Gael: suppose I want to get deeper into this, where do I start?