Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]

To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang
From: Christoph Hertzberg <chtz@xxxxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 8 Feb 2018 16:23:51 +0100

Clang and GCC seem(ed) to have some issues with FMA as well. We havesome inline-assembly inside the corresponding PacketMath header:


https://bitbucket.org/eigen/eigen/src/2355b229/Eigen/src/Core/arch/AVX/PacketMath.h?at=default&fileviewer=file-view-default#PacketMath.h-160


Christoph

On 2018-02-08 15:35, Oleg Shirokobrod wrote:

Hi Patrik,

Have a look at this link
https://developercommunity.visualstudio.com/content/problem/107145/inefficient-code-generation-for-fma-instructions.html

Best regards,

Oleg Shirokobrod


On Thu, Feb 8, 2018 at 3:27 PM, Patrik Huber <patrikhuber@xxxxxxxxx> wrote:

Hi Gael,

Thanks for the reply.
Information on the topic of MSVC and FMA seems a bit scarce. But this blog
post says that with /arch:AVX2, " The compiler will generate code that
includes AVX2 and FMA instructions. "(https://blogs.msdn.
microsoft.com/vcblog/2014/02/28/avx2-support-in-visual-studio-c-compiler/).
So I think that at least concerning the flags, the compiler should emit FMA
instructions, if it can.
Since the difference I'm seeing is around 1.5+, it is indeed highly likely
that you are correct and the MSVC code is so much slower because it doesn't
emit FMA instructions, compared to gcc & clang. But if the compiler flags
are set - why does it not emit FMA instructions?
Is there an FMA code path for MSVC in Eigen and are people in general
seeing MSVC using FMA when using Eigen?

Thank you and best wishes,

Patrik

On 8 February 2018 at 12:40, Gael Guennebaud <gael.guennebaud@xxxxxxxxx>
wrote:

Hi,

I did not read carefully your email, but it seems that on the MSVC build
you are missing FMA. Indeed, Compared to AVX, AVX2 does not bring any gain
for matrix-matrix multiply, only FMA does (usually around 1.5). In
contrast, with gcc/clang -march=native activate all supported instruction
sets, including FMA on recent CPUs.

gael


On Wed, Feb 7, 2018 at 3:30 PM, Patrik Huber <patrikhuber@xxxxxxxxx>
wrote:

Hello,

I noticed that code I'm using is around 2x slower on VS2017 (15.5.5 and
15.6.0 Preview) than on g++-7 and clang-6. After some digging, I found that
it is down to the matrix multiplication with Eigen.
The simple benchmark (see below) tests matrix multiplication with
various sizes m x n * n x p where m, n, p are between 1 and 2048, and MSVC
is consistently around 1.5-2x slower than g++ and clang, which is quite
huge.

Here are some examples. I'm of course using optimised builds in both
cases:

cl.exe gemm_test.cpp -I 3rdparty\eigen /EHsc /std:c++17 /arch:AVX2 /O2
/Ob2 /nologo

1124 1215 1465
col major (checksum: 0) elapsed_ms: 971
row major (checksum: 0) elapsed_ms: 976
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 1771
row major (checksum: 0) elapsed_ms: 1778
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 819
row major (checksum: 0) elapsed_ms: 834
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 668
row major (checksum: 0) elapsed_ms: 666

And gcc:
g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=native -O3 -o
gcc7_gemm_test

1124 1215 1465
col major (checksum: 0) elapsed_ms: 696
row major (checksum: 0) elapsed_ms: 706
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 1294
row major (checksum: 0) elapsed_ms: 1326
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 425
row major (checksum: 0) elapsed_ms: 418
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 321
row major (checksum: 0) elapsed_ms: 332

I fiddled around quite a lot with the MSVC flags but no other flag made
anything faster.

My CPU is an i7-7700HQ with AVX2.
Now interestingly, I've run the same benchmark on an older i5-3550,
which has AVX, but not AVX2.
The run time on MSVC is nearly identical.
But now g++ (5.4) (again with -march=native) is nearly the same speed as
MSVC:

1124 1215 1465
col major (checksum: 0) elapsed_ms: 946
row major (checksum: 0) elapsed_ms: 944
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 1798
row major (checksum: 0) elapsed_ms: 1816
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 687
row major (checksum: 0) elapsed_ms: 692
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 535
row major (checksum: 0) elapsed_ms: 551

This sort-of looks to me as if the MSVC optimiser cannot make use of
AVX2, if it is available on the CPU. It's just as slow as only with AVX,
while g++ and clang can really make use of AVX2 and get a 1.5-2x speed-up.

Interestingly if I use g++-7 on the i5, I'm getting extremely bad
results:
1124 1215 1465
col major (checksum: 0) elapsed_ms: 2007
row major (checksum: 0) elapsed_ms: 2019
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 3941
row major (checksum: 0) elapsed_ms: 3923
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 1625
row major (checksum: 0) elapsed_ms: 1624
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 1276
row major (checksum: 0) elapsed_ms: 1287

I believe this looks like a performance regression in g++-7. So I don't
think this is relevant to the problem I'm seeing. I am trying to report
this to the GCC bugtracker but they make signing up extremely hard.


If I use MSVC without the /arch:AVX2 switch, and g++5 with -march=core2,
then I am getting identical results. So it looks like with SSE3, MSVC and
g++5 are on par, but with AVX2, g++ and clang just blow away MSVC.
Again I'm seeing the same performance regression with g++7 and
-march=core2, it's around 50% slower than g++-5.

The Eigen version I used is 3.3.4.
Btw I realise the benchmark is a bit crude (and might better be done
with something like Google Benchmark), but I'm getting very consistent
results.


So I guess my main question is:
Is there anything that the Eigen developers can do, either to enable
AVX2 on MSVC, or to help the MSVC optimiser? Or is it purely a MSVC
optimiser problem?

FYI I reported this to MS: https://developercommunity
.visualstudio.com/content/problem/194955/vs-produces-code-th
at-is-15-2x-slower-than-gcc-and.html (with code attached, but the code
is not visible to non-MS-employees).

If you are interested in more background information and more
benchmarks, the whole thing originated here: https://github.com/Dobia
sd/frugally-deep/issues/9 (but it's quite a lengthy thread).

Thank you and best wishes,

Patrik


Benchmark code:

// gemm_test.cpp
#include <array>
#include <chrono>
#include <iostream>
#include <random>
#include <Eigen/Dense>

using RowMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic,
Eigen::Dynamic, Eigen::RowMajor>;
using ColMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic,
Eigen::Dynamic, Eigen::ColMajor>;

template <typename Mat>
void run_test(const std::string& name, int s1, int s2, int s3)
{
     using namespace std::chrono;
     float checksum = 0.0f; // to prevent compiler from optimizing
everything away
     const auto start_time_ns = high_resolution_clock::now().t
ime_since_epoch().count();
     for (size_t i = 0; i < 10; ++i)
     {
         Mat a_rm(s1, s2);
         Mat b_rm(s2, s3);
         const auto c_rm = a_rm * b_rm;
         checksum += c_rm(0, 0);
     }
     const auto end_time_ns = high_resolution_clock::now().t
ime_since_epoch().count();
     const auto elapsed_ms = (end_time_ns - start_time_ns) / 1000000;
     std::cout << name << " (checksum: " << checksum << ") elapsed_ms: "
<< elapsed_ms << std::endl;
}
int main()
{
     //std::random_device rd;
     //std::mt19937 gen(0);
     //std::uniform_int_distribution<> dis(1, 2048);
     std::vector<int> vals = { 1124, 1215, 1465, 1730, 1235, 1758, 1116,
1736, 868, 1278, 1323, 788 };
     for (std::size_t i = 0; i < 12; ++i)
     {
         int s1 = vals[i++];//dis(gen);
         int s2 = vals[i++];//dis(gen);
         int s3 = vals[i];//dis(gen);
         std::cout << s1 << " " << s2 << " " << s3 << std::endl;
         run_test<ColMajorMatrixXf>("col major", s1, s2, s3);
         run_test<RowMajorMatrixXf>("row major", s1, s2, s3);
         std::cout << "--------" << std::endl;
     }
     return 0;
}


--
Dr. Patrik Huber
Centre for Vision, Speech and Signal Processing
University of Surrey
Guildford, Surrey GU2 7XH
United Kingdom

Web: www.patrikhuber.ch
Mobile: +44 (0)7482 633 934 <+44%207482%20633934>



--
Dr. Patrik Huber
Centre for Vision, Speech and Signal Processing
University of Surrey
Guildford, Surrey GU2 7XH
United Kingdom

Web: www.patrikhuber.ch
Mobile: +44 (0)7482 633 934 <+44%207482%20633934>


--
 Dr.-Ing. Christoph Hertzberg

 Besuchsadresse der Nebengeschäftsstelle:
 DFKI GmbH
 Robotics Innovation Center
 Robert-Hooke-Straße 5
 28359 Bremen, Germany

 Postadresse der Hauptgeschäftsstelle Standort Bremen:
 DFKI GmbH
 Robotics Innovation Center
 Robert-Hooke-Straße 1
 28359 Bremen, Germany

 Tel.:     +49 421 178 45-4021
 Zentrale: +49 421 178 45-0
 E-Mail:   christoph.hertzberg@xxxxxxx

 Weitere Informationen: http://www.dfki.de/robotik
 -----------------------------------------------------------------------
 Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
 Firmensitz: Trippstadter Straße 122, D-67663 Kaiserslautern
 Geschaeftsfuehrung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
 (Vorsitzender) Dr. Walter Olthoff
 Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
 Amtsgericht Kaiserslautern, HRB 2313
 Sitz der Gesellschaft: Kaiserslautern (HRB 2313)
 USt-Id.Nr.:    DE 148646973
 Steuernummer:  19/672/50006
 -----------------------------------------------------------------------

References:
- [eigen] Matrix multiplication much slower on MSVC than on g++/clang
  - From: Patrik Huber
- Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang
  - From: Gael Guennebaud
- Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang
  - From: Patrik Huber
- Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang
  - From: Oleg Shirokobrod

Messages sorted by: [ date | thread ]
Prev by Date: Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang
Next by Date: Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang
Previous by thread: Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang
Next by thread: Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang

Mail converted by MHonArc 2.6.19+

http://listengine.tuxfamily.org/