Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang

On Thu, Feb 8, 2018 at 3:27 PM, Patrik Huber <patrikhuber@xxxxxxxxx> wrote:

Hi Gael,

Thanks for the reply.
Information on the topic of MSVC and FMA seems a bit scarce, but this blog post says that with /arch:AVX2, "The compiler will generate code that includes AVX2 and FMA instructions." (https://blogs.msdn.microsoft.com/vcblog/2014/02/28/avx2-support-in-visual-studio-c-compiler/). So, at least as far as the flags are concerned, the compiler should emit FMA instructions if it can.
Since the difference I'm seeing is a factor of around 1.5 or more, it is indeed highly likely that you are correct and that the MSVC code is so much slower than the gcc and clang code because it doesn't contain FMA instructions. But if the compiler flags are set, why does it not emit them?
Is there an FMA code path for MSVC in Eigen, and are people in general seeing MSVC use FMA with Eigen?

Thank you and best wishes,

Patrik

On 8 February 2018 at 12:40, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:
Hi,

I did not read your email carefully, but it seems that you are missing FMA in the MSVC build. Indeed, compared to AVX, AVX2 does not bring any gain for matrix-matrix multiplication; only FMA does (usually a factor of around 1.5). In contrast, with gcc/clang, -march=native activates all supported instruction sets, including FMA on recent CPUs.

gael

On Wed, Feb 7, 2018 at 3:30 PM, Patrik Huber <patrikhuber@xxxxxxxxx> wrote:
Hello,

I noticed that code I'm using is around 2x slower on VS2017 (15.5.5 and 15.6.0 Preview) than on g++-7 and clang-6. After some digging, I found that it is down to the matrix multiplication with Eigen.
The simple benchmark (see below) tests matrix multiplication with various sizes (m x n) * (n x p), where m, n, p are between 1 and 2048, and MSVC is consistently around 1.5-2x slower than g++ and clang, which is a large gap.

Here are some examples. I'm of course using optimised builds in both cases:

cl.exe gemm_test.cpp -I 3rdparty\eigen /EHsc /std:c++17 /arch:AVX2 /O2 /Ob2 /nologo

1124 1215 1465
col major (checksum: 0) elapsed_ms: 971
row major (checksum: 0) elapsed_ms: 976
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 1771
row major (checksum: 0) elapsed_ms: 1778
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 819
row major (checksum: 0) elapsed_ms: 834
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 668
row major (checksum: 0) elapsed_ms: 666

And gcc:
g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=native -O3 -o gcc7_gemm_test

1124 1215 1465
col major (checksum: 0) elapsed_ms: 696
row major (checksum: 0) elapsed_ms: 706
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 1294
row major (checksum: 0) elapsed_ms: 1326
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 425
row major (checksum: 0) elapsed_ms: 418
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 321
row major (checksum: 0) elapsed_ms: 332

I fiddled around quite a lot with the MSVC flags but no other flag made anything faster.

My CPU is an i7-7700HQ with AVX2.
Now interestingly, I've run the same benchmark on an older i5-3550, which has AVX, but not AVX2.
The run time on MSVC is nearly identical.
But now g++ (5.4) (again with -march=native) is nearly the same speed as MSVC:

1124 1215 1465
col major (checksum: 0) elapsed_ms: 946
row major (checksum: 0) elapsed_ms: 944
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 1798
row major (checksum: 0) elapsed_ms: 1816
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 687
row major (checksum: 0) elapsed_ms: 692
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 535
row major (checksum: 0) elapsed_ms: 551

To me this looks as if the MSVC optimiser cannot make use of AVX2 even when it is available on the CPU: the MSVC build is just as slow as on the AVX-only machine, while g++ and clang really benefit from AVX2 and get a 1.5-2x speed-up.

Interestingly, if I use g++-7 on the i5, I get extremely bad results:
1124 1215 1465
col major (checksum: 0) elapsed_ms: 2007
row major (checksum: 0) elapsed_ms: 2019
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 3941
row major (checksum: 0) elapsed_ms: 3923
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 1625
row major (checksum: 0) elapsed_ms: 1624
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 1276
row major (checksum: 0) elapsed_ms: 1287

I believe this is a performance regression in g++-7, so I don't think it is relevant to the problem I'm seeing. I am trying to report it to the GCC bug tracker, but they make signing up extremely hard.

If I use MSVC without the /arch:AVX2 switch, and g++5 with -march=core2, then I get identical results. So it looks like with SSE3, MSVC and g++5 are on par, but with AVX2, g++ and clang just blow away MSVC.
Again, I'm seeing the same performance regression with g++7 and -march=core2: it's around 50% slower than g++-5.

The Eigen version I used is 3.3.4.
Btw I realise the benchmark is a bit crude (and might better be done with something like Google Benchmark), but I'm getting very consistent results.

So I guess my main question is:
Is there anything that the Eigen developers can do, either to enable AVX2 on MSVC, or to help the MSVC optimiser? Or is it purely an MSVC optimiser problem?

FYI I reported this to MS: https://developercommunity.visualstudio.com/content/problem/194955/vs-produces-code-that-is-15-2x-slower-than-gcc-and.html (with code attached, but the code is not visible to non-MS-employees).

If you are interested in more background information and more benchmarks, the whole thing originated here: https://github.com/Dobiasd/frugally-deep/issues/9 (but it's quite a lengthy thread).

Thank you and best wishes,

Patrik

Benchmark code:

// gemm_test.cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <random>
#include <string>
#include <vector>
#include <Eigen/Dense>

using RowMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
using ColMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>;

// Times 10 products of an (s1 x s2) matrix with an (s2 x s3) matrix.
template <typename Mat>
void run_test(const std::string& name, int s1, int s2, int s3)
{
    using namespace std::chrono;
    float checksum = 0.0f; // to prevent compiler from optimizing everything away
    const auto start_time = high_resolution_clock::now();
    for (std::size_t i = 0; i < 10; ++i)
    {
        // Note: the matrices are not initialised, so their contents
        // (and therefore the checksum) are indeterminate.
        Mat a(s1, s2);
        Mat b(s2, s3);
        // Assign to a concrete matrix type so the product is evaluated here
        // instead of being kept as an expression object by 'auto'.
        const Mat c = a * b;
        checksum += c(0, 0);
    }
    const auto end_time = high_resolution_clock::now();
    // duration_cast avoids assuming that the clock ticks in nanoseconds.
    const auto elapsed_ms = duration_cast<milliseconds>(end_time - start_time).count();
    std::cout << name << " (checksum: " << checksum << ") elapsed_ms: " << elapsed_ms << std::endl;
}

int main()
{
    //std::random_device rd;
    //std::mt19937 gen(0);
    //std::uniform_int_distribution<> dis(1, 2048);
    // Four fixed (s1, s2, s3) triples, stored back to back.
    const std::vector<int> vals = { 1124, 1215, 1465, 1730, 1235, 1758, 1116, 1736, 868, 1278, 1323, 788 };
    for (std::size_t i = 0; i + 2 < vals.size(); i += 3)
    {
        const int s1 = vals[i];     //dis(gen);
        const int s2 = vals[i + 1]; //dis(gen);
        const int s3 = vals[i + 2]; //dis(gen);
        std::cout << s1 << " " << s2 << " " << s3 << std::endl;
        run_test<ColMajorMatrixXf>("col major", s1, s2, s3);
        run_test<RowMajorMatrixXf>("row major", s1, s2, s3);
        std::cout << "--------" << std::endl;
    }
    return 0;
}

--
Dr. Patrik Huber
Centre for Vision, Speech and Signal Processing
University of Surrey
Guildford, Surrey GU2 7XH
United Kingdom

Web: www.patrikhuber.ch
Mobile: +44 (0)7482 633 934
