Hi,
I did not read your email carefully, but it seems that on the MSVC
build you are missing FMA. Indeed, compared to AVX, AVX2 does not
bring any gain for matrix-matrix multiplication; only FMA does (usually
around 1.5x). In contrast, with gcc/clang, -march=native activates all
supported instruction sets, including FMA on recent CPUs.
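A quick way to confirm what a given build actually enables is to ask Eigen itself; here is a minimal sketch (the file name is just illustrative), assuming Eigen 3.3's SimdInstructionSetsInUse() helper and the EIGEN_VECTORIZE_FMA macro:

// check_simd.cpp -- prints which SIMD paths Eigen compiled in for this build
#include <iostream>
#include <Eigen/Core>

int main()
{
    // Summary string of the vector instruction sets Eigen was built with.
    std::cout << "Eigen SIMD: " << Eigen::SimdInstructionSetsInUse() << std::endl;
#ifdef EIGEN_VECTORIZE_FMA
    std::cout << "FMA kernels: enabled" << std::endl;
#else
    std::cout << "FMA kernels: not enabled" << std::endl;
#endif
    return 0;
}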
gael
On Wed, Feb 7, 2018 at 3:30 PM, Patrik Huber <patrikhuber@xxxxxxxxx> wrote:
Hello,
I noticed that code I'm using is around 2x slower on VS2017 (15.5.5 and
15.6.0 Preview) than on g++-7 and clang-6. After some digging, I found
that it is down to the matrix multiplication with Eigen.
The simple benchmark (see below) tests matrix multiplication with various
sizes m x n * n x p, where m, n, p are between 1 and 2048, and MSVC is
consistently around 1.5-2x slower than g++ and clang, which is quite huge.
Here are some examples. I'm of course using optimised builds in both cases:
cl.exe gemm_test.cpp -I 3rdparty\eigen /EHsc /std:c++17 /arch:AVX2 /O2 /Ob2 /nologo
1124 1215 1465
col major (checksum: 0) elapsed_ms: 971
row major (checksum: 0) elapsed_ms: 976
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 1771
row major (checksum: 0) elapsed_ms: 1778
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 819
row major (checksum: 0) elapsed_ms: 834
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 668
row major (checksum: 0) elapsed_ms: 666
And gcc:
g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=native -O3 -o gcc7_gemm_test
1124 1215 1465
col major (checksum: 0) elapsed_ms: 696
row major (checksum: 0) elapsed_ms: 706
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 1294
row major (checksum: 0) elapsed_ms: 1326
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 425
row major (checksum: 0) elapsed_ms: 418
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 321
row major (checksum: 0) elapsed_ms: 332
I fiddled around quite a lot with the MSVC flags, but no other flag made
anything faster.
My CPU is an i7-7700HQ with AVX2.
Now interestingly, I've run the same benchmark on an older i5-3550, which
has AVX but not AVX2. The run time on MSVC is nearly identical. But now
g++ (5.4) (again with -march=native) is nearly the same speed as MSVC:
1124 1215 1465
col major (checksum: 0) elapsed_ms: 946
row major (checksum: 0) elapsed_ms: 944
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 1798
row major (checksum: 0) elapsed_ms: 1816
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 687
row major (checksum: 0) elapsed_ms: 692
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 535
row major (checksum: 0) elapsed_ms: 551
This looks to me as if the MSVC optimiser cannot make use of AVX2 even
when it is available on the CPU. It's just as slow as with AVX only,
while g++ and clang can really make use of AVX2 and get a 1.5-2x speed-up.
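As a small, purely illustrative sketch (separate from the benchmark code), printing the relevant predefined macros makes the compiler difference visible; as far as I know, MSVC defines __AVX2__ under /arch:AVX2 but never __FMA__, while gcc/clang with -march=native define both on FMA-capable CPUs:

// simd_macros.cpp -- print which SIMD-related macros this compiler defines
#include <iostream>

int main()
{
#ifdef __AVX2__
    std::cout << "__AVX2__ is defined" << std::endl;
#endif
#ifdef __FMA__
    std::cout << "__FMA__ is defined" << std::endl;
#endif
#if !defined(__AVX2__) && !defined(__FMA__)
    std::cout << "neither __AVX2__ nor __FMA__ is defined" << std::endl;
#endif
    return 0;
}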
Interestingly, if I use g++-7 on the i5, I'm getting extremely bad results:
1124 1215 1465
col major (checksum: 0) elapsed_ms: 2007
row major (checksum: 0) elapsed_ms: 2019
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 3941
row major (checksum: 0) elapsed_ms: 3923
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 1625
row major (checksum: 0) elapsed_ms: 1624
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 1276
row major (checksum: 0) elapsed_ms: 1287
I believe this looks like a performance regression in g++-7, so I don't
think this is relevant to the problem I'm seeing. I am trying to report
this to the GCC bugtracker, but they make signing up extremely hard.
If I use MSVC without the /arch:AVX2 switch, and g++-5 with -march=core2,
then I am getting identical results. So it looks like with SSE3, MSVC and
g++-5 are on par, but with AVX2, g++ and clang just blow away MSVC.
Again, I'm seeing the same performance regression with g++-7 and
-march=core2; it's around 50% slower than g++-5.
The Eigen version I used is 3.3.4.
Btw, I realise the benchmark is a bit crude (and might better be done with
something like Google Benchmark), but I'm getting very consistent results.
So I guess my main question is: Is there anything that the Eigen developers
can do, either to enable AVX2 on MSVC, or to help the MSVC optimiser? Or is
it purely an MSVC optimiser problem?
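(One thing that might be worth trying, though I have not tested it: Eigen 3.3
appears to enable its FMA kernels only when the __FMA__ macro is defined, and
MSVC does not define that macro even with /arch:AVX2, so forcing Eigen's own
switch on the command line might recover the FMA path, e.g. adding
/DEIGEN_VECTORIZE_FMA (or /D__FMA__, which Eigen 3.3 checks for) to the
cl.exe invocation above.)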
FYI I reported this to MS:
https://developercommunity.visualstudio.com/content/problem/194955/vs-produces-code-that-is-15-2x-slower-than-gcc-and.html
(with code attached, but the code is not visible to non-MS-employees).
If you are interested in more background information and more benchmarks,
the whole thing originated here:
https://github.com/Dobiasd/frugally-deep/issues/9 (but it's quite a lengthy
thread).
Thank you and best wishes,
Patrik
Benchmark code:
// gemm_test.cpp
#include <chrono>
#include <iostream>
#include <random>
#include <string>  // std::string is used by run_test
#include <vector>  // std::vector is used in main
#include <Eigen/Dense>
using RowMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
using ColMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>;
template <typename Mat>
void run_test(const std::string& name, int s1, int s2, int s3)
{
    using namespace std::chrono;
    float checksum = 0.0f; // prevents the compiler from optimising everything away
    const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
    for (size_t i = 0; i < 10; ++i)
    {
        // Note: a_rm and b_rm are allocated but not initialised; the loop
        // essentially times the s1 x s2 * s2 x s3 multiplication.
        Mat a_rm(s1, s2);
        Mat b_rm(s2, s3);
        const auto c_rm = a_rm * b_rm;
        // Reading one coefficient forces the product to be evaluated.
        checksum += c_rm(0, 0);
    }
    const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
    const auto elapsed_ms = (end_time_ns - start_time_ns) / 1000000;
    std::cout << name << " (checksum: " << checksum << ") elapsed_ms: " << elapsed_ms << std::endl;
}
int main()
{
    //std::random_device rd;
    //std::mt19937 gen(0);
    //std::uniform_int_distribution<> dis(1, 2048);
    std::vector<int> vals = { 1124, 1215, 1465, 1730, 1235, 1758,
                              1116, 1736, 868, 1278, 1323, 788 };
    for (std::size_t i = 0; i < 12; ++i)
    {
        // Each group of three values gives one m x n * n x p test case.
        int s1 = vals[i++]; //dis(gen);
        int s2 = vals[i++]; //dis(gen);
        int s3 = vals[i];   //dis(gen);
        std::cout << s1 << " " << s2 << " " << s3 << std::endl;
        run_test<ColMajorMatrixXf>("col major", s1, s2, s3);
        run_test<RowMajorMatrixXf>("row major", s1, s2, s3);
        std::cout << "--------" << std::endl;
    }
    return 0;
}
-- Dr. Patrik Huber
Centre for Vision, Speech and Signal Processing
University of Surrey
Guildford, Surrey GU2 7XH
United Kingdom
Web: www.patrikhuber.ch
Mobile: +44 (0)7482 633 934