Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang
Hi,

I did not read your email carefully, but it seems that on the MSVC build you are missing FMA. Indeed, compared to AVX, AVX2 does not bring any gain for matrix-matrix multiply; only FMA does (usually around 1.5x). In contrast, with gcc/clang, -march=native activates all supported instruction sets, including FMA on recent CPUs.

gael

On Wed, Feb 7, 2018 at 3:30 PM, Patrik Huber <patrikhuber@xxxxxxxxx> wrote:

Hello,

I noticed that code I'm using is around 2x slower on VS2017 (15.5.5 and 15.6.0 Preview) than on g++-7 and clang-6. After some digging, I found that it is down to the matrix multiplication with Eigen.

The simple benchmark (see below) tests matrix multiplication with various sizes m x n * n x p, where m, n, p are between 1 and 2048, and MSVC is consistently around 1.5-2x slower than g++ and clang, which is quite huge.

Here are some examples. I'm of course using optimised builds in both cases:

    cl.exe gemm_test.cpp -I 3rdparty\eigen /EHsc /std:c++17 /arch:AVX2 /O2 /Ob2 /nologo
    1124 1215 1465
    col major (checksum: 0) elapsed_ms: 971
    row major (checksum: 0) elapsed_ms: 976
    --------
    1730 1235 1758
    col major (checksum: 0) elapsed_ms: 1771
    row major (checksum: 0) elapsed_ms: 1778
    --------
    1116 1736 868
    col major (checksum: 0) elapsed_ms: 819
    row major (checksum: 0) elapsed_ms: 834
    --------
    1278 1323 788
    col major (checksum: 0) elapsed_ms: 668
    row major (checksum: 0) elapsed_ms: 666

And gcc:

    g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=native -O3 -o gcc7_gemm_test
    1124 1215 1465
    col major (checksum: 0) elapsed_ms: 696
    row major (checksum: 0) elapsed_ms: 706
    --------
    1730 1235 1758
    col major (checksum: 0) elapsed_ms: 1294
    row major (checksum: 0) elapsed_ms: 1326
    --------
    1116 1736 868
    col major (checksum: 0) elapsed_ms: 425
    row major (checksum: 0) elapsed_ms: 418
    --------
    1278 1323 788
    col major (checksum: 0) elapsed_ms: 321
    row major (checksum: 0) elapsed_ms: 332

I fiddled around quite a lot with the MSVC flags but no other flag made anything faster.

My CPU is an i7-7700HQ with AVX2.

Now
interestingly, I've run the same benchmark on an older i5-3550, which has AVX but not AVX2. The run time on MSVC is nearly identical. But now g++ (5.4) (again with -march=native) is nearly the same speed as MSVC:

    1124 1215 1465
    col major (checksum: 0) elapsed_ms: 946
    row major (checksum: 0) elapsed_ms: 944
    --------
    1730 1235 1758
    col major (checksum: 0) elapsed_ms: 1798
    row major (checksum: 0) elapsed_ms: 1816
    --------
    1116 1736 868
    col major (checksum: 0) elapsed_ms: 687
    row major (checksum: 0) elapsed_ms: 692
    --------
    1278 1323 788
    col major (checksum: 0) elapsed_ms: 535
    row major (checksum: 0) elapsed_ms: 551

This sort of looks to me as if the MSVC optimiser cannot make use of AVX2 when it is available on the CPU. It's just as slow as with AVX only, while g++ and clang can really make use of AVX2 and get a 1.5-2x speed-up.

Interestingly, if I use g++-7 on the i5, I'm getting extremely bad results:

    1124 1215 1465
    col major (checksum: 0) elapsed_ms: 2007
    row major (checksum: 0) elapsed_ms: 2019
    --------
    1730 1235 1758
    col major (checksum: 0) elapsed_ms: 3941
    row major (checksum: 0) elapsed_ms: 3923
    --------
    1116 1736 868
    col major (checksum: 0) elapsed_ms: 1625
    row major (checksum: 0) elapsed_ms: 1624
    --------
    1278 1323 788
    col major (checksum: 0) elapsed_ms: 1276
    row major (checksum: 0) elapsed_ms: 1287

I believe this looks like a performance regression in g++-7, so I don't think it is relevant to the problem I'm seeing. I am trying to report this to the GCC bugtracker but they make signing up extremely hard.

If I use MSVC without the /arch:AVX2 switch, and g++5 with -march=core2, then I am getting identical results.
So it looks like with SSE3, MSVC and g++5 are on par, but with AVX2, g++ and clang just blow away MSVC. Again, I'm seeing the same performance regression with g++7 and -march=core2; it's around 50% slower than g++-5.

The Eigen version I used is 3.3.4.

Btw, I realise the benchmark is a bit crude (and might better be done with something like Google Benchmark), but I'm getting very consistent results.

So I guess my main question is: Is there anything that the Eigen developers can do, either to enable AVX2 on MSVC, or to help the MSVC optimiser? Or is it purely an MSVC optimiser problem?

FYI, I reported this to MS: https://developercommunity.visualstudio.com/content/problem/194955/vs-produces-code-that-is-15-2x-slower-than-gcc-and.html (with code attached, but the code is not visible to non-MS-employees).

If you are interested in more background information and more benchmarks, the whole thing originated here: https://github.com/Dobiasd/frugally-deep/issues/9 (but it's quite a lengthy thread).

Thank you and best wishes,
Patrik

Benchmark code:

    // gemm_test.cpp
    #include <array>
    #include <chrono>
    #include <cstddef>
    #include <iostream>
    #include <random>
    #include <string>
    #include <vector>

    #include <Eigen/Dense>

    using RowMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
    using ColMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>;

    // Times 10 products of an (s1 x s2) matrix with an (s2 x s3) matrix of type Mat.
    template <typename Mat>
    void run_test(const std::string& name, int s1, int s2, int s3)
    {
        using namespace std::chrono;
        float checksum = 0.0f; // to prevent compiler from optimizing everything away
        const auto start_time = high_resolution_clock::now();
        for (std::size_t i = 0; i < 10; ++i)
        {
            // Note: the matrices are left uninitialised.
            Mat a_rm(s1, s2);
            Mat b_rm(s2, s3);
            const auto c_rm = a_rm * b_rm;
            checksum += c_rm(0, 0);
        }
        const auto end_time = high_resolution_clock::now();
        const auto elapsed_ms = duration_cast<milliseconds>(end_time - start_time).count();
        std::cout << name << " (checksum: " << checksum << ") elapsed_ms: " << elapsed_ms << std::endl;
    }

    int main()
    {
        //std::random_device rd;
        //std::mt19937 gen(0);
        //std::uniform_int_distribution<> dis(1, 2048);
        // Four fixed (s1, s2, s3) size triples.
        std::vector<int> vals = { 1124, 1215, 1465, 1730, 1235, 1758, 1116, 1736, 868, 1278, 1323, 788 };
        for (std::size_t i = 0; i + 2 < vals.size(); i += 3)
        {
            int s1 = vals[i];     //dis(gen);
            int s2 = vals[i + 1]; //dis(gen);
            int s3 = vals[i + 2]; //dis(gen);
            std::cout << s1 << " " << s2 << " " << s3 << std::endl;
            run_test<ColMajorMatrixXf>("col major", s1, s2, s3);
            run_test<RowMajorMatrixXf>("row major", s1, s2, s3);
            std::cout << "--------" << std::endl;
        }
        return 0;
    }

--
Dr. Patrik Huber
Centre for Vision, Speech and Signal Processing
University of Surrey
Guildford, Surrey GU2 7XH
United Kingdom
Web: www.patrikhuber.ch
Mobile: +44 (0)7482 633 934