Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang
Hi Gael,

Thanks for the reply.

Information on the topic of MSVC and FMA seems a bit scarce. But this blog post says that with /arch:AVX2, "The compiler will generate code that includes AVX2 and FMA instructions." (https://blogs.msdn.microsoft.com/vcblog/2014/02/28/avx2-support-in-visual-studio-c-compiler/). So I think that at least concerning the flags, the compiler should emit FMA instructions, if it can.

Since the difference I'm seeing is around 1.5x or more, it is indeed highly likely that you are correct and the MSVC code is so much slower because it doesn't emit FMA instructions, compared to gcc & clang. But if the compiler flags are set - why does it not emit FMA instructions?

Is there an FMA code path for MSVC in Eigen and are people in general seeing MSVC using FMA when using Eigen?

Thank you and best wishes,
Patrik

On 8 February 2018 at 12:40, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:

Hi,

I did not read your email carefully, but it seems that on the MSVC build you are missing FMA. Indeed, compared to AVX, AVX2 does not bring any gain for matrix-matrix multiply, only FMA does (usually around 1.5x). In contrast, with gcc/clang, -march=native activates all supported instruction sets, including FMA on recent CPUs.

gael

On Wed, Feb 7, 2018 at 3:30 PM, Patrik Huber <patrikhuber@xxxxxxxxx> wrote:

Hello,

I noticed that code I'm using is around 2x slower on VS2017 (15.5.5 and 15.6.0 Preview) than on g++-7 and clang-6. After some digging, I found that it is down to the matrix multiplication with Eigen.

The simple benchmark (see below) tests matrix multiplication with various sizes m x n * n x p where m, n, p are between 1 and 2048, and MSVC is consistently around 1.5-2x slower than g++ and clang, which is quite huge.

Here are some examples.
I'm of course using optimised builds in both cases:

    cl.exe gemm_test.cpp -I 3rdparty\eigen /EHsc /std:c++17 /arch:AVX2 /O2 /Ob2 /nologo

    1124 1215 1465
    col major (checksum: 0) elapsed_ms: 971
    row major (checksum: 0) elapsed_ms: 976
    --------
    1730 1235 1758
    col major (checksum: 0) elapsed_ms: 1771
    row major (checksum: 0) elapsed_ms: 1778
    --------
    1116 1736 868
    col major (checksum: 0) elapsed_ms: 819
    row major (checksum: 0) elapsed_ms: 834
    --------
    1278 1323 788
    col major (checksum: 0) elapsed_ms: 668
    row major (checksum: 0) elapsed_ms: 666

And gcc:

    g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=native -O3 -o gcc7_gemm_test

    1124 1215 1465
    col major (checksum: 0) elapsed_ms: 696
    row major (checksum: 0) elapsed_ms: 706
    --------
    1730 1235 1758
    col major (checksum: 0) elapsed_ms: 1294
    row major (checksum: 0) elapsed_ms: 1326
    --------
    1116 1736 868
    col major (checksum: 0) elapsed_ms: 425
    row major (checksum: 0) elapsed_ms: 418
    --------
    1278 1323 788
    col major (checksum: 0) elapsed_ms: 321
    row major (checksum: 0) elapsed_ms: 332

I fiddled around quite a lot with the MSVC flags but no other flag made anything faster.

My CPU is an i7-7700HQ with AVX2.

Now interestingly, I've run the same benchmark on an older i5-3550, which has AVX, but not AVX2. The run time on MSVC is nearly identical. But now g++ (5.4) (again with -march=native) is nearly the same speed as MSVC:

    1124 1215 1465
    col major (checksum: 0) elapsed_ms: 946
    row major (checksum: 0) elapsed_ms: 944
    --------
    1730 1235 1758
    col major (checksum: 0) elapsed_ms: 1798
    row major (checksum: 0) elapsed_ms: 1816
    --------
    1116 1736 868
    col major (checksum: 0) elapsed_ms: 687
    row major (checksum: 0) elapsed_ms: 692
    --------
    1278 1323 788
    col major (checksum: 0) elapsed_ms: 535
    row major (checksum: 0) elapsed_ms: 551

This sort-of looks to me as if the MSVC optimiser cannot make use of AVX2, if it is available on the CPU.
On the AVX2-capable CPU, MSVC is just as slow as with AVX only, while g++ and clang can really make use of AVX2 and get a 1.5-2x speed-up.

Interestingly, if I use g++-7 on the i5, I'm getting extremely bad results:

    1124 1215 1465
    col major (checksum: 0) elapsed_ms: 2007
    row major (checksum: 0) elapsed_ms: 2019
    --------
    1730 1235 1758
    col major (checksum: 0) elapsed_ms: 3941
    row major (checksum: 0) elapsed_ms: 3923
    --------
    1116 1736 868
    col major (checksum: 0) elapsed_ms: 1625
    row major (checksum: 0) elapsed_ms: 1624
    --------
    1278 1323 788
    col major (checksum: 0) elapsed_ms: 1276
    row major (checksum: 0) elapsed_ms: 1287

I believe this looks like a performance regression in g++-7, so I don't think it is relevant to the problem I'm seeing. I am trying to report this to the GCC bugtracker but they make signing up extremely hard.

If I use MSVC without the /arch:AVX2 switch, and g++-5 with -march=core2, then I am getting identical results. So it looks like with SSE3, MSVC and g++-5 are on par, but with AVX2, g++ and clang just blow away MSVC. Again I'm seeing the same performance regression with g++-7 and -march=core2: it's around 50% slower than g++-5.

The Eigen version I used is 3.3.4.

Btw I realise the benchmark is a bit crude (and might better be done with something like Google Benchmark), but I'm getting very consistent results.

So I guess my main question is: Is there anything that the Eigen developers can do, either to enable AVX2 on MSVC, or to help the MSVC optimiser?
Or is it purely a MSVC optimiser problem?

FYI I reported this to MS: https://developercommunity.visualstudio.com/content/problem/194955/vs-produces-code-that-is-15-2x-slower-than-gcc-and.html (with code attached, but the code is not visible to non-MS-employees).

If you are interested in more background information and more benchmarks, the whole thing originated here: https://github.com/Dobiasd/frugally-deep/issues/9 (but it's quite a lengthy thread).

Thank you and best wishes,
Patrik

Benchmark code:

    // gemm_test.cpp
    #include <array>
    #include <chrono>
    #include <cstddef>
    #include <iostream>
    #include <random>
    #include <string>
    #include <vector>
    #include <Eigen/Dense>

    using RowMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
    using ColMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>;

    // Times 10 products of an (s1 x s2) matrix with an (s2 x s3) matrix of type Mat.
    template <typename Mat>
    void run_test(const std::string& name, int s1, int s2, int s3)
    {
        using namespace std::chrono;
        float checksum = 0.0f; // to prevent compiler from optimizing everything away
        const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
        for (std::size_t i = 0; i < 10; ++i)
        {
            // The matrices are only sized here, not filled with data.
            Mat a_rm(s1, s2);
            Mat b_rm(s2, s3);
            const auto c_rm = a_rm * b_rm;
            checksum += c_rm(0, 0);
        }
        const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
        const auto elapsed_ms = (end_time_ns - start_time_ns) / 1000000;
        std::cout << name << " (checksum: " << checksum << ") elapsed_ms: " << elapsed_ms << std::endl;
    }

    int main()
    {
        //std::random_device rd;
        //std::mt19937 gen(0);
        //std::uniform_int_distribution<> dis(1, 2048);
        // Four fixed (s1, s2, s3) size triples, stored back to back.
        std::vector<int> vals = { 1124, 1215, 1465, 1730, 1235, 1758, 1116, 1736, 868, 1278, 1323, 788 };
        for (std::size_t i = 0; i < 12; ++i)
        {
            int s1 = vals[i++]; //dis(gen);
            int s2 = vals[i++]; //dis(gen);
            int s3 = vals[i];   //dis(gen);
            std::cout << s1 << " " << s2 << " " << s3 << std::endl;
            run_test<ColMajorMatrixXf>("col major", s1, s2, s3);
            run_test<RowMajorMatrixXf>("row major", s1, s2, s3);
            std::cout << "--------" << std::endl;
        }
        return 0;
    }

--
Dr. Patrik Huber
Centre for Vision, Speech and Signal Processing
University of Surrey
Guildford, Surrey GU2 7XH
United Kingdom
Web: www.patrikhuber.ch
Mobile: +44 (0)7482 633 934