Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]

To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang
From: Christoph Hertzberg <chtz@xxxxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 8 Feb 2018 16:19:01 +0100

Could you try writing a small AVX program which uses the`_mm256_fmadd_ps(a,b,c)` intrinsic and see if it compiles with MSVC?Perhaps only our `#ifdef __FMA__` test does not work with MSVC. (Itwould be interesting to know how to detect FMA support then)


Christoph

On 2018-02-08 15:04, Edward Lam wrote:

Apparently, one also needs to supply /fp:fast in addition to /arch:AVX2to enable FMA code generation on MSVC.
However, even after I did this, I did not see a speed improvement inPatrick's benchmark code when using Eigen 3.3.4, VS2017 15.5.5 and thesecompiler options:
     /O2 /Fa /std:c++17 /arch:AVX2 /fp:fast
     /D_SILENCE_CXX17_NEGATORS_DEPRECATION_WARNING
I've even confirmed this by grepping through the disassembly aftercompiling gemm_test.cpp with these cl.exe options:
     $ grep vfmadd gemm_test.asm | wc -l
     125

So it doesn't appear to make a difference in this case.

-Edward

On 2/8/2018 7:40 AM, Gael Guennebaud wrote:
Hi,
I did not read carefully your email, but it seems that on the MSVCbuild you are missing FMA. Indeed, Compared to AVX, AVX2 does notbring any gain for matrix-matrix multiply, only FMA does (usuallyaround 1.5). In contrast, with gcc/clang -march=native activate allsupported instruction sets, including FMA on recent CPUs.
gael
On Wed, Feb 7, 2018 at 3:30 PM, Patrik Huber <patrikhuber@xxxxxxxxx<mailto:patrikhuber@xxxxxxxxx>> wrote:
    Hello,
I noticed that code I'm using is around 2x slower on VS2017(15.5.5 and 15.6.0 Preview) than on g++-7 and clang-6. After some digging, Ifound that
    it is down to the matrix multiplication with Eigen.
The simple benchmark (see below) tests matrix multiplication withvarious
    sizes m x n * n x p where m, n, p are between 1 and 2048, and MSVC is
consistently around 1.5-2x slower than g++ and clang, which isquite huge.
Here are some examples. I'm of course using optimised builds inboth cases:
cl.exe gemm_test.cpp -I 3rdparty\eigen /EHsc /std:c++17 /arch:AVX2/O2 /Ob2
    /nologo

    1124 1215 1465
    col major (checksum: 0) elapsed_ms: 971
    row major (checksum: 0) elapsed_ms: 976
    --------
    1730 1235 1758
    col major (checksum: 0) elapsed_ms: 1771
    row major (checksum: 0) elapsed_ms: 1778
    --------
    1116 1736 868
    col major (checksum: 0) elapsed_ms: 819
    row major (checksum: 0) elapsed_ms: 834
    --------
    1278 1323 788
    col major (checksum: 0) elapsed_ms: 668
    row major (checksum: 0) elapsed_ms: 666

    And gcc:
g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=native-O3 -o
    gcc7_gemm_test

    1124 1215 1465
    col major (checksum: 0) elapsed_ms: 696
    row major (checksum: 0) elapsed_ms: 706
    --------
    1730 1235 1758
    col major (checksum: 0) elapsed_ms: 1294
    row major (checksum: 0) elapsed_ms: 1326
    --------
    1116 1736 868
    col major (checksum: 0) elapsed_ms: 425
    row major (checksum: 0) elapsed_ms: 418
    --------
    1278 1323 788
    col major (checksum: 0) elapsed_ms: 321
    row major (checksum: 0) elapsed_ms: 332
I fiddled around quite a lot with the MSVC flags but no other flagmade
    anything faster.

    My CPU is an i7-7700HQ with AVX2.
Now interestingly, I've run the same benchmark on an olderi5-3550, which
    has AVX, but not AVX2.
    The run time on MSVC is nearly identical.
But now g++ (5.4) (again with -march=native) is nearly the samespeed as MSVC:
    1124 1215 1465
    col major (checksum: 0) elapsed_ms: 946
    row major (checksum: 0) elapsed_ms: 944
    --------
    1730 1235 1758
    col major (checksum: 0) elapsed_ms: 1798
    row major (checksum: 0) elapsed_ms: 1816
    --------
    1116 1736 868
    col major (checksum: 0) elapsed_ms: 687
    row major (checksum: 0) elapsed_ms: 692
    --------
    1278 1323 788
    col major (checksum: 0) elapsed_ms: 535
    row major (checksum: 0) elapsed_ms: 551
This sort-of looks to me as if the MSVC optimiser cannot make useof AVX2, if it is available on the CPU. It's just as slow as only with AVX,while g++
    and clang can really make use of AVX2 and get a 1.5-2x speed-up.
Interestingly if I use g++-7 on the i5, I'm getting extremely badresults:
    1124 1215 1465
    col major (checksum: 0) elapsed_ms: 2007
    row major (checksum: 0) elapsed_ms: 2019
    --------
    1730 1235 1758
    col major (checksum: 0) elapsed_ms: 3941
    row major (checksum: 0) elapsed_ms: 3923
    --------
    1116 1736 868
    col major (checksum: 0) elapsed_ms: 1625
    row major (checksum: 0) elapsed_ms: 1624
    --------
    1278 1323 788
    col major (checksum: 0) elapsed_ms: 1276
    row major (checksum: 0) elapsed_ms: 1287
I believe this looks like a performance regression in g++-7. So Idon't think this is relevant to the problem I'm seeing. I am trying toreport this
    to the GCC bugtracker but they make signing up extremely hard.
If I use MSVC without the /arch:AVX2 switch, and g++5 with-march=core2, then I am getting identical results. So it looks like with SSE3,MSVC and
    g++5 are on par, but with AVX2, g++ and clang just blow away MSVC.
Again I'm seeing the same performance regression with g++7 and-march=core2,
    it's around 50% slower than g++-5.

    The Eigen version I used is 3.3.4.
Btw I realise the benchmark is a bit crude (and might better bedone with something like Google Benchmark), but I'm getting very consistentresults.
    So I guess my main question is:
Is there anything that the Eigen developers can do, either toenable AVX2 on MSVC, or to help the MSVC optimiser? Or is it purely a MSVCoptimiser problem?
    FYI I reported this to MS:
https://developercommunity.visualstudio.com/content/problem/194955/vs-produces-code-that-is-15-2x-slower-than-gcc-and.html<https://developercommunity.visualstudio.com/content/problem/194955/vs-produces-code-that-is-15-2x-slower-than-gcc-and.html> (with code attached, but the code is not visible tonon-MS-employees).
If you are interested in more background information and morebenchmarks,
    the whole thing originated here:
    https://github.com/Dobiasd/frugally-deep/issues/9
    <https://github.com/Dobiasd/frugally-deep/issues/9> (but it's quite a
    lengthy thread).

    Thank you and best wishes,

    Patrik


    Benchmark code:

    // gemm_test.cpp
    #include <array>
    #include <chrono>
    #include <iostream>
    #include <random>
    #include <Eigen/Dense>

    using RowMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic,
    Eigen::Dynamic, Eigen::RowMajor>;
    using ColMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic,
    Eigen::Dynamic, Eigen::ColMajor>;

    template <typename Mat>
    void run_test(const std::string& name, int s1, int s2, int s3)
    {
         using namespace std::chrono;
         float checksum = 0.0f; // to prevent compiler from optimizing
    everything away
         const auto start_time_ns =
    high_resolution_clock::now().time_since_epoch().count();
         for (size_t i = 0; i < 10; ++i)
         {
             Mat a_rm(s1, s2);
             Mat b_rm(s2, s3);
             const auto c_rm = a_rm * b_rm;
             checksum += c_rm(0, 0);
         }
         const auto end_time_ns =
    high_resolution_clock::now().time_since_epoch().count();
         const auto elapsed_ms = (end_time_ns - start_time_ns) / 1000000;
std::cout << name << " (checksum: " << checksum << ")elapsed_ms: " <<
    elapsed_ms << std::endl;
    }
    int main()
    {
         //std::random_device rd;
         //std::mt19937 gen(0);
         //std::uniform_int_distribution<> dis(1, 2048);
std::vector<int> vals = { 1124, 1215, 1465, 1730, 1235, 1758,1116,
    1736, 868, 1278, 1323, 788 };
         for (std::size_t i = 0; i < 12; ++i)
         {
             int s1 = vals[i++];//dis(gen);
             int s2 = vals[i++];//dis(gen);
             int s3 = vals[i];//dis(gen);
             std::cout << s1 << " " << s2 << " " << s3 << std::endl;
             run_test<ColMajorMatrixXf>("col major", s1, s2, s3);
             run_test<RowMajorMatrixXf>("row major", s1, s2, s3);
             std::cout << "--------" << std::endl;
         }
         return 0;
    }


    --     Dr. Patrik Huber
    Centre for Vision, Speech and Signal Processing
    University of Surrey
    Guildford, Surrey GU2 7XH
    United Kingdom

    Web: www.patrikhuber.ch <http://www.patrikhuber.ch>
    Mobile: +44 (0)7482 633 934 <tel:+44%207482%20633934>


--
 Dr.-Ing. Christoph Hertzberg

 Besuchsadresse der Nebengeschäftsstelle:
 DFKI GmbH
 Robotics Innovation Center
 Robert-Hooke-Straße 5
 28359 Bremen, Germany

 Postadresse der Hauptgeschäftsstelle Standort Bremen:
 DFKI GmbH
 Robotics Innovation Center
 Robert-Hooke-Straße 1
 28359 Bremen, Germany

 Tel.:     +49 421 178 45-4021
 Zentrale: +49 421 178 45-0
 E-Mail:   christoph.hertzberg@xxxxxxx

 Weitere Informationen: http://www.dfki.de/robotik
 -----------------------------------------------------------------------
 Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
 Firmensitz: Trippstadter Straße 122, D-67663 Kaiserslautern
 Geschaeftsfuehrung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
 (Vorsitzender) Dr. Walter Olthoff
 Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
 Amtsgericht Kaiserslautern, HRB 2313
 Sitz der Gesellschaft: Kaiserslautern (HRB 2313)
 USt-Id.Nr.:    DE 148646973
 Steuernummer:  19/672/50006
 -----------------------------------------------------------------------

Follow-Ups:
- Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang
  - From: Edward Lam

References:
- [eigen] Matrix multiplication much slower on MSVC than on g++/clang
  - From: Patrik Huber
- Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang
  - From: Gael Guennebaud
- Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang
  - From: Edward Lam

Messages sorted by: [ date | thread ]
Prev by Date: Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang
Next by Date: Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang
Previous by thread: Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang
Next by thread: Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang

Mail converted by MHonArc 2.6.19+

http://listengine.tuxfamily.org/