Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang


*To*: eigen@xxxxxxxxxxxxxxxxxxx

*Subject*: Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang

*From*: Patrik Huber <patrikhuber@xxxxxxxxx>

*Date*: Thu, 8 Feb 2018 20:08:08 +0000

Hi all,

Thank you very much Oleg, Christoph and Edward! Absolutely fantastic that you are able to help! :-)

Edward, this is brilliant. After compiling my benchmark on my machine with the same flags plus an added -D__FMA__ flag, I can see a 1.5-2x speed increase, and MSVC is now as fast as gcc! Wow.

Btw, I also noticed the speed drop with /GL. I reported this to MS yesterday: https://developercommunity.visualstudio.com/content/problem/194951/gl-results-in-15-2x-worse-run-time.html

It seems you have worked out why this happens. I also think /GL may even inhibit the emission of AVX and AVX2 code.

>>Apparently, one also needs to supply /fp:fast in addition to /arch:AVX2 to enable FMA code generation on MSVC..

I think this is incorrect though. I think /fp:fast is not needed for MSVC to generate FMA code. Likewise, gcc and clang can generate FMA code without -ffast-math (which I guess is roughly equivalent to /fp:fast).

So, I think we solved this already. The speed-gain is amazing. Can we include this detection mechanism for MSVC into the next Eigen release?

Thank you very much again to everyone,

Patrik

On 8 February 2018 at 19:17, Edward Lam <edward@xxxxxxxxxx> wrote:

PS. I should note that adding whole program optimization (even for a single .cpp file!) causes VS2017 to suddenly not generate FMA instructions again. So it's important to *NOT* use /GL with cl.exe..

On 2/8/2018 2:14 PM, Edward Lam wrote:

On 2/8/2018 10:19 AM, Christoph Hertzberg wrote:

Could you try writing a small AVX program which uses the `_mm256_fmadd_ps(a,b,c)` intrinsic and see if it compiles with MSVC? Perhaps only our `#ifdef __FMA__` test does not work with MSVC. (It would be interesting to know how to detect FMA support then)

That works! For detection, the documentation at https://msdn.microsoft.com/en-us/library/b0084kay.aspx suggests that perhaps this will work:

#if defined(_MSC_VER) && defined(__AVX2__)

#define __FMA__

#endif
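To make the detection idea concrete, here is a minimal sketch of how the check and the `_mm256_fmadd_ps` test could be combined in one file. It is not Eigen's code: `DEMO_HAS_FMA`, `fma_path_enabled` and `multiply_add` are hypothetical names chosen for illustration, and the sketch assumes (per the MSDN page above) that MSVC defines `__AVX2__` under /arch:AVX2 but does not define `__FMA__` itself. A project-local macro is used instead of defining `__FMA__` directly, and a scalar loop handles the tail and any build without FMA.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: gcc/clang define __FMA__ with -mfma; MSVC only defines
// __AVX2__ (with /arch:AVX2), so we treat that as implying FMA support.
#if defined(__FMA__) || (defined(_MSC_VER) && defined(__AVX2__))
#define DEMO_HAS_FMA 1
#include <immintrin.h>
#else
#define DEMO_HAS_FMA 0
#endif

// Reports whether the FMA code path was compiled in.
bool fma_path_enabled() { return DEMO_HAS_FMA != 0; }

// Computes out[i] = a[i] * b[i] + c[i] for equally sized inputs.
std::vector<float> multiply_add(const std::vector<float>& a,
                                const std::vector<float>& b,
                                const std::vector<float>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    const std::size_t n = a.size();
    std::vector<float> out(n);
    std::size_t i = 0;
#if DEMO_HAS_FMA
    // Process 8 floats per iteration with a fused multiply-add.
    for (; i + 8 <= n; i += 8) {
        const __m256 va = _mm256_loadu_ps(&a[i]);
        const __m256 vb = _mm256_loadu_ps(&b[i]);
        const __m256 vc = _mm256_loadu_ps(&c[i]);
        _mm256_storeu_ps(&out[i], _mm256_fmadd_ps(va, vb, vc));
    }
#endif
    // Scalar loop: handles the remainder, or everything when FMA is off.
    for (; i < n; ++i) {
        out[i] = a[i] * b[i] + c[i];
    }
    return out;
}
```

Calling `fma_path_enabled()` from a small test program built with cl.exe and /arch:AVX2 would then show whether the check fired for that build.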

For reference, recompiling the earlier test with the best options plus -D__FMA__ produces:

$ ./gemm_test # 325 fmadd instructions produced

1124 1215 1465

col major (checksum: 0) elapsed_ms: 962

row major (checksum: 0) elapsed_ms: 1021

--------

1730 1235 1758

col major (checksum: 0) elapsed_ms: 1798

row major (checksum: 0) elapsed_ms: 1805

--------

1116 1736 868

col major (checksum: 0) elapsed_ms: 712

row major (checksum: 0) elapsed_ms: 712

--------

1278 1323 788

col major (checksum: 0) elapsed_ms: 578

row major (checksum: 0) elapsed_ms: 584

--------

Compared to the same compiler options *without* -D__FMA__ :

$ ./gemm_test # 125 fmadd instructions produced

1124 1215 1465

col major (checksum: 0) elapsed_ms: 1245

row major (checksum: 0) elapsed_ms: 1160

--------

1730 1235 1758

col major (checksum: 0) elapsed_ms: 2071

row major (checksum: 0) elapsed_ms: 2066

--------

1116 1736 868

col major (checksum: 0) elapsed_ms: 905

row major (checksum: 0) elapsed_ms: 905

--------

1278 1323 788

col major (checksum: 0) elapsed_ms: 711

row major (checksum: 0) elapsed_ms: 720

--------
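For anyone who wants the ratios rather than the raw timings, the short sketch below divides the time without -D__FMA__ by the time with it, using the col major numbers from the two runs above. The names `Timing`, `kColMajor`, `speedup` and `print_speedups` are made up for this example. On these figures the col major speedups come out at roughly 1.15x to 1.3x.

```cpp
#include <cassert>
#include <cstdio>

// Speedup factor: run time without the flag divided by run time with it.
double speedup(double ms_without_fma, double ms_with_fma) {
    assert(ms_with_fma > 0.0);
    return ms_without_fma / ms_with_fma;
}

// Col major timings in milliseconds, copied from the two runs above.
struct Timing {
    const char* dims;
    double without_fma;
    double with_fma;
};

const Timing kColMajor[] = {
    {"1124 1215 1465", 1245.0, 962.0},
    {"1730 1235 1758", 2071.0, 1798.0},
    {"1116 1736 868", 905.0, 712.0},
    {"1278 1323 788", 711.0, 578.0},
};

// Prints one line per matrix size with the computed speedup.
void print_speedups() {
    for (const Timing& t : kColMajor) {
        std::printf("%s: %.2fx\n", t.dims, speedup(t.without_fma, t.with_fma));
    }
}
```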

Cheers,

-Edward

Dr. Patrik Huber

Centre for Vision, Speech and Signal Processing

University of Surrey

Guildford, Surrey GU2 7XH

United Kingdom

Web: www.patrikhuber.ch

Mobile: +44 (0)7482 633 934
