Re: [eigen] Matrix multiplication much slower on MSVC than on g++/clang





On Thu, Feb 8, 2018 at 8:14 PM, Edward Lam <edward@xxxxxxxxxx> wrote:

That works! For detection, the documentation at https://msdn.microsoft.com/en-us/library/b0084kay.aspx suggests that perhaps this will work:

#if defined(_MSC_VER) && defined(__AVX2__)
#define __FMA__
#endif

To implement that, we need to make sure that AVX2 implies FMA on every architecture. This seems to be true for Intel's CPUs, but I'm not sure about AMD's.
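One way to probe the "AVX2 => FMA" question on a given machine is to read the CPUID feature bits directly. The following is only a sketch, not Eigen code: the SKETCH_* macros and the function names are hypothetical, and it defines its own macro rather than the compiler-reserved __FMA__. It assumes the usual x86 encoding (FMA3 in leaf 1, ECX bit 12; AVX2 in leaf 7, sub-leaf 0, EBX bit 5) and does not check OS support for the AVX register state.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>   // __cpuid, __cpuidex
#define SKETCH_X86_MSVC 1
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>    // __get_cpuid, __get_cpuid_count (needs a reasonably recent GCC/Clang)
#define SKETCH_X86_GNU 1
#endif

// Compile-time guess, mirroring the proposal in this thread:
// MSVC has no FMA-specific macro, so /arch:AVX2 is taken to imply FMA.
#if defined(__FMA__) || (defined(_MSC_VER) && defined(__AVX2__))
#define SKETCH_ASSUME_FMA 1
#else
#define SKETCH_ASSUME_FMA 0
#endif

// Pure decoders for the raw CPUID registers.
constexpr bool fma_bit(std::uint32_t leaf1_ecx)  { return ((leaf1_ecx >> 12) & 1u) != 0; }
constexpr bool avx2_bit(std::uint32_t leaf7_ebx) { return ((leaf7_ebx >> 5) & 1u) != 0; }

struct CpuFeatures { bool avx2; bool fma; };

// True unless the CPU reports AVX2 without FMA.
constexpr bool avx2_implies_fma(CpuFeatures f) { return !f.avx2 || f.fma; }

// Query the CPU we are running on; reports "no features" on non-x86 targets.
inline CpuFeatures query_cpu_features() {
  std::uint32_t ecx1 = 0, ebx7 = 0;
#if defined(SKETCH_X86_MSVC)
  int r[4];
  __cpuid(r, 0);                       // r[0] = highest supported basic leaf
  const int max_leaf = r[0];
  if (max_leaf >= 1) { __cpuid(r, 1);      ecx1 = static_cast<std::uint32_t>(r[2]); }
  if (max_leaf >= 7) { __cpuidex(r, 7, 0); ebx7 = static_cast<std::uint32_t>(r[1]); }
#elif defined(SKETCH_X86_GNU)
  unsigned a = 0, b = 0, c = 0, d = 0;
  if (__get_cpuid(1, &a, &b, &c, &d))          ecx1 = c;
  if (__get_cpuid_count(7, 0, &a, &b, &c, &d)) ebx7 = b;
#endif
  return CpuFeatures{avx2_bit(ebx7), fma_bit(ecx1)};
}

// False only when the build assumed FMA but the running CPU does not report it.
inline bool compile_time_assumption_holds() {
  return !SKETCH_ASSUME_FMA || query_cpu_features().fma;
}
```

Calling avx2_implies_fma(query_cpu_features()) on the AMD machines of interest would show whether the implication holds there; compile_time_assumption_holds() could serve as a startup sanity check in a test program.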


gael

 

For reference, recompiling the earlier test with the best options plus -D__FMA__ produces:

$ ./gemm_test # 325 fmadd instructions produced
1124 1215 1465
col major (checksum: 0) elapsed_ms: 962
row major (checksum: 0) elapsed_ms: 1021
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 1798
row major (checksum: 0) elapsed_ms: 1805
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 712
row major (checksum: 0) elapsed_ms: 712
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 578
row major (checksum: 0) elapsed_ms: 584
--------

Compared to the same compiler options *without* -D__FMA__ :

$ ./gemm_test # 125 fmadd instructions produced
1124 1215 1465
col major (checksum: 0) elapsed_ms: 1245
row major (checksum: 0) elapsed_ms: 1160
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 2071
row major (checksum: 0) elapsed_ms: 2066
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 905
row major (checksum: 0) elapsed_ms: 905
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 711
row major (checksum: 0) elapsed_ms: 720
--------
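To put the two runs side by side, the col major timings can be reduced to speedup ratios; they come out at roughly 1.15x to 1.3x in favour of -D__FMA__. The snippet below is just a small sketch of that arithmetic using the numbers printed above; the names (Timing, kColMajor, print_speedups) are made up for illustration.

```cpp
#include <cstdio>

// Ratio of elapsed time without -D__FMA__ to elapsed time with it.
constexpr double speedup(int without_fma_ms, int with_fma_ms) {
  return static_cast<double>(without_fma_ms) / with_fma_ms;
}

struct Timing {
  const char* label;    // the three numbers printed before each pair of results
  int without_fma_ms;   // col major, compiled without -D__FMA__
  int with_fma_ms;      // col major, compiled with -D__FMA__
};

// Col major timings copied from the two runs above.
constexpr Timing kColMajor[] = {
  {"1124 1215 1465", 1245,  962},
  {"1730 1235 1758", 2071, 1798},
  {"1116 1736 868",   905,  712},
  {"1278 1323 788",   711,  578},
};

// Print one "label  ratio" line per test case.
inline void print_speedups() {
  for (const Timing& t : kColMajor) {
    std::printf("%-16s %.2fx\n", t.label, speedup(t.without_fma_ms, t.with_fma_ms));
  }
}
```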


Cheers,
-Edward




