Hello again,
 
 I've tried extracting some pieces of the CFD code we're working on into independent benchmarks again to help nail down these regressions. 
 This time I've mainly used gcc-7.3.0 (due to the problems discovered with gcc-8), running on a 2.9 GHz Intel Core i9 (15" MacBook Pro), and I've been using the Google Benchmark library (https://github.com/google/benchmark) to get more reliable measurements. The code was compiled with:
 
g++-7 -march=native -O3 -DNDEBUG -Wno-deprecated-declarations -ffast-math -fno-finite-math-only \
    eigen_bench4.cpp \
    -I /usr/local/Cellar/google-benchmark/1.4.1 -L /usr/local/Cellar/google-benchmark/1.4.1/lib \
    -l benchmark -l benchmark_main -I eigen
 So AVX and FMA are in use. Note that I'm not actually interested in the sums of the result vectors that often appear inside the benchmark::DoNotOptimize(...) calls; I just wanted something that depends on all values to prevent the complete expression from being optimized away.
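
 For reference, a benchmark in this style looks roughly like the following (a minimal sketch only: the Vec type, someKernel() and BM_Example are placeholders rather than the actual kernels from eigen_bench4.cpp; linking against benchmark_main provides main()):

#include <benchmark/benchmark.h>
#include <Eigen/Dense>

using Vec = Eigen::Matrix<double, 5, 1>;  // placeholder size

// Hypothetical stand-in for the real Augment / ConvectionFlux / EigenMul kernels.
static Vec someKernel(const Vec& x) { return x.dot(x) * x; }

static void BM_Example(benchmark::State& state)
{
  for (auto _ : state)
  {
    // Fresh (rand()-based) input per iteration; this dominates the run-time.
    const Vec input = Vec::Random();
    const Vec result = someKernel(input);
    // The sum depends on all coefficients, so the expression cannot be dropped.
    benchmark::DoNotOptimize(result.sum());
  }
}
BENCHMARK(BM_Example);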
 Using Eigen-3.2.9:

maven@feuerlilie ~/D/d/d/eigen-perf> ./bench_32 --benchmark_repetitions=10 --benchmark_report_aggregates_only=true
2018-08-22 15:08:04 
Running ./bench_32 
Run on (12 X 2900 MHz CPU s) 
CPU Caches: 
  L1 Data 32K (x6) 
  L1 Instruction 32K (x6) 
  L2 Unified 262K (x6) 
  L3 Unified 12582K (x1) 
---------------------------------------------------------------- 
Benchmark                         Time           CPU Iterations 
---------------------------------------------------------------- 
BM_Augment_mean                  22 ns         22 ns   31277228 
BM_Augment_median                22 ns         22 ns   31277228 
BM_Augment_stddev                 0 ns          0 ns   31277228 
BM_ConvectionFlux_mean           34 ns         34 ns   20305570 
BM_ConvectionFlux_median         34 ns         34 ns   20305570 
BM_ConvectionFlux_stddev          0 ns          0 ns   20305570 
BM_EigenMul_mean                 23 ns         23 ns   30841624 
BM_EigenMul_median               23 ns         23 ns   30841624 
BM_EigenMul_stddev                0 ns          0 ns   30841624 
 
Using Eigen-3.3.5:

maven@feuerlilie ~/D/d/d/eigen-perf> ./bench_33 --benchmark_repetitions=10 --benchmark_report_aggregates_only=true
2018-08-22 15:08:31 
Running ./bench_33 
Run on (12 X 2900 MHz CPU s) 
CPU Caches: 
  L1 Data 32K (x6) 
  L1 Instruction 32K (x6) 
  L2 Unified 262K (x6) 
  L3 Unified 12582K (x1) 
---------------------------------------------------------------- 
Benchmark                         Time           CPU Iterations 
---------------------------------------------------------------- 
BM_Augment_mean                  26 ns         26 ns   26066589 
BM_Augment_median                26 ns         26 ns   26066589 
BM_Augment_stddev                 0 ns          0 ns   26066589 
BM_ConvectionFlux_mean           41 ns         41 ns   17082879 
BM_ConvectionFlux_median         41 ns         41 ns   17082879 
BM_ConvectionFlux_stddev          1 ns          1 ns   17082879 
BM_EigenMul_mean                 24 ns         24 ns   29441577 
BM_EigenMul_median               24 ns         24 ns   29441577 
BM_EigenMul_stddev                0 ns          0 ns   29441577 
 First off, most of the run-time comes from calling rand() a lot in each iteration, but that overhead should be the same for both versions. As an aside, if you compile the code with the FIXED_INPUT #define, you can see that the compiler has an "easier" time "seeing through" the older Eigen-3.2 and is able to hoist the computation out of the benchmark loop in more cases. This might also be part of the actual cause, because, for example, not all of the extra outputs of Augment() are always used.
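
 Roughly, the FIXED_INPUT switch amounts to something like the following (a sketch only; the Vec size and the makeInput() helper are placeholders rather than the actual code from eigen_bench4.cpp):

#include <Eigen/Dense>

using Vec = Eigen::Matrix<double, 5, 1>;  // placeholder size

// With FIXED_INPUT defined the input is a compile-time constant, so the
// compiler gets a chance to hoist (parts of) the computation out of the
// benchmark loop; without it, every call produces a fresh rand()-based input.
static Vec makeInput()
{
#ifdef FIXED_INPUT
  return Vec::Ones();
#else
  return Vec::Random();
#endif
}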
 Generally, the run-time seems to have increased with Eigen-3.3 (except for EigenMul), unless I'm measuring wrong once again, but I have difficulty pinning this on any particular change in Eigen.
 clang (Apple LLVM version 9.1.0 (clang-902.0.39.2)) seems less problematic, and using it I don't
 see any performance regressions (other than compile time ;)). Unfortunately, most people will probably be using gcc with our code. 
 If you have any ideas on causes or possible improvements (either through changing our code, or through modifications to Eigen), please let me know. Eigen-3.3 has quite a few fixes we'd like to pick up (AutoDiffScalar mainly, but also others), but at the moment it's a bit of a hard sell.
 
Best regards
 
Daniel Vollmer
 
-------------------------- 
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR) 
German Aerospace Center 
Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany
 
Daniel Vollmer | AS C²A²S²E 
www.DLR.de
 
 From: Gael Guennebaud [gael.guennebaud@xxxxxxxxx]
 Sent: Thursday, 16 August 2018 0:23
 To: eigen
 Subject: Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)
 
 
 
ok, now regarding the difference between 3.2 and 3.3 with gcc 8, I found that changing the copies to:
 
 
Vec dirA; dirA = sum;
Vec dirB; dirB = dirA;
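
Applied to the reduced loop from the mail quoted below, that change would look roughly like this (a sketch; the runLoop() wrapper is just for illustration):

#include <Eigen/Dense>

using Vec = Eigen::Matrix<double, 2, 1>;

// The reduced loop from eigen_bench3.cpp (quoted below), with the copy
// construction replaced by default construction + assignment.
Vec runLoop(int num)
{
  Vec sum = Vec::Zero();
  for (int i = 0; i < num; ++i)
  {
    Vec dirA; dirA = sum;   // was: const Vec dirA = sum;
    Vec dirB; dirB = dirA;  // was: const Vec dirB = dirA;
    sum += dirA.dot(dirB) * dirA;
  }
  return sum;
}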
 
 
 gael
I still cannot reproduce; with -O2 -DNDEBUG I get the following assembly with Eigen 3.3 and gcc 7: 
 
L127:
        movapd   %xmm1, %xmm0
        addl     $1, %eax
        mulpd    %xmm1, %xmm0
        cmpl     %eax, %ebx
        movapd   %xmm0, %xmm2
        unpckhpd %xmm0, %xmm2
        addsd    %xmm2, %xmm0
        unpcklpd %xmm0, %xmm0
        mulpd    %xmm1, %xmm0
        addpd    %xmm0, %xmm1
        jne      L127
 On my side, the only difference with 3.2 is the use of haddpd in 3.2, which we disabled in 3.3 because it is pretty slow compared to movapd+unpcklpd+addps. 
 Checking on compiler-explorer (https://godbolt.org/g/9bkjfu) it seems this particular issue is only reproducible with gcc 8, so it is definitely a gcc regression.
 gael 
 
Hello again,
 I've been trying to understand a bit better what is happening with the performance regression I'm seeing, and at the moment I am under the impression that Eigen-3.3 makes it harder (impossible?) for gcc to recognize when no aliasing is happening.
 
 I've further reduced my original example to essentially the following loop (see eigen_bench3.cpp for a self-contained version):
 using Vec = Eigen::Matrix<double, 2, 1>;

 Vec sum = Vec::Zero();
 for (int i = 0; i < num; ++i)
 {
   const Vec dirA = sum;
   const Vec dirB = dirA;

   sum += dirA.dot(dirB) * dirA;
 }
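
 For completeness, a stand-alone version of just this loop (a sketch approximating eigen_bench3.cpp; the volatile trip count and the final use of sum are only there to keep the loop from being removed entirely):

#include <Eigen/Dense>

using Vec = Eigen::Matrix<double, 2, 1>;

int main()
{
  volatile int n = 1000;  // opaque trip count
  const int num = n;

  Vec sum = Vec::Zero();

  EIGEN_ASM_COMMENT("begin loop");
  for (int i = 0; i < num; ++i)
  {
    const Vec dirA = sum;
    const Vec dirB = dirA;

    sum += dirA.dot(dirB) * dirA;
  }
  EIGEN_ASM_COMMENT("end   loop");

  // Use the result so the whole computation cannot be dropped.
  return sum.sum() == 0.0 ? 0 : 1;
}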
 
 When using Eigen-3.3, gcc-8 always spills back to memory when evaluating the expression, whereas with Eigen-3.2 the loop runs completely without spilling. As there should be enough registers available, I assume this is due to "fear of aliasing".
 
 This is independent of vectorization and all the other nice things; it happens with vectorization disabled as well.
 
 Interestingly (?), when I change the size of Vec from 2 to 1, the generated code is identical for 3.3 and 3.2.
 
 The following is the x86-64 assembly for the relevant loop (using only -O1 with vectorization disabled, for clarity; the problem remains at higher optimization levels):
 
 Eigen-3.3:
 # eigen_bench3.cpp:18:   EIGEN_ASM_COMMENT("begin loop");
 test    ebx, ebx        # iftmp.0_3
 jle     .L131   #,
 # eigen_bench3.cpp:19:   for (int i = 0; i < num; ++i)
 mov     edx, 0  # i,
 .L132:
 # eigen-3.3.5/Eigen/src/Core/DenseStorage.h:194:     DenseStorage(const DenseStorage& other) : m_data(other.m_data) {
 mov     rsi, QWORD PTR [rsp+32] # MEM[(const struct DenseStorage &)&sum].m_data, MEM[(const struct DenseStorage &)&sum].m_data
 mov     rdi, QWORD PTR [rsp+40] # MEM[(const struct DenseStorage &)&sum].m_data, MEM[(const struct DenseStorage &)&sum].m_data
 mov     QWORD PTR [rsp], rsi    # MEM[(struct DenseStorage *)&dirA].m_data, MEM[(const struct DenseStorage &)&sum].m_data
 mov     QWORD PTR [rsp+8], rdi  # MEM[(struct DenseStorage *)&dirA].m_data, MEM[(const struct DenseStorage &)&sum].m_data
 mov     QWORD PTR [rsp+16], rsi # MEM[(struct DenseStorage *)&dirB].m_data, MEM[(const struct DenseStorage &)&sum].m_data
 mov     QWORD PTR [rsp+24], rdi # MEM[(struct DenseStorage *)&dirB].m_data, MEM[(const struct DenseStorage &)&sum].m_data
 # eigen-3.3.5/Eigen/src/Core/GenericPacketMath.h:171:         const Packet& b) { return a*b; }
 vmovsd  xmm0, QWORD PTR [rsp+8] # _26, MEM[(const double &)&dirA + 8]
 vmovsd  xmm1, QWORD PTR [rsp]   # _28, MEM[(const double &)&dirA]
 # eigen-3.3.5/Eigen/src/Core/GenericPacketMath.h:171:         const Packet& b) { return a*b; }
 vmulsd  xmm2, xmm0, QWORD PTR [rsp+24]  # tmp123, _26, MEM[(struct plain_array *)&dirB + 8B]
 vmulsd  xmm3, xmm1, QWORD PTR [rsp+16]  # tmp124, _28, MEM[(struct plain_array *)&dirB]
 # eigen-3.3.5/Eigen/src/Core/functors/BinaryFunctors.h:42:   EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const result_type operator() (const LhsScalar& a, const RhsScalar& b) const { return a + b; }
 vaddsd  xmm2, xmm2, xmm3        # _30, tmp123, tmp124
 # eigen-3.3.5/Eigen/src/Core/functors/BinaryFunctors.h:86:   EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const result_type operator() (const LhsScalar& a, const RhsScalar& b) const { return a * b; }
 vmulsd  xmm1, xmm1, xmm2        # tmp125, _28, _30
 # eigen-3.3.5/Eigen/src/Core/functors/AssignmentFunctors.h:49:   EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void assignCoeff(DstScalar& a, const SrcScalar& b) const { a += b; }
 vaddsd  xmm1, xmm1, QWORD PTR [rsp+32]  # tmp126, tmp125, MEM[(double &)&sum]
 # eigen-3.3.5/Eigen/src/Core/functors/AssignmentFunctors.h:49:   EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void assignCoeff(DstScalar& a, const SrcScalar& b) const { a += b; }
 vmovsd  QWORD PTR [rsp+32], xmm1        # MEM[(double &)&sum], tmp126
 # eigen-3.3.5/Eigen/src/Core/functors/BinaryFunctors.h:86:   EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const result_type operator() (const LhsScalar& a, const RhsScalar& b) const { return a * b; }
 vmulsd  xmm0, xmm0, xmm2        # tmp128, _26, _30
 # eigen-3.3.5/Eigen/src/Core/functors/AssignmentFunctors.h:49:   EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void assignCoeff(DstScalar& a, const SrcScalar& b) const { a += b; }
 vaddsd  xmm0, xmm0, QWORD PTR [rsp+40]  # tmp129, tmp128, MEM[(double &)&sum + 8]
 # eigen-3.3.5/Eigen/src/Core/functors/AssignmentFunctors.h:49:   EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void assignCoeff(DstScalar& a, const SrcScalar& b) const { a += b; }
 vmovsd  QWORD PTR [rsp+40], xmm0        # MEM[(double &)&sum + 8], tmp129
 # eigen_bench3.cpp:19:   for (int i = 0; i < num; ++i)
 inc     edx     # i
 # eigen_bench3.cpp:19:   for (int i = 0; i < num; ++i)
 cmp     ebx, edx        # iftmp.0_3, i
 jne     .L132   #,
 .L131:
 # eigen_bench3.cpp:28:   EIGEN_ASM_COMMENT("end   loop");
 
 
 
 
 And the same for Eigen-3.2:
 
 # eigen_bench3.cpp:18:   EIGEN_ASM_COMMENT("begin loop");
 test    ebx, ebx        # iftmp.0_3
 jle     .L131   #,
 vmovsd  xmm2, QWORD PTR [rsp+16]        # sum__lsm.104, MEM[(const Scalar &)&sum]
 vmovsd  xmm1, QWORD PTR [rsp+24]        # sum__lsm.105, MEM[(const Scalar &)&sum + 8]
 # eigen_bench3.cpp:19:   for (int i = 0; i < num; ++i)
 mov     edx, 0  # i,
 vmovsd  xmm4, QWORD PTR .LC6[rip]       # tmp117,
 .L132:
 # eigen-3.2.9-mod/Eigen/src/Core/GenericPacketMath.h:114:         const Packet& b) { return a*b; }
 vmulsd  xmm0, xmm1, xmm1        # tmp114, sum__lsm.105, sum__lsm.105
 vmulsd  xmm3, xmm2, xmm2        # tmp115, sum__lsm.104, sum__lsm.104
 # eigen-3.2.9-mod/Eigen/src/Core/Functors.h:26:   EIGEN_STRONG_INLINE const Scalar operator() (const Scalar& a, const Scalar& b) const { return a + b; }
 vaddsd  xmm0, xmm0, xmm3        # tmp116, tmp114, tmp115
 vaddsd  xmm0, xmm0, xmm4        # _52, tmp116, tmp117
 vmulsd  xmm2, xmm2, xmm0        # sum__lsm.104, sum__lsm.104, _52
 vmulsd  xmm1, xmm1, xmm0        # sum__lsm.105, sum__lsm.105, _52
 # eigen_bench3.cpp:19:   for (int i = 0; i < num; ++i)
 inc     edx     # i
 # eigen_bench3.cpp:19:   for (int i = 0; i < num; ++i)
 cmp     ebx, edx        # iftmp.0_3, i
 jne     .L132   #,
 vmovsd  QWORD PTR [rsp+16], xmm2        # MEM[(const Scalar &)&sum], sum__lsm.104
 vmovsd  QWORD PTR [rsp+24], xmm1        # MEM[(const Scalar &)&sum + 8], sum__lsm.105
 .L131:
 # eigen_bench3.cpp:28:   EIGEN_ASM_COMMENT("end   loop");
 
 
 Best regards
 
 Daniel Vollmer
 
 --------------------------
 Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
 German Aerospace Center
 Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany
 
 Daniel Vollmer | AS C²A²S²E
 www.DLR.de
 ________________________________________
 From: Gael Guennebaud [gael.guennebaud@xxxxxxxxx]
 Sent: Thursday, 2 August 2018 0:10
 To: eigen
 Subject: Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)
 
 Hi,
 
 I tried your little benchmark and with gcc 7, I got no difference at all (-O3 -DNDEBUG):
 
 // 3.2: 7.75s
 // 3.3: 7.68s
 
 // 3.2 -march=native: 7.46s
 // 3.3 -march=native: 7.48s
 
 I ran each test 4 times and kept the best of each; the variations were about 0.2s.
 
 gael
 
 
 On Wed, Aug 1, 2018 at 1:28 PM <Daniel.Vollmer@xxxxxx> wrote:
 Hi,
 
 extracting the relevant code is rather difficult, unfortunately.
 
 Attached you will find an extracted subset, which may point to one problematic area with partial vectorization, as for this example Eigen 3.3 is ~13% slower (unless I'm measuring incorrectly).
 
 If I turn off partial vectorization, the run-time is the same. This is not completely representative (obviously), because in my "complete" runs disabling partial vectorization only decreases the run-time difference between 3.3 and 3.2, but does not eliminate it.
 
 
 Best regards
 
 Daniel Vollmer
 
 --------------------------
 Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
 German Aerospace Center
 Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany
 
 Daniel Vollmer | AS C²A²S²E
 www.DLR.de
 
 ________________________________________
 From: Marc Glisse [marc.glisse@xxxxxxxx]
 Sent: Wednesday, 1 August 2018 12:06
 To: eigen@xxxxxxxxxxxxxxxxxxx
 Subject: Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)
 
 On Wed, 1 Aug 2018, Daniel.Vollmer@xxxxxx wrote:
 
 > I've attached a document with some performance measurements for
 > different compilers, different Eigen versions, and 3 different
 > test-cases for our code (tau, cgns, dg) that stress different areas /
 > sizes.
 
 Would it be hard to create small testcases that you could share and that
 would still show a similar performance pattern? I assume that the easier
 it is for the Eigen devs to reproduce, the more likely they are to investigate.
 
 --
 Marc Glisse
 
 