Re: [eigen] Eigen 3.4 release candidate 1!



Hello all,

I've compiled our CFD code with Eigen 3.4-rc1 and run a few benchmarks relative to Eigen 3.2.9.

Unfortunately, I'm still seeing the negative performance impact introduced with Eigen 3.3 (which is why we're hanging on to the 3.2 branch).

I've compiled our code with Eigen 3.2.9 as well as 3.4-rc1, with both clang and gcc, on a macOS x86-64 system (Big Sur) with the following CPU feature flags:
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 FDPEO SMEP BMI2 ERMS INVPCID PQM FPU_CSDS MPX PQE AVX512F AVX512DQ RDSEED ADX SMAP CLFSOPT CLWB IPT AVX512CD AVX512BW AVX512VL PKU AVX512VNNI MDCLEAR IBRS STIBP L1DF ACAPMSR SSBD

Compilation settings were: -std=c++17 -fopenmp -Ofast -march=native -fno-finite-math-only -DEIGEN_DONT_PARALLELIZE=1
(we split our work up among threads ourselves, but the benchmarks are single-threaded anyway).

The clang version was homebrew's llvm: stable 11.1.0 (bottled); gcc was homebrew's gcc: stable 10.2.0 (bottled).

We're using Eigen for small, fixed-size vectors and occasionally matrices of doubles (of varying lengths, e.g. 5, 6, 7, 8, 13, ...), mainly accessing individual elements, or fixed-size (length 3) segments at compile-time known offsets (indices). Occasionally, we also have matrix-vector products of these, but they probably play a smaller role. These are the "Res" (Residual computation) benchmarks, where we do this over a whole mesh with multiple loops, gradient computations, ...
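To give a rough idea, the access pattern looks something like this (a simplified, self-contained sketch; the names and values are illustrative, not from our actual code):

#include <Eigen/Dense>

// Hypothetical state vector for a PDE with 6 variables:
// density, momentum (3 components), energy, plus one extra variable.
using State = Eigen::Matrix<double, 6, 1>;

double kineticEnergy(const State& u)
{
  // Individual element access plus a fixed-size segment at a
  // compile-time known offset (the momentum sub-vector u[1..3]).
  const double rho = u(0);
  const Eigen::Matrix<double, 3, 1> momentum = u.segment<3>(1);
  return 0.5 * momentum.squaredNorm() / rho;
}

int main()
{
  State u;
  u << 1.2, 0.3, 0.0, 0.1, 2.5, 0.0;

  // Occasional small fixed-size matrix-vector product.
  const Eigen::Matrix<double, 6, 6> A = Eigen::Matrix<double, 6, 6>::Identity();
  const State v = A * u;

  return kineticEnergy(v) > 0.0 ? 0 : 1;
}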

Then we do the same thing, but with Eigen's (unsupported) AutoDiffScalar using a fairly big fixed max-size (e.g. when our PDE has 6 state variables, the AutoDiffScalar has Eigen::Dynamic derivatives with a max-size of 6 * 24 + 1). These types are quite large, but it's still faster than fully dynamic heap allocation. These are the "Jac" (Jacobian computation) benchmarks, which otherwise largely mimic the "Res" benchmarks.
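The scalar type is roughly this (a simplified sketch; the 6 * 24 + 1 maximum size is just the example from above):

#include <Eigen/Dense>
#include <unsupported/Eigen/AutoDiff>

// Derivative vector: dynamically sized at run time, but with a fixed
// maximum size so it never touches the heap (6 * 24 + 1 = 145 here).
using DerType = Eigen::Matrix<double, Eigen::Dynamic, 1, 0, 6 * 24 + 1, 1>;
using ADouble = Eigen::AutoDiffScalar<DerType>;

int main()
{
  const int numDerivatives = 13;        // actual count chosen at run time

  ADouble x(2.0, numDerivatives, 0);    // seed x as independent variable #0
  const ADouble y = x * x + 3.0 * x;

  // y.value() == 10, dy/dx == 2*x + 3 == 7
  return (y.derivatives()(0) == 7.0) ? 0 : 1;
}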

https://www.maven.de/stuff/coda_benchmarks_eigen.pdf

In these graphs everything is relative to the clang + Eigen 3.2.9 baseline, which is set to 1.0. Higher numbers are faster (2.0 would be twice as fast as the build using clang with Eigen 3.2.9).

The first two benchmarks (Closure/...) are actual micro-benchmarks. In the first one, we have an input vector of 6 doubles, from which we compute a vector of 13 doubles (the first six are the same as the input, and the remaining 7 are derived values computed from those input values). The second version also augments gradients, which means it also has input gradients (Matrix<Matrix<double, 3, 1>, 6, 1>) and outputs augmented gradients (Matrix<Matrix<double, 3, 1>, 13, 1>).
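In terms of types, the two micro-benchmarks look roughly like this (a sketch only; the body of the closure function is just a stand-in, not the actual computation):

#include <Eigen/Dense>

using StateIn  = Eigen::Matrix<double, 6, 1>;
using StateOut = Eigen::Matrix<double, 13, 1>;

// The gradient-augmenting version additionally maps input gradients of
// type GradsIn to augmented output gradients of type GradsOut.
using GradsIn  = Eigen::Matrix<Eigen::Matrix<double, 3, 1>, 6, 1>;
using GradsOut = Eigen::Matrix<Eigen::Matrix<double, 3, 1>, 13, 1>;

StateOut closure(const StateIn& in)
{
  StateOut out;
  out.head<6>() = in;                   // first six entries are copied through
  out.tail<7>().setConstant(in.sum());  // stand-in for the 7 derived quantities
  return out;
}

int main()
{
  const StateIn in = StateIn::Ones();
  return closure(in)(12) > 0.0 ? 0 : 1;
}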

The next set of benchmarks is the residual computation (which performs multiple loops over a mesh with various computations). One (small) part of this is the Closure-stuff from the first two benchmarks.

The final set is the computation/approximation of the derivative, using AutoDiffScalar instead of double at the local level.

For these benchmarks (Res and Jac), the sizes of the vectors & matrices roughly increase for each set of equations (Euler <= NavierStokes < RANSSAneg < ...).

For the first micro-benchmark, Closure/AugmentGF, we seem to be hitting a pathological case in the partial vectorization, as disabling it seems to fix the problem. The input vector contains 6 doubles, the output 13. Remember that this mainly uses individual values or segment<3> access to the momentum sub-vector.
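(To clarify what I mean by disabling: I'm assuming the relevant path is the unaligned / partial packet vectorization; if EIGEN_UNALIGNED_VECTORIZE is indeed the switch controlling it, then it can be turned off globally like this:)

// Assumption: EIGEN_UNALIGNED_VECTORIZE is the switch for the partial
// (unaligned) vectorization path. Define it to 0 before the first Eigen
// include, or pass -DEIGEN_UNALIGNED_VECTORIZE=0 on the compiler command
// line, to disable that path while keeping the rest of the vectorization.
#define EIGEN_UNALIGNED_VECTORIZE 0
#include <Eigen/Dense>

int main() { return 0; }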

The bigger "Res" benchmarks are generally a bit slower (even though not majorly so), but more so with gcc.

The even bigger "Jac" benchmarks do see big slow-downs (performance from 0.7 down to 0.3), which seem to get worse as we include more (and larger) AutoDiffScalars. We didn't see this relative decrease for the otherwise roughly similar "Res" benchmarks, so something seems to be strange there (either in AutoDiffScalar, or in the handling of custom scalar types in general). I'm wondering whether some Eigen internal operations may accidentally (or purposely) create more (temporary) copies of scalars, which for a custom type might be costly...
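If it helps, something along these lines is what I have in mind for checking the copy hypothesis (a rough sketch with a made-up instrumented scalar, not our actual custom type):

#include <atomic>
#include <cstdio>
#include <Eigen/Dense>

// Hypothetical instrumented scalar: behaves like double, but counts copies,
// so one can compare the counter between Eigen 3.2.9 and 3.4-rc1 for the
// same expression.
struct CountedDouble
{
  static std::atomic<long> copies;
  double v;

  CountedDouble(double x = 0.0) : v(x) {}
  CountedDouble(const CountedDouble& o) : v(o.v) { ++copies; }
  CountedDouble& operator=(const CountedDouble& o) { v = o.v; ++copies; return *this; }

  friend CountedDouble operator+(const CountedDouble& a, const CountedDouble& b) { return CountedDouble(a.v + b.v); }
  friend CountedDouble operator*(const CountedDouble& a, const CountedDouble& b) { return CountedDouble(a.v * b.v); }
};
std::atomic<long> CountedDouble::copies{0};

namespace Eigen {
// Minimal NumTraits so the custom scalar can be used inside Eigen matrices.
template<> struct NumTraits<CountedDouble> : NumTraits<double>
{
  typedef CountedDouble Real;
  typedef CountedDouble NonInteger;
  typedef CountedDouble Nested;
  enum { IsComplex = 0, IsInteger = 0, IsSigned = 1,
         RequireInitialization = 1, ReadCost = 1, AddCost = 3, MulCost = 3 };
};
}

int main()
{
  Eigen::Matrix<CountedDouble, 6, 1> a, b;
  a.setConstant(CountedDouble(1.0));
  b.setConstant(CountedDouble(2.0));

  CountedDouble::copies = 0;
  const Eigen::Matrix<CountedDouble, 6, 1> c = a + b;  // expression under test
  std::printf("scalar copies: %ld\n", CountedDouble::copies.load());

  return (c(0).v == 3.0) ? 0 : 1;
}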


Unfortunately, providing minimized, self-contained reproducers exhibiting the same behavior is quite difficult... :(

I will try to work on figuring out a reproducer for the first microbenchmark where partial vectorization has a negative effect.


If anyone has any ideas what could be happening (or things they would like me to try), I'm all ears. We really would like to move to a current Eigen version!


Best regards

Daniel Vollmer

--------------------------
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
German Aerospace Center
Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany

Daniel Vollmer | AS C²A²S²E
www.DLR.de

________________________________________
From: Christoph Hertzberg <chtz@xxxxxxxxxxxxxxxxxxxxxxxx>
Sent: Monday, 19 April 2021 10:14:15
To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: [eigen] Eigen 3.4 release candidate 1!

Hello Eigen users!

We are happy to announce the release of 3.4-rc1. This is a release
candidate which is considered feature complete and will only get bug
fixes until 3.4.0 is released.
This is not yet a stable release!

We encourage everyone, especially those currently working with a 3.3
version, to try upgrading to 3.4-rc1 and to report any regressions with
any compilers and architectures you are using (this includes run-time
errors, compile errors and compiler warnings).
Eigen should work with any C++ standard between C++03 and C++20, with
GCC 4.8 (or later), Clang 3.3 (or later), MSVC 2012 (or later), and
recent versions of CUDA, HIP, SYCL, on any target platform. If enabled
by the compiler, this includes SIMD vectorization for SSE/AVX/AVX512,
AltiVec, MSA, NEON, SVE, and ZVector.

To publish test results, you can run the unit tests as described here:
[https://eigen.tuxfamily.org/index.php?title=Tests]
Please report issues by April 25th, or tell us if you are still in
the process of testing.


The 3.4 release will be the last Eigen version with C++03 support.
Support for the 3.3 branch may soon be discontinued.

The next major version will certainly also stop supporting some older
compiler versions.


Regarding the version numbering:
There was previously a 3.4 branch, which was recently deleted again, as
it had been created prematurely and not properly kept up to date.
Unfortunately, this means that the version number (specified in
Eigen/src/Core/util/Macros.h) had to be reset to 3.3.91 (the final 3.4
release will have version 3.4.0). Version test macros will work
incorrectly with any commit from master between 2021-02-17 and today. We
apologize for any inconvenience this may cause.


Cheers,
Christoph





--
  Dr.-Ing. Christoph Hertzberg

  Visiting address (branch office):
  DFKI GmbH
  Robotics Innovation Center
  Robert-Hooke-Straße 5
  28359 Bremen, Germany

  Postal address (headquarters, Bremen site):
  DFKI GmbH
  Robotics Innovation Center
  Robert-Hooke-Straße 1
  28359 Bremen, Germany

  Tel.:        +49 421 178 45-4021
  Switchboard: +49 421 178 45-0
  E-Mail:      christoph.hertzberg@xxxxxxx

  Further information: http://www.dfki.de/robotik
   -------------------------------------------------------------
   Deutsches Forschungszentrum für Künstliche Intelligenz GmbH
   Trippstadter Straße 122, D-67663 Kaiserslautern, Germany

   Management:
   Prof. Dr. Antonio Krüger

   Chairman of the Supervisory Board:
   Dr. Gabriël Clemens
   Amtsgericht Kaiserslautern, HRB 2313
   -------------------------------------------------------------




