Re: [eigen] Eigen 3.4 release candidate 1!
Hello all,

I've compiled our CFD code with Eigen 3.4-rc1 and run a few benchmarks relative to Eigen 3.2.9. Unfortunately, the performance regression introduced with Eigen 3.3 is still present (which is why we're still hanging on to the 3.2 branch).

I've compiled our code with Eigen 3.2.9 as well as 3.4-rc1, with both clang and gcc, on a macOS x86-64 system (Big Sur) with the following CPU feature flags:

machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 FDPEO SMEP BMI2 ERMS INVPCID PQM FPU_CSDS MPX PQE AVX512F AVX512DQ RDSEED ADX SMAP CLFSOPT CLWB IPT AVX512CD AVX512BW AVX512VL PKU AVX512VNNI MDCLEAR IBRS STIBP L1DF ACAPMSR SSBD

Compilation settings were:

-std=c++17 -fopenmp -Ofast -march=native -fno-finite-math-only -DEIGEN_DONT_PARALLELIZE=1

(We split our work up among threads ourselves, but the benchmarks are single-threaded anyway.) The clang version was Homebrew's llvm (stable 11.1.0, bottled); gcc was Homebrew's gcc (stable 10.2.0, bottled).

We're using Eigen for small, fixed-size vectors and occasionally matrices of doubles (of varying lengths, e.g. 5, 6, 7, 8, 13, ...), mainly accessing individual elements or fixed-size (length 3) segments at compile-time-known offsets. Occasionally we also have matrix-vector products of these, but they probably play a smaller role. These are the "Res" (residual computation) benchmarks, where we do this over a whole mesh with multiple loops, gradient computations, etc.

Then we do the same thing, but using Eigen's (unsupported) AutoDiffScalar with a fairly large fixed maximum size (e.g. when our PDE has 6 state variables, the AutoDiffScalar has an Eigen::Dynamic derivative vector with a max size of 6 * 24 + 1). These types are quite large, but this is still faster than fully dynamic heap allocation. These are the "Jac" (Jacobian computation) benchmarks, which otherwise largely mimic the "Res" benchmarks.

https://www.maven.de/stuff/coda_benchmarks_eigen.pdf

In these graphs everything is relative to the clang + Eigen 3.2.9 baseline being 1.0. Higher numbers are faster (2.0 would be twice as fast as the clang + Eigen 3.2.9 build).

The first two benchmarks (Closure/...) are actual micro-benchmarks. In the first one, we have an input vector with 6 doubles, and from that compute a vector with 13 doubles (the first six are the same as the input, and the remaining 7 are derived values computed from those input values). The second version also augments gradients, which means it also has input gradients (Matrix<Matrix<double, 3, 1>, 6, 1>) and outputs augmented gradients (Matrix<Matrix<double, 3, 1>, 13, 1>).

The next set of benchmarks is the residual computation (which performs multiple loops over a mesh with various computations); one (small) part of this is the closure computation from the first two benchmarks. The final set is the computation / approximation of the derivative using AutoDiffScalar instead of double at a local level. For these benchmarks (Res and Jac), the sizes of the vectors and matrices roughly increase for each set of equations (Euler <= NavierStokes < RANSSAneg < ...).
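For concreteness, here is a minimal sketch of the kinds of types described above; this is not our actual code, and the names, the 6-variable case, and the segment offset are only illustrative:

#include <Eigen/Core>
#include <unsupported/Eigen/AutoDiff>

// Derivative vector for the 6-state-variable case: dynamically sized,
// but with a compile-time maximum of 6 * 24 + 1 = 145 entries, so no
// heap allocation is needed.
using DerivType = Eigen::Matrix<double, Eigen::Dynamic, 1, 0, 6 * 24 + 1, 1>;
using ADScalar  = Eigen::AutoDiffScalar<DerivType>;

using State   = Eigen::Matrix<double, 6, 1>;    // input: 6 doubles
using Closure = Eigen::Matrix<double, 13, 1>;   // output: 13 doubles
using StateAD = Eigen::Matrix<ADScalar, 6, 1>;  // same, with AD scalars ("Jac")

// Gradient augmentation: one 3-vector per state / closure entry.
using GradState   = Eigen::Matrix<Eigen::Matrix<double, 3, 1>, 6, 1>;
using GradClosure = Eigen::Matrix<Eigen::Matrix<double, 3, 1>, 13, 1>;

// Typical access pattern: individual coefficients and fixed-size segments
// at compile-time-known offsets, e.g. the momentum sub-vector.
inline Eigen::Matrix<double, 3, 1> momentum(const State& u)
{
  return u.segment<3>(1);  // offset 1 is only an example
}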
For the first micro-benchmark, Closure/AugmentGF, we seem to be hitting a pathological case in the partial vectorization, as disabling it seems to fix the problem. The input vector contains 6 doubles, the output 13. Remember that this mainly uses individual values or segment<3> access to the momentum sub-vector.

The bigger "Res" benchmarks are generally a bit slower (though not majorly so), but more so with gcc. The even bigger "Jac" benchmarks do see big slow-downs (performance 0.7 down to 0.3), which seem to get worse as we include more (and larger) AutoDiffScalars. We didn't see this relative decrease for the otherwise roughly similar "Res" benchmarks, so something seems to be strange there (either in AutoDiffScalar, or in the handling of custom scalar types itself). I'm wondering whether some Eigen-internal operations may accidentally (or purposely) create more (temporary) copies of scalars, which for a custom type might be costly...

Unfortunately, providing minimized, self-contained reproducers exhibiting the same behavior is quite difficult... :( I will try to work on figuring out a reproducer for the first micro-benchmark, where partial vectorization has a negative effect.

If anyone has any ideas what could be happening (or things they would like me to try), I'm all ears. We really would like to move to a current Eigen version!

Best regards
Daniel Vollmer

--------------------------
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
German Aerospace Center
Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany

Daniel Vollmer | AS C²A²S²E
www.DLR.de

________________________________________
From: Christoph Hertzberg <chtz@xxxxxxxxxxxxxxxxxxxxxxxx>
Sent: Monday, 19 April 2021 10:14:15
To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: [eigen] Eigen 3.4 release candidate 1!

Hello Eigen users!

We are happy to announce the release of 3.4-rc1. This is a release candidate which is considered feature complete and will only get bug fixes until 3.4.0 is released. This is not yet a stable release!

We encourage everyone, especially anyone currently working with a 3.3 version, to try upgrading to 3.4-rc1 and report any regressions with any compilers and architectures you are using (this includes run-time errors, compile errors, and compiler warnings). Eigen should work with any C++ standard between C++03 and C++20, with GCC 4.8 (or later), Clang 3.3 (or later), MSVC 2012 (or later), and recent versions of CUDA, HIP, and SYCL, on any target platform. If enabled by the compiler, this includes SIMD vectorization for SSE/AVX/AVX512, AltiVec, MSA, NEON, SVE, and ZVector.

To publish test results, you can run the unit tests as described here:
[https://eigen.tuxfamily.org/index.php?title=Tests]

Please report issues until April 25th, or tell us if you are still in the process of testing.

The 3.4 release will be the last Eigen version with C++03 support. Support of the 3.3 branch may soon be stopped. The next major version will certainly also stop supporting some older compiler versions.

Regarding the version numbering: there was previously a 3.4 branch which was recently deleted again, as it was created prematurely and not properly kept up to date. Unfortunately, this means that the version number (specified in Eigen/src/Core/util/Macros.h) had to be reset to 3.3.91 (the final 3.4 release will have version 3.4.0). Version test macros will work incorrectly with any commit from master between 2021-02-17 and today. We apologize for any inconvenience this may cause.

Cheers,
Christoph

--
Dr.-Ing. Christoph Hertzberg

Visiting address (branch office):
DFKI GmbH
Robotics Innovation Center
Robert-Hooke-Straße 5
28359 Bremen, Germany

Postal address (head office, Bremen site):
DFKI GmbH
Robotics Innovation Center
Robert-Hooke-Straße 1
28359 Bremen, Germany

Tel.: +49 421 178 45-4021
Switchboard: +49 421 178 45-0
E-Mail: christoph.hertzberg@xxxxxxx
Further information: http://www.dfki.de/robotik

-------------------------------------------------------------
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH
Trippstadter Straße 122, D-67663 Kaiserslautern, Germany
Management: Prof. Dr. Antonio Krüger
Chairman of the Supervisory Board: Dr. Gabriël Clemens
Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------
Attachment: coda_benchmarks_eigen.pdf