Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!) |


*To*: <eigen@xxxxxxxxxxxxxxxxxxx>
*Subject*: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)
*From*: <Daniel.Vollmer@xxxxxx>
*Date*: Wed, 1 Aug 2018 09:10:31 +0000
*Accept-language*: de-DE, en-US
*Thread-index*: AQHUKW4a0jQxKdgpi069PfwZNC4APA==
*Thread-topic*: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)

Hello everyone,

with the recent release of 3.3.5 I've once again looked at upgrading from our currently used Eigen 3.2 to the current stable branch, but some performance regressions remain. This makes the upgrade a difficult decision, as I'm unable to nail down the exact cause (probably because there isn't a single one) and would prefer not to slow down overall performance.

I've attached a document with performance measurements for different compilers, different Eigen versions, and 3 different test cases for our code (tau, cgns, dg) that stress different areas / sizes. The "vs best" column compares run-time against the overall best run-time; "vs same" compares only against the shortest run-time with the same compiler (so essentially between different Eigen variants with the same compiler).

- Eigen 3.2 version used: 3.2.9 + some backports of improvements to AutoDiffScalar
- Eigen 3.3 version used: 3.3.5

The tests were run on a Xeon E3-1276 v3 (with our code doing multi-threading, and Eigen configured to not use threading of its own). Minimum run-time of 4 runs.

We use Eigen in a CFD code for 3 roughly distinct subject areas:

1) Fixed-size vectors (and some matrices) of doubles, direct access to individual values (with compile-time known indices) or segments, simple linear algebra, few matrix-vector products.
2) Same as 1), but using Eigen::AutoDiffScalar instead of double (building up a Jacobian).
3) Fixed-size matrix-vector products (inside a Block-Jacobi iteration, not using any of Eigen's solvers).

For the different cases:

- tau: Only uses 1), with vectors of sizes 5 and 8, matrices of size 5x5.
- cgns: Uses 1)-3), with vectors of sizes 6 and 13, matrices of size 6x6 (for both 1 and 3).
- dg: Uses 1)-3), with vectors of sizes 5 and 8, matrices of size 5x5 (for 1) and 20x20 (for 3).
The outcomes seem to be:

- clang is generally fastest.
- The performance regression is more pronounced for gcc.
- (Partial) vectorization seems to "hurt" simple direct access (area 1); disabling it improves performance (clang) or at least reduces the impact of Eigen 3.3 (gcc).

If we were only looking at clang, I'd be nearly willing to advocate moving to 3.3 (with default settings), because only a regression for the "tau" case remains.

Unfortunately, I'm at a loss as to how to pin these down any further: attempts at extracting a reduced test case / example that exhibits the same behavior have not been fruitful, and profiling the actual code between Eigen 3.2 and 3.3 does not seem to directly yield actionable information. If anyone has any ideas for things to try, I'm all ears. :)

Either way, thanks for your helpful (and nice to use) library!

Best regards
Daniel Vollmer
--------------------------
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
German Aerospace Center
Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany
Daniel Vollmer | AS C²A²S²E
www.DLR.de

________________________________________
From: Vollmer, Daniel
Sent: Thursday, 28 July 2016 12:46
To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: RE: [eigen] 3.3-beta2 released!

Hi Gael,

> Fixed: https://bitbucket.org/eigen/eigen/commits/e35a38ad89fe/
> With float I get a nearly x2 speedup for the above 5x5 matrix-vector
> products (compared to 3.2), and x1.4 speedup with double.

I tried out this version (ca9bd08) and the results are as follows.

Note: the explicit solver pretty much only does residual evaluations, whereas the implicit solver does a residual evaluation, followed by a Jacobian computation (using AutoDiffScalar) and then a block-based Gauss-Jacobi iteration, where the blocks are 5x5 matrices, to approximately solve a linear system based on the Jacobian and the residual.
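For anyone wanting to reproduce the "vectorization disabled" configurations: Eigen's behavior is controlled by preprocessor macros defined before the first Eigen include. The sketch below assumes the Eigen 3.3 macro names (`EIGEN_UNALIGNED_VECTORIZE` is presumably what the "UNALIGNED_VEC=0" runs refer to); check the Eigen preprocessor-directives documentation for your exact version:

```cpp
// Define before including any Eigen header (e.g. via -D on the compile line).

// Disable only the unaligned/partial vectorization paths that Eigen 3.3
// added for small fixed-size objects:
#define EIGEN_UNALIGNED_VECTORIZE 0

// Or disable Eigen's explicit vectorization entirely (closer to how 3.2
// treated these odd-sized fixed matrices):
// #define EIGEN_DONT_VECTORIZE

#include <Eigen/Dense>
```

Toggling these one at a time is a cheap way to separate "vectorization overhead" regressions from other codegen differences between 3.2 and 3.3.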
Explicit solver:
----------------
eigen-3.3-ca9bd08                  10.9s  =>  9% slower
eigen-3.3-beta2                    11.1s  => 11% slower
eigen-3.3-beta2 UNALIGNED_VEC=0    10.0s  =>  0% slower
eigen-3.2.9                        10.0s  => baseline

Implicit solver:
----------------
eigen-3.3-ca9bd08                  34.2s  =>  6% faster
eigen-3.3-beta2                    37.5s  =>  3% slower
eigen-3.3-beta2 UNALIGNED_VEC=0    38.2s  =>  5% slower
eigen-3.2.9                        36.5s  => baseline

So the change definitely helps for the implicit solver (which has lots of 5x5 by 5x1 double multiplies), but for the explicit solver the overhead of unaligned vectorization doesn't pay off. Maybe the use of 3D vectors (which are used for geometric normals and coordinates) is problematic, because they are such a borderline case for vectorization?

What I don't quite understand is the difference between 3.2.9 (which doesn't vectorize the given matrix sizes) and 3.3-beta2 without vectorization: something in 3.3 is slower under those conditions, but maybe it's not the matrix-vector multiplies, as it could also be AutoDiffScalar being slower.

Best regards
Daniel Vollmer

**Attachment:
2018-07 Eigen Compiler Perf.pdf**

**Follow-Ups**:
- **Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)**, *From:* Marc Glisse


Mail converted by MHonArc 2.6.19+ | http://listengine.tuxfamily.org/ |