RE: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)



Hello Christoph,

unfortunately I have to revise the additional measurements I made for you earlier: it seems I was always using Eigen-3.2.9 during those runs (which explains why they are nearly identical to the original measurements for Eigen-3.2). Sincere apologies. "Wer misst, misst Mist." ("He who measures, measures rubbish.")

Attached is a table where I've remeasured those runs. The original table was OK, except for clang's "no unaligned vectorization" entry (where unaligned vectorization was actually still enabled, because I had miscopied the name of the define).

Now it seems that the performance difference I'm seeing is fairly independent of AVX usage (on both the compiler and the Eigen side). Enabling AVX seems strictly better than not using it for Eigen 3.3 (no difference for the "tau" case, an improvement for "cgns" and "dg"). Still, there is a difference between Eigen 3.2 and 3.3; using AVX recoups some of it, but (for our peculiar usage) Eigen 3.3 is still a net loss, particularly when using gcc.

> just to clear my confusion, by "partial vectorization" you mean
> "unaligned vectorization"?

Yes, that's what I meant. Sorry for the confusion. I've renamed the entries in the table accordingly.


I don't know whether you noticed, but I did extract a partial example, eigen_bench2.cpp, that seems to reproduce part of the phenomenon (at least some of the behavior of the "tau" case); I attached it to an earlier email from today (replying to Marc Glisse).


> Could you (manually) disable the AVX-detection in Eigen/Core, but
> compile with AVX enabled?

For this I removed EIGEN_VECTORIZE_AVX from the #ifdef __AVX__ block in Eigen/Core, and also had to change SSE/PacketMath.h, as that checked for __FMA__ and subsequently used intrinsics whose header wasn't included.

This didn't make a (measurable) difference compared to just -msse4.2 -mtune=native.
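
For anyone repeating this: a quick way to double-check what Eigen actually ended up enabling (and which Eigen version is really being included -- cf. my mix-up above) is a tiny test program along these lines (just a sketch, not part of our benchmark):

    #include <iostream>
    #include <Eigen/Core>

    int main()
    {
      // Which Eigen release the include path actually resolves to.
      std::cout << "Eigen " << EIGEN_WORLD_VERSION << "."
                << EIGEN_MAJOR_VERSION << "." << EIGEN_MINOR_VERSION << "\n";
      // Which SIMD instruction sets Eigen's vectorization layer enabled,
      // i.e. whether removing EIGEN_VECTORIZE_AVX really turned AVX off.
      std::cout << "SIMD: " << Eigen::SimdInstructionSetsInUse() << "\n";
      return 0;
    }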


Best regards

Daniel Vollmer

--------------------------
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
German Aerospace Center
Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany

Daniel Vollmer | AS C²A²S²E
www.DLR.de

________________________________________
From: Christoph Hertzberg [chtz@xxxxxxxxxxxxxxxxxxxxxxxx]
Sent: Wednesday, 1 August 2018 15:04
To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)

Hi,

just to clear my confusion, by "partial vectorization" you mean
"unaligned vectorization"?

I guess Eigen's AVX usage is a likely issue in some situations then --
but it is really hard to fix without anything concrete.
Could you (manually) disable the AVX-detection in Eigen/Core, but
compile with AVX enabled?
And does enabling/disabling AVX with Eigen 3.2 make a difference? (That
version had no AVX support, but there may be issues with switching
between AVX and non-AVX instructions.)

You could also try to make a diff between the assembly generated by gcc
and clang. This may involve cleaning up the assembly "somehow", or
actually disassembling the binary. Alternatively, just manually compare
some likely candidates -- you can mark them using
     EIGEN_ASM_COMMENT("some label which is easy to find");
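
For example, something along these lines (just a sketch; wrap whatever
kernel you suspect -- the 5x5 sizes here are only an assumption):

     #include <Eigen/Core>

     // The ASM comments merely emit a marker into the assembler output
     // (compile with -S and grep for it); they don't change the semantics
     // of the code, though they can slightly inhibit reordering around them.
     Eigen::Matrix<double, 5, 1> apply(const Eigen::Matrix<double, 5, 5> &A,
                                       const Eigen::Matrix<double, 5, 1> &x)
     {
       EIGEN_ASM_COMMENT("begin 5x5 matvec");
       Eigen::Matrix<double, 5, 1> y = A * x;
       EIGEN_ASM_COMMENT("end 5x5 matvec");
       return y;
     }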

Christoph



On 2018-08-01 14:14, Daniel.Vollmer@xxxxxx wrote:
> Hi Christoph,
>
> (I cc'd the mailing list again.)
>
> The compilation units are rather big, so directly comparing the resulting code is difficult.
>
> I've run the test-cases for gcc-8.1 and clang-3.8 with -msse4.2 -mtune=native to disable AVX.
>
> This improves the situation for gcc (except for the "tau" test-cases, where it's only "close") and results in the same performance as Eigen-3.2. Enabling or disabling partial (unaligned) vectorization doesn't seem to make a difference with that setting anymore.
>
> For clang, disabling AVX is a slight win for "tau" vs. default settings, but a slight loss for "cgns" (where the matrix-vector product and AD play a bigger role, see areas 2 & 3).
>
>
> Best regards
>
> Daniel Vollmer
>
>
> ________________________________________
> From: Christoph Hertzberg [chtz@xxxxxxxxxxxxxxxxxxxxxxxx]
> Sent: Wednesday, 1 August 2018 12:34
> To: Vollmer, Daniel
> Subject: Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)
>
> Hi,
>
> could you also try compiling with `-DEIGEN_UNALIGNED_VECTORIZE=0` and
> with AVX disabled, e.g., using `-msse4.2 -mtune=native` -- alternatively
> also by commenting out the corresponding detection inside Eigen/Core
> (it would actually be nice if that were controllable by command-line
> options).
> And of course, any combination of these options would be interesting,
> if it makes a difference.
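>
> Concretely, the combinations would look roughly like this (the source
> file, include path, and remaining flags are just placeholders for
> whatever your build normally uses):
>
>     # default: AVX (e.g. via -march=native), unaligned vectorization on
>     g++ -O3 -march=native -I/path/to/eigen bench.cpp -o bench
>     # AVX off, unaligned vectorization still on
>     g++ -O3 -msse4.2 -mtune=native -I/path/to/eigen bench.cpp -o bench
>     # AVX off and unaligned vectorization off
>     g++ -O3 -msse4.2 -mtune=native -DEIGEN_UNALIGNED_VECTORIZE=0 \
>         -I/path/to/eigen bench.cpp -o bench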
>
> If you have sufficiently small compilation units, it might also be worth
> having a look at the differences in the generated assembler code --
> but that is usually more productive once you have singled out a reduced
> test-case.
>
>
> Cheers,
> Christoph
>
>
>
>
>
> On 2018-08-01 11:10, Daniel.Vollmer@xxxxxx wrote:
>> Hello everyone,
>>
>> with the recent release of 3.3.5 I've once again looked at upgrading from our currently used Eigen 3.2 to the current stable branch. Some performance regressions remain, however, which makes this a difficult decision: I'm unable to nail down the exact cause (probably because it's not a single one), and I would prefer not to slow down the overall performance.
>>
>> I've attached a document with some performance measurements for different compilers, different Eigen versions, and 3 different test-cases for our code (tau, cgns, dg) that stress different areas / sizes.
>> The "vs best" column compares run-time against the overall best run-time, "vs same" only relative to shortest run-time with the same compiler (so essentially between different Eigen variants with the same compiler).
>> Eigen 3.2 version used was 3.2.9 + some backports of improvements to AutoDiffScalar
>> Eigen 3.3 version used was 3.3.5.
>> The tests were run on a Xeon E3-1276 v3 (with our code doing multi-threading, and Eigen configured to not use threading of its own). Minimum run-time of 4 runs.
>>
>> We use Eigen in a CFD code for 3 roughly distinct subject areas:
>> 1) fixed-size vectors (and some matrices) of doubles, direct access to individual values (with compile-time known indices) or segments, simple linear algebra, few matrix-vector products.
>> 2) same as 1, but using Eigen::AutoDiffScalar instead of double (building up a Jacobian)
>> 3) Fixed-size matrix-vector products (inside of a Block-Jacobi iteration, not using any of Eigen's solvers)
>>
>> For the different cases:
>> tau: Only uses 1), with vectors of sizes 5 and 8, matrices of size 5x5
>> cgns: Uses 1)-3), with vectors of sizes 6 and 13, matrices of size 6x6 (for both 1 and 3).
>> dg: Uses 1)-3), with vectors of sizes 5 and 8, matrices of size 5x5 (for 1) and 20x20 (for 3).
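>>
>> To make those areas a bit more concrete, the code is roughly of the
>> following shape (a stripped-down sketch with made-up expressions and
>> the sizes of the "tau" case; the real code is obviously larger):
>>
>>     #include <Eigen/Dense>
>>     #include <unsupported/Eigen/AutoDiff>
>>
>>     typedef Eigen::Matrix<double, 5, 1> Vec5;
>>     typedef Eigen::Matrix<double, 5, 5> Mat5;
>>     typedef Eigen::AutoDiffScalar<Vec5> AD;  // double value + 5 derivatives
>>
>>     // Area 1: fixed-size double vectors, element / segment access.
>>     Vec5 flux(const Vec5 &q)
>>     {
>>       Vec5 f;
>>       f(0) = q(1);                                 // compile-time indices
>>       f.segment<3>(1) = (q(1) / q(0)) * q.segment<3>(1);
>>       f(4) = q(4) * q(1) / q(0);
>>       return f;
>>     }
>>
>>     // Area 2: the same kind of expressions on AutoDiffScalar, used to
>>     // assemble a 5x5 Jacobian block.
>>     Mat5 fluxJacobian(const Vec5 &q)
>>     {
>>       Eigen::Matrix<AD, 5, 1> qd;
>>       for (int i = 0; i < 5; ++i)
>>         qd(i) = AD(q(i), Vec5::Unit(i));           // seed dq_i/dq_i = 1
>>       Mat5 J;
>>       for (int i = 0; i < 5; ++i)
>>       {
>>         AD fi = qd(i) * qd(1) / qd(0);             // made-up expression
>>         J.row(i) = fi.derivatives().transpose();
>>       }
>>       return J;
>>     }
>>
>>     // Area 3: fixed-size matrix-vector products (block-Jacobi sweep).
>>     Vec5 blockApply(const Mat5 &Dinv, const Vec5 &r)
>>     {
>>       return Dinv * r;
>>     }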
>>
>> The outcomes seem to be:
>> - clang is generally fastest
>> - the performance regression is more pronounced for gcc
>> - (partial) vectorization seems to "hurt" simple direct access (area 1); disabling it improves performance (clang) or at least reduces the impact of Eigen 3.3 (gcc)
>>
>> If we were only looking at clang, I'd be nearly willing to advocate moving to 3.3 (with default settings), because only a regression for the "tau" case remains.
>>
>> Unfortunately, I'm at a loss as to how to pinpoint these any further. Attempts at extracting a reduced test-case / example that exhibits the same behavior have not been fruitful, and profiling the actual code with Eigen 3.2 vs. 3.3 does not seem to yield directly actionable information.
>>
>> If anyone has any ideas for things to try, I'm all ears. :)
>>
>> Either way, thanks for your helpful (and nice to use) library!
>>
>>
>> Best regards
>>
>> Daniel Vollmer
>>
>>
>> ________________________________________
>> From: Vollmer, Daniel
>> Sent: Thursday, 28 July 2016 12:46
>> To: eigen@xxxxxxxxxxxxxxxxxxx
>> Subject: RE: [eigen] 3.3-beta2 released!
>>
>> Hi Gael,
>>
>>> Fixed: https://bitbucket.org/eigen/eigen/commits/e35a38ad89fe/
>>> With float I get a nearly x2 speedup for the above 5x5 matrix-vector
>>> products (compared to 3.2), and x1.4 speedup with double.
>>
>> I tried out this version (ca9bd08) and the results are as follows:
>> Note: the explicit solver pretty much only does residual evaluations,
>> whereas the implicit solver does a residual evaluation, followed by a
>> Jacobian computation (using AutoDiffScalar) and then a block-based
>> Gauss-Jacobi iteration with 5x5 blocks to approximately solve a
>> linear system based on the Jacobian and the residual.
>>
>> Explicit solver:
>> ----------------
>> eigen-3.3-ca9bd08                 10.9s => 09% slower
>> eigen-3.3-beta2                   11.1s => 11% slower
>> eigen-3.3-beta2 UNALIGNED_VEC=0   10.0s => 00% slower
>> eigen-3.2.9                       10.0s => baseline
>>
>> Implicit solver:
>> ----------------
>> eigen-3.3-ca9bd08                 34.2s => 06% faster
>> eigen-3.3-beta2                   37.5s => 03% slower
>> eigen-3.3-beta2 UNALIGNED_VEC=0   38.2s => 05% slower
>> eigen-3.2.9                       36.5s => baseline
>>
>> So the change definitely helps for the implicit solver (which has lots
>> of 5x5 by 5x1 double multiplies), but for the explicit solver the
>> overhead of unaligned vectorization doesn't pay off. Maybe the use of
>> 3D vectors (which are used for geometric normals and coordinates) is
>> problematic because it's such a borderline case for vectorization?
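>>
>> For reference, the inner kernel of that block Gauss-Jacobi iteration is
>> essentially of the following shape (simplified, with a hypothetical
>> block storage layout -- the point is just the large number of 5x5 by
>> 5x1 products):
>>
>>     #include <utility>
>>     #include <vector>
>>     #include <Eigen/Dense>
>>
>>     typedef Eigen::Matrix<double, 5, 1> Vec5;
>>     typedef Eigen::Matrix<double, 5, 5> Mat5;
>>
>>     // One sweep: x_i <- Dinv_i * (b_i - sum_j A_ij * x_j), with every
>>     // block a fixed-size 5x5 matrix.
>>     void sweep(const std::vector<Mat5> &Dinv,     // inverted diagonal blocks
>>                const std::vector<Mat5> &offDiag,  // off-diagonal blocks A_ij
>>                const std::vector<std::pair<int, int> > &coupling, // (i, j)
>>                const std::vector<Vec5> &b,
>>                std::vector<Vec5> &x)
>>     {
>>       std::vector<Vec5> rhs(b);                   // start from b_i
>>       for (std::size_t k = 0; k < coupling.size(); ++k)
>>         rhs[coupling[k].first] -= offDiag[k] * x[coupling[k].second];
>>       for (std::size_t i = 0; i < x.size(); ++i)
>>         x[i] = Dinv[i] * rhs[i];                  // 5x5 * 5x1 multiply
>>     }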
>>
>> What I don't quite understand is the difference between 3.2.9 (which
>> doesn't vectorize the given matrix sizes) and 3.3-beta2 without
>> vectorization: something in 3.3 is slower under those conditions, but
>> maybe it's not the matrix-vector multiplies; it could also be
>> AutoDiffScalar that got slower.
>>
>>
>> Best regards
>>
>> Daniel Vollmer
>>
>>
>
>


--
  Dr.-Ing. Christoph Hertzberg

  Visiting address (branch office):
  DFKI GmbH
  Robotics Innovation Center
  Robert-Hooke-Straße 5
  28359 Bremen, Germany

  Postal address (head office, Bremen site):
  DFKI GmbH
  Robotics Innovation Center
  Robert-Hooke-Straße 1
  28359 Bremen, Germany

  Tel.:        +49 421 178 45-4021
  Switchboard: +49 421 178 45-0
  E-Mail:      christoph.hertzberg@xxxxxxx

  Further information: http://www.dfki.de/robotik
  -----------------------------------------------------------------------
  Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
  Registered office: Trippstadter Straße 122, D-67663 Kaiserslautern
  Management: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
  (Chairman), Dr. Walter Olthoff
  Chairman of the supervisory board: Prof. Dr. h.c. Hans A. Aukes
  Amtsgericht Kaiserslautern, HRB 2313
  Registered seat: Kaiserslautern (HRB 2313)
  VAT ID no.:  DE 148646973
  Tax number:  19/672/50006
  -----------------------------------------------------------------------

Attachment: 2018-07 Flucs Eigen Perf v3.pdf


