Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)

Hello Gael,

If I only use SSE2 (that is, without -march=native, on a Haswell Xeon with AVX & FMA), I also see no difference in the benchmark. The same is true for "-msse4.2".

But if I use "-march=native" (which then enables AVX and FMA for Eigen-3.3) for *this particular example*, I see this ~13% slowdown. Can you confirm this?

For reference, the compilation command used was

g++ eigen_bench2.cpp -std=c++11 -Ofast -fno-finite-math-only -DNDEBUG -march=native -I eigen-3.2.9 / 3.3.5
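
A quick way to check what "-march=native" actually switches on is to look at the feature macros the compiler predefines. The following is a minimal sketch, not part of the benchmark: the function name enabled_simd is made up for illustration, and it assumes GCC or Clang, which define __SSE2__, __SSE4_2__, __AVX__, __AVX2__ and __FMA__ when the corresponding instruction sets are enabled. To my understanding, Eigen selects its vectorization paths based on macros of this kind.

```cpp
#include <string>

// Hypothetical helper: lists the SIMD feature macros that the compiler
// predefines for the flags this translation unit was built with.
// Assumes GCC/Clang macro names; other compilers may use different ones.
std::string enabled_simd() {
    std::string s;
#ifdef __SSE2__
    s += "SSE2 ";
#endif
#ifdef __SSE4_2__
    s += "SSE4.2 ";
#endif
#ifdef __AVX__
    s += "AVX ";
#endif
#ifdef __AVX2__
    s += "AVX2 ";
#endif
#ifdef __FMA__
    s += "FMA ";
#endif
    if (s.empty()) s = "none";  // no x86 SIMD macro was defined
    return s;
}
```

Printing the result of enabled_simd() from a build with and a build without -march=native should show whether AVX and FMA are in play for each binary.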

I am neither an x86 assembly nor a vectorization expert, so take the following with a grain of salt.

I have attached the generated assembly for the "hot loop". The first 49 lines are nearly the same, except that eigen-3.2 uses %rbp-relative addressing and 3.3 uses %rsp (with different offsets).

The remainder differs more. As far as I can see, Eigen-3.3 doesn't use the 256-bit AVX (ymm) registers, but it does use several packed-double instructions (the Eigen 3.2 assembly doesn't seem to use any). Even so, the Eigen-3.3 sequence seems to be (slightly) slower overall.

eigen-3.2:

vmovsd -168(%rbp), %xmm3
vxorpd %xmm6, %xmm6, %xmm6
vmovsd .LC4(%rip), %xmm2
vmovsd -136(%rbp), %xmm8
vcvtsi2sd %eax, %xmm6, %xmm6
vfmadd132sd .LC3(%rip), %xmm2, %xmm6
vmulsd %xmm3, %xmm3, %xmm0
vmovsd -160(%rbp), %xmm2
vmovsd -176(%rbp), %xmm4
vmulsd %xmm8, %xmm3, %xmm1
vmovsd -144(%rbp), %xmm9
vmovsd -184(%rbp), %xmm5
vmovsd -152(%rbp), %xmm7
vfmadd231sd %xmm2, %xmm2, %xmm0
vfmadd231sd %xmm6, %xmm2, %xmm1
vfmadd231sd %xmm4, %xmm4, %xmm0
vmulsd .LC5(%rip), %xmm0, %xmm0
vfmadd231sd %xmm4, %xmm9, %xmm1
vdivsd %xmm5, %xmm0, %xmm0
vdivsd %xmm5, %xmm1, %xmm1
vsubsd %xmm0, %xmm7, %xmm0
vmulsd .LC6(%rip), %xmm0, %xmm0
vfmadd213sd -104(%rbp), %xmm1, %xmm4
vfmadd213sd -96(%rbp), %xmm1, %xmm3
vfmadd213sd -88(%rbp), %xmm1, %xmm2
vfmadd213sd -112(%rbp), %xmm1, %xmm5
vfmadd231sd %xmm9, %xmm0, %xmm4
vfmadd231sd %xmm8, %xmm0, %xmm3
vfmadd231sd %xmm6, %xmm0, %xmm2
vaddsd %xmm7, %xmm0, %xmm0
vmovsd %xmm5, -112(%rbp)
vfmadd213sd -80(%rbp), %xmm0, %xmm1
vmovsd %xmm4, -104(%rbp)
vmovsd %xmm3, -96(%rbp)
vmovsd %xmm2, -88(%rbp)
vmovsd %xmm1, -80(%rbp)
cmpl %ebx, %r13d
jl .L91

eigen-3.3:

vmovupd 120(%rsp), %xmm7
vmovapd 32(%rsp), %xmm8
vxorpd %xmm5, %xmm5, %xmm5
vcvtsi2sd %eax, %xmm5, %xmm5
vmovsd .LC4(%rip), %xmm4
vfmadd132sd .LC3(%rip), %xmm4, %xmm5
vmulpd %xmm7, %xmm7, %xmm1
vmovsd 16(%rsp), %xmm3
vmovsd 24(%rsp), %xmm4
vmovsd 8(%rsp), %xmm6
vunpckhpd %xmm1, %xmm1, %xmm2
vaddsd %xmm2, %xmm1, %xmm2
vmulpd %xmm8, %xmm7, %xmm1
vfmadd231sd %xmm3, %xmm3, %xmm2
vmulsd .LC5(%rip), %xmm2, %xmm2
vunpckhpd %xmm1, %xmm1, %xmm0
vaddsd %xmm0, %xmm1, %xmm0
vdivsd %xmm4, %xmm2, %xmm2
vfmadd231sd %xmm5, %xmm3, %xmm0
vdivsd %xmm4, %xmm0, %xmm0
vsubsd %xmm2, %xmm6, %xmm2
vmulsd .LC6(%rip), %xmm2, %xmm2
vfmadd213sd 88(%rsp), %xmm0, %xmm3
vfmadd213sd 64(%rsp), %xmm0, %xmm4
vmovddup %xmm0, %xmm1
vfmadd213pd 72(%rsp), %xmm1, %xmm7
vmovddup %xmm2, %xmm1
vfmadd231sd %xmm5, %xmm2, %xmm3
vaddsd %xmm6, %xmm2, %xmm2
vmovsd %xmm4, 64(%rsp)
vfmadd213sd 96(%rsp), %xmm2, %xmm0
vfmadd231pd %xmm1, %xmm8, %xmm7
vmovsd %xmm3, 88(%rsp)
vmovsd %xmm0, 96(%rsp)
vmovups %xmm7, 72(%rsp)
cmpl %ebx, %r12d
jl .L91
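
To make the difference between the two listings more concrete: both seem to compute, among other things, the squared norm of a three-component vector. The sketch below is only my reading of the assembly, not code from Eigen or from the benchmark. The function names are hypothetical, and the packed variant uses SSE2 intrinsics (with a scalar fallback on targets without SSE2).

```cpp
#include <cmath>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Scalar form, in the spirit of the eigen-3.2 listing:
// one multiply followed by two fused multiply-adds.
double sqnorm3_scalar(const double* v) {
    return std::fma(v[2], v[2], std::fma(v[1], v[1], v[0] * v[0]));
}

// Partially packed form, in the spirit of the eigen-3.3 listing:
// the first two components go through a 2-lane packet, the third stays scalar.
double sqnorm3_packed(const double* v) {
#if defined(__SSE2__)
    __m128d p  = _mm_loadu_pd(v);                    // unaligned load of v[0], v[1] (cf. vmovupd)
    __m128d sq = _mm_mul_pd(p, p);                   // square both lanes (cf. vmulpd)
    __m128d hi = _mm_unpackhi_pd(sq, sq);            // copy the upper lane down (cf. vunpckhpd)
    double s   = _mm_cvtsd_f64(_mm_add_sd(sq, hi));  // lower + upper lane (cf. vaddsd)
    return std::fma(v[2], v[2], s);                  // add the third component (cf. vfmadd231sd)
#else
    return sqnorm3_scalar(v);                        // fallback: no SSE2 available
#endif
}
```

In this reading, the packed form replaces one scalar multiply and one FMA with a packed multiply, a shuffle and an add, so the instruction count goes up rather than down for a vector this small. Whether that explains the measured slowdown is a guess and would need to be verified.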

If there's anything else I could try to pinpoint the causes, I'm all ears. :)

Best regards

Daniel Vollmer

--------------------------
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
German Aerospace Center
Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany

Daniel Vollmer | AS C²A²S²E
www.DLR.de

From: Gael Guennebaud [gael.guennebaud@xxxxxxxxx]
Sent: Thursday, 2 August 2018 0:10
To: eigen
Subject: Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)

Hi,

I tried your little benchmark with gcc 7 and got no difference at all (-O3 -DNDEBUG):

// 3.2: 7.75s
// 3.3: 7.68s
// 3.2 -march=native: 7.46s
// 3.3 -march=native: 7.48s

I ran each test 4 times and kept the best of each; the variations were about 0.2s.
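
The "best of N runs" measurement described above can be sketched roughly as follows. This is only an illustration of the approach, not the procedure that was actually used: the helper name best_of is made up, and it times a callable with std::chrono::steady_clock inside the process rather than timing the whole executable.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: calls `work` `runs` times and returns the
// shortest wall-clock duration in seconds. Taking the minimum filters
// out runs that were disturbed by other activity on the machine.
template <typename F>
double best_of(int runs, F&& work) {
    assert(runs > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}
```

With run-to-run variations of about 0.2s on a run of roughly 7.5s, differences of a few percent between two builds are within the noise, so more repetitions may be needed before drawing conclusions from small gaps.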

gael

On Wed, Aug 1, 2018 at 1:28 PM <Daniel.Vollmer@xxxxxx> wrote:

Hi,

Extracting the relevant code is rather difficult, unfortunately.

Attached you will find an extracted subset, which may point to one problematic area with partial vectorization: for this example, Eigen 3.3 is ~13% slower (unless I'm measuring incorrectly).

If I turn off partial vectorization, the run-time is the same. This is not completely representative (obviously), because in my "complete" runs disabling partial vectorization only decreases the run-time difference between 3.3 and 3.2, but does not eliminate it.

Best regards

Daniel Vollmer


________________________________________
From: Marc Glisse [marc.glisse@xxxxxxxx]
Sent: Wednesday, 1 August 2018 12:06
To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)

On Wed, 1 Aug 2018, Daniel.Vollmer@xxxxxx wrote:

> I've attached a document with some performance measurements for
> different compilers, different Eigen versions, and 3 different
> test-cases for our code (tau, cgns, dg) that stress different areas /
> sizes.

Would it be hard to create small testcases that you could share and that
would still show a similar performance pattern? I assume that the easier
it is for the Eigen devs to reproduce, the more likely they are to investigate.

--
Marc Glisse