Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)

Hello Gael,

If I only use SSE2 (that is, compiling without -march=native on a Haswell Xeon that supports AVX and FMA), then I also see no difference in the benchmark. The same is true with "-msse42".

But if I use "-march=native" (which then enables AVX and FMA for Eigen-3.3) for *this particular example*, I see a ~13% slowdown. Can you confirm this?
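To double-check which instruction sets a given flag combination really enables, one option is to look at the feature macros that GCC and Clang predefine. The following is a minimal sketch; the function name enabled_simd_features is made up for illustration and is not part of Eigen.

```cpp
#include <string>

// Hypothetical helper (not part of Eigen): lists the SIMD feature macros
// that the compiler predefined for this translation unit.
// __SSE2__, __SSE4_2__, __AVX__, __AVX2__ and __FMA__ are set by GCC and
// Clang depending on -msse2, -msse4.2, -mavx, -mavx2, -mfma or -march=...
std::string enabled_simd_features() {
    std::string out;
#ifdef __SSE2__
    out += "SSE2 ";
#endif
#ifdef __SSE4_2__
    out += "SSE4.2 ";
#endif
#ifdef __AVX__
    out += "AVX ";
#endif
#ifdef __AVX2__
    out += "AVX2 ";
#endif
#ifdef __FMA__
    out += "FMA ";
#endif
    if (out.empty()) {
        out = "none";
    }
    return out;
}
```

Printing the returned string from the benchmark's main() once per build shows whether two builds really differ in AVX/FMA support or only in the Eigen version.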

For reference, the compilation command used was
g++ eigen_bench2.cpp -std=c++11 -Ofast -fno-finite-math-only -DNDEBUG -march=native -I eigen-3.2.9 / 3.3.5

I am neither an x86 assembly nor a vectorization expert, so take the following with a large grain of salt.

I have attached the generated assembly for the "hot loop". The first 49 lines are nearly the same, except that eigen-3.2 uses %rbp-relative addressing and 3.3 uses %rsp (with different offsets).

The remainder differs more. As far as I can see, Eigen-3.3 doesn't use the 256-bit AVX (ymm) registers, but it does use more "...packed-double" instructions on xmm registers (the Eigen 3.2 assembly doesn't seem to use any). Even so, the sequence still seems to be (slightly) slower overall for Eigen-3.3.

eigen-3.2:
vmovsd -168(%rbp), %xmm3
vxorpd %xmm6, %xmm6, %xmm6
vmovsd .LC4(%rip), %xmm2
vmovsd -136(%rbp), %xmm8
vcvtsi2sd %eax, %xmm6, %xmm6
vfmadd132sd .LC3(%rip), %xmm2, %xmm6
vmulsd %xmm3, %xmm3, %xmm0
vmovsd -160(%rbp), %xmm2
vmovsd -176(%rbp), %xmm4
vmulsd %xmm8, %xmm3, %xmm1
vmovsd -144(%rbp), %xmm9
vmovsd -184(%rbp), %xmm5
vmovsd -152(%rbp), %xmm7
vfmadd231sd %xmm2, %xmm2, %xmm0
vfmadd231sd %xmm6, %xmm2, %xmm1
vfmadd231sd %xmm4, %xmm4, %xmm0
vmulsd .LC5(%rip), %xmm0, %xmm0
vfmadd231sd %xmm4, %xmm9, %xmm1
vdivsd %xmm5, %xmm0, %xmm0
vdivsd %xmm5, %xmm1, %xmm1
vsubsd %xmm0, %xmm7, %xmm0
vmulsd .LC6(%rip), %xmm0, %xmm0
vfmadd213sd -104(%rbp), %xmm1, %xmm4
vfmadd213sd -96(%rbp), %xmm1, %xmm3
vfmadd213sd -88(%rbp), %xmm1, %xmm2
vfmadd213sd -112(%rbp), %xmm1, %xmm5
vfmadd231sd %xmm9, %xmm0, %xmm4
vfmadd231sd %xmm8, %xmm0, %xmm3
vfmadd231sd %xmm6, %xmm0, %xmm2
vaddsd %xmm7, %xmm0, %xmm0
vmovsd %xmm5, -112(%rbp)
vfmadd213sd -80(%rbp), %xmm0, %xmm1
vmovsd %xmm4, -104(%rbp)
vmovsd %xmm3, -96(%rbp)
vmovsd %xmm2, -88(%rbp)
vmovsd %xmm1, -80(%rbp)
cmpl %ebx, %r13d
jl .L91

eigen-3.3:
vmovupd 120(%rsp), %xmm7
vmovapd 32(%rsp), %xmm8
vxorpd %xmm5, %xmm5, %xmm5
vcvtsi2sd %eax, %xmm5, %xmm5
vmovsd .LC4(%rip), %xmm4
vfmadd132sd .LC3(%rip), %xmm4, %xmm5
vmulpd %xmm7, %xmm7, %xmm1
vmovsd 16(%rsp), %xmm3
vmovsd 24(%rsp), %xmm4
vmovsd 8(%rsp), %xmm6
vunpckhpd %xmm1, %xmm1, %xmm2
vaddsd %xmm2, %xmm1, %xmm2
vmulpd %xmm8, %xmm7, %xmm1
vfmadd231sd %xmm3, %xmm3, %xmm2
vmulsd .LC5(%rip), %xmm2, %xmm2
vunpckhpd %xmm1, %xmm1, %xmm0
vaddsd %xmm0, %xmm1, %xmm0
vdivsd %xmm4, %xmm2, %xmm2
vfmadd231sd %xmm5, %xmm3, %xmm0
vdivsd %xmm4, %xmm0, %xmm0
vsubsd %xmm2, %xmm6, %xmm2
vmulsd .LC6(%rip), %xmm2, %xmm2
vfmadd213sd 88(%rsp), %xmm0, %xmm3
vfmadd213sd 64(%rsp), %xmm0, %xmm4
vmovddup %xmm0, %xmm1
vfmadd213pd 72(%rsp), %xmm1, %xmm7
vmovddup %xmm2, %xmm1
vfmadd231sd %xmm5, %xmm2, %xmm3
vaddsd %xmm6, %xmm2, %xmm2
vmovsd %xmm4, 64(%rsp)
vfmadd213sd 96(%rsp), %xmm2, %xmm0
vfmadd231pd %xmm1, %xmm8, %xmm7
vmovsd %xmm3, 88(%rsp)
vmovsd %xmm0, 96(%rsp)
vmovups %xmm7, 72(%rsp)
cmpl %ebx, %r12d
jl .L91
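One way to read the two listings, offered as a sketch and not as code taken from Eigen: the 3.2 loop forms the 3-element sums with one scalar multiply followed by two scalar FMAs, while the 3.3 loop multiplies the first two elements as a packed pair (vmulpd), folds the pair with vunpckhpd + vaddsd, and then adds the third element with a scalar FMA. The function names dot3_scalar and dot3_partial below are hypothetical and only mirror that pattern.

```cpp
#include <cmath>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical illustration of the 3.2-style sequence:
// vmulsd, then vfmadd231sd twice (a chain of three dependent scalar ops).
double dot3_scalar(const double* a, const double* b) {
    double acc = a[0] * b[0];
    acc = std::fma(a[1], b[1], acc);
    acc = std::fma(a[2], b[2], acc);
    return acc;
}

// Hypothetical illustration of the 3.3-style sequence:
// packed multiply of elements 0 and 1, horizontal add, scalar FMA for element 2.
double dot3_partial(const double* a, const double* b) {
#if defined(__SSE2__)
    __m128d prod = _mm_mul_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)); // vmulpd
    __m128d high = _mm_unpackhi_pd(prod, prod);                  // vunpckhpd
    double acc = _mm_cvtsd_f64(_mm_add_sd(prod, high));          // vaddsd
#else
    // Fallback for targets without SSE2: same arithmetic, no intrinsics.
    double acc = a[0] * b[0] + a[1] * b[1];
#endif
    return std::fma(a[2], b[2], acc); // vfmadd231sd
}
```

For a = (1, 2, 3) and b = (4, 5, 6) both functions return 32. In general the two forms may differ in the last bit, because the intermediate rounding happens at different points. The packed form saves one multiply but adds a shuffle and a dependent add, so for vectors this short it is not obviously faster.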



If there's anything else I could try in order to pinpoint the causes, I'm all ears. :)


Best regards

Daniel Vollmer

--------------------------
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
German Aerospace Center
Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany

Daniel Vollmer | AS C²A²S²E
www.DLR.de

From: Gael Guennebaud [gael.guennebaud@xxxxxxxxx]
Sent: Thursday, 2 August 2018 0:10
To: eigen
Subject: Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)


Hi,

I tried your little benchmark with gcc 7 and got no difference at all (-O3 -DNDEBUG):

// 3.2: 7.75s
// 3.3: 7.68s

// 3.2 -march=native: 7.46s
// 3.3 -march=native: 7.48s

I ran each test 4 times and kept the best of each; the variations were about 0.2s.
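The "best of N runs" procedure can be sketched as follows. This is only an illustration of the measurement idea, not the script actually used; the helper name best_of is made up, and std::chrono::steady_clock is an assumed choice of timer.

```cpp
#include <chrono>
#include <limits>

// Hypothetical helper: runs `work` `repeats` times and returns the
// shortest wall-clock time of a single run, in seconds.
// Taking the minimum filters out runs disturbed by other system activity.
template <typename Work>
double best_of(int repeats, Work&& work) {
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const double seconds =
            std::chrono::duration<double>(stop - start).count();
        if (seconds < best) {
            best = seconds;
        }
    }
    return best;
}
```

Called as best_of(4, [&] { run_benchmark(); }), where run_benchmark stands in for the benchmark body, this reproduces the scheme described above. With run-to-run variations of about 0.2s on a 7.5s run, differences below roughly 3% would be within the noise.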

gael


On Wed, Aug 1, 2018 at 1:28 PM <Daniel.Vollmer@xxxxxx> wrote:
Hi,

extracting the relevant code is rather difficult, unfortunately.

Attached you will find an extracted subset that may point to one problematic area involving partial vectorization: for this example, Eigen 3.3 is ~13% slower (unless I'm measuring incorrectly).

If I turn off partial vectorization, the run-time is the same for both versions. This is not completely representative (obviously), because in my "complete" runs disabling partial vectorization only decreases the run-time difference between 3.3 and 3.2, but does not eliminate it.


Best regards

Daniel Vollmer

________________________________________
From: Marc Glisse [marc.glisse@xxxxxxxx]
Sent: Wednesday, 1 August 2018 12:06
To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)

On Wed, 1 Aug 2018, Daniel.Vollmer@xxxxxx wrote:

> I've attached a document with some performance measurements for
> different compilers, different Eigen versions, and 3 different
> test-cases for our code (tau, cgns, dg) that stress different areas /
> sizes.

Would it be hard to create small testcases that you could share and that
would still show a similar performance pattern? I assume that the easier
it is for Eigen dev to reproduce, the more likely they are to investigate.

--
Marc Glisse


Attachment: eigen_32.S
Description: eigen_32.S

Attachment: eigen_33.S
Description: eigen_33.S


