RE: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)



Hello again,

I've been trying to understand a bit better what is happening with the performance regression I'm seeing, and at the moment I am under the impression that Eigen-3.3 makes it harder (impossible?) for gcc to recognize when no aliasing is happening.

I've further reduced my original example to essentially the following loop  (see eigen_bench3.cpp for a self-contained version).
  using Vec          = Eigen::Matrix<double, 2, 1>;
  Vec sum = Vec::Zero();
  for (int i = 0; i < num; ++i)
  {
    const Vec dirA = sum;
    const Vec dirB = dirA;

    sum += dirA.dot(dirB) * dirA;
  }

When using Eigen-3.3, gcc-8 spills the state back to memory on every iteration while evaluating the expression, whereas with Eigen-3.2 the loop runs entirely in registers, with no spilling at all. As there should be enough registers available, I assume this is due to "fear of aliasing".
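To make "no spilling" concrete: hand-expanding the reduced loop above into plain doubles gives roughly the shape that gcc manages to keep entirely in registers with 3.2 (an illustration only; I have not benchmarked this exact form):

  // Scalar expansion of the reduced loop (illustration only): "sum" lives in
  // two doubles, so nothing forces a round-trip through the stack.
  double s0 = 0.0, s1 = 0.0;               // sum
  for (int i = 0; i < num; ++i)
  {
    const double d = s0 * s0 + s1 * s1;    // dirA.dot(dirB), with dirA == dirB == sum
    s0 += d * s0;                          // sum += d * dirA, component 0
    s1 += d * s1;                          // sum += d * dirA, component 1
  }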

This is independent of vectorization and all the other nice things; it happens with vectorization disabled as well.
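For anyone who wants to reproduce this: one way to disable Eigen's vectorization is its documented macro, defined before the first Eigen include (a sketch, not necessarily exactly how the attached benchmark is built):

  // Assumption: vectorization is disabled via Eigen's documented macro; it
  // must be defined before any Eigen header is included.
  #define EIGEN_DONT_VECTORIZE
  #include <Eigen/Core>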

Interestingly (?), when I change the size of Vec from 2 to 1, the generated code is identical for 3.3 and 3.2.

The following is the x86-64 assembly for the relevant loop (compiled with only -O1 and with vectorization disabled, for clarity; the problem remains at higher optimization levels):

Eigen-3.3:
# eigen_bench3.cpp:18:   EIGEN_ASM_COMMENT("begin loop");
	test	ebx, ebx	# iftmp.0_3
	jle	.L131	#,
# eigen_bench3.cpp:19:   for (int i = 0; i < num; ++i)
	mov	edx, 0	# i,
.L132:
# eigen-3.3.5/Eigen/src/Core/DenseStorage.h:194:     DenseStorage(const DenseStorage& other) : m_data(other.m_data) {
	mov	rsi, QWORD PTR [rsp+32]	# MEM[(const struct DenseStorage &)&sum].m_data, MEM[(const struct DenseStorage &)&sum].m_data
	mov	rdi, QWORD PTR [rsp+40]	# MEM[(const struct DenseStorage &)&sum].m_data, MEM[(const struct DenseStorage &)&sum].m_data
	mov	QWORD PTR [rsp], rsi	# MEM[(struct DenseStorage *)&dirA].m_data, MEM[(const struct DenseStorage &)&sum].m_data
	mov	QWORD PTR [rsp+8], rdi	# MEM[(struct DenseStorage *)&dirA].m_data, MEM[(const struct DenseStorage &)&sum].m_data
	mov	QWORD PTR [rsp+16], rsi	# MEM[(struct DenseStorage *)&dirB].m_data, MEM[(const struct DenseStorage &)&sum].m_data
	mov	QWORD PTR [rsp+24], rdi	# MEM[(struct DenseStorage *)&dirB].m_data, MEM[(const struct DenseStorage &)&sum].m_data
# eigen-3.3.5/Eigen/src/Core/GenericPacketMath.h:171:         const Packet& b) { return a*b; }
	vmovsd	xmm0, QWORD PTR [rsp+8]	# _26, MEM[(const double &)&dirA + 8]
	vmovsd	xmm1, QWORD PTR [rsp]	# _28, MEM[(const double &)&dirA]
# eigen-3.3.5/Eigen/src/Core/GenericPacketMath.h:171:         const Packet& b) { return a*b; }
	vmulsd	xmm2, xmm0, QWORD PTR [rsp+24]	# tmp123, _26, MEM[(struct plain_array *)&dirB + 8B]
	vmulsd	xmm3, xmm1, QWORD PTR [rsp+16]	# tmp124, _28, MEM[(struct plain_array *)&dirB]
# eigen-3.3.5/Eigen/src/Core/functors/BinaryFunctors.h:42:   EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const result_type operator() (const LhsScalar& a, const RhsScalar& b) const { return a + b; }
	vaddsd	xmm2, xmm2, xmm3	# _30, tmp123, tmp124
# eigen-3.3.5/Eigen/src/Core/functors/BinaryFunctors.h:86:   EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const result_type operator() (const LhsScalar& a, const RhsScalar& b) const { return a * b; }
	vmulsd	xmm1, xmm1, xmm2	# tmp125, _28, _30
# eigen-3.3.5/Eigen/src/Core/functors/AssignmentFunctors.h:49:   EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void assignCoeff(DstScalar& a, const SrcScalar& b) const { a += b; }
	vaddsd	xmm1, xmm1, QWORD PTR [rsp+32]	# tmp126, tmp125, MEM[(double &)&sum]
# eigen-3.3.5/Eigen/src/Core/functors/AssignmentFunctors.h:49:   EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void assignCoeff(DstScalar& a, const SrcScalar& b) const { a += b; }
	vmovsd	QWORD PTR [rsp+32], xmm1	# MEM[(double &)&sum], tmp126
# eigen-3.3.5/Eigen/src/Core/functors/BinaryFunctors.h:86:   EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const result_type operator() (const LhsScalar& a, const RhsScalar& b) const { return a * b; }
	vmulsd	xmm0, xmm0, xmm2	# tmp128, _26, _30
# eigen-3.3.5/Eigen/src/Core/functors/AssignmentFunctors.h:49:   EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void assignCoeff(DstScalar& a, const SrcScalar& b) const { a += b; }
	vaddsd	xmm0, xmm0, QWORD PTR [rsp+40]	# tmp129, tmp128, MEM[(double &)&sum + 8]
# eigen-3.3.5/Eigen/src/Core/functors/AssignmentFunctors.h:49:   EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void assignCoeff(DstScalar& a, const SrcScalar& b) const { a += b; }
	vmovsd	QWORD PTR [rsp+40], xmm0	# MEM[(double &)&sum + 8], tmp129
# eigen_bench3.cpp:19:   for (int i = 0; i < num; ++i)
	inc	edx	# i
# eigen_bench3.cpp:19:   for (int i = 0; i < num; ++i)
	cmp	ebx, edx	# iftmp.0_3, i
	jne	.L132	#,
.L131:
# eigen_bench3.cpp:28:   EIGEN_ASM_COMMENT("end   loop");




And the same for Eigen-3.2:

# eigen_bench3.cpp:18:   EIGEN_ASM_COMMENT("begin loop");
	test	ebx, ebx	# iftmp.0_3
	jle	.L131	#,
	vmovsd	xmm2, QWORD PTR [rsp+16]	# sum__lsm.104, MEM[(const Scalar &)&sum]
	vmovsd	xmm1, QWORD PTR [rsp+24]	# sum__lsm.105, MEM[(const Scalar &)&sum + 8]
# eigen_bench3.cpp:19:   for (int i = 0; i < num; ++i)
	mov	edx, 0	# i,
	vmovsd	xmm4, QWORD PTR .LC6[rip]	# tmp117,
.L132:
# eigen-3.2.9-mod/Eigen/src/Core/GenericPacketMath.h:114:         const Packet& b) { return a*b; }
	vmulsd	xmm0, xmm1, xmm1	# tmp114, sum__lsm.105, sum__lsm.105
	vmulsd	xmm3, xmm2, xmm2	# tmp115, sum__lsm.104, sum__lsm.104
# eigen-3.2.9-mod/Eigen/src/Core/Functors.h:26:   EIGEN_STRONG_INLINE const Scalar operator() (const Scalar& a, const Scalar& b) const { return a + b; }
	vaddsd	xmm0, xmm0, xmm3	# tmp116, tmp114, tmp115
	vaddsd	xmm0, xmm0, xmm4	# _52, tmp116, tmp117
	vmulsd	xmm2, xmm2, xmm0	# sum__lsm.104, sum__lsm.104, _52
	vmulsd	xmm1, xmm1, xmm0	# sum__lsm.105, sum__lsm.105, _52
# eigen_bench3.cpp:19:   for (int i = 0; i < num; ++i)
	inc	edx	# i
# eigen_bench3.cpp:19:   for (int i = 0; i < num; ++i)
	cmp	ebx, edx	# iftmp.0_3, i
	jne	.L132	#,
	vmovsd	QWORD PTR [rsp+16], xmm2	# MEM[(const Scalar &)&sum], sum__lsm.104
	vmovsd	QWORD PTR [rsp+24], xmm1	# MEM[(const Scalar &)&sum + 8], sum__lsm.105
.L131:
# eigen_bench3.cpp:28:   EIGEN_ASM_COMMENT("end   loop");


Best regards

Daniel Vollmer

--------------------------
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
German Aerospace Center
Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany

Daniel Vollmer | AS C²A²S²E
www.DLR.de
________________________________________
From: Gael Guennebaud [gael.guennebaud@xxxxxxxxx]
Sent: Thursday, 2 August 2018 0:10
To: eigen
Subject: Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)

Hi,

I tried your little benchmark with gcc 7 and got no difference at all (-O3 -DNDEBUG):

// 3.2: 7.75s
// 3.3: 7.68s

// 3.2 -march=native: 7.46s
// 3.3 -march=native: 7.48s

I ran each test 4 times and kept the best of each; the variations were about 0.2s.
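For reference, the "run 4 times, keep the best" scheme amounts to a wrapper of roughly this shape (a sketch with made-up names, not code taken from the attached benchmark):

  #include <algorithm>
  #include <chrono>
  #include <limits>

  // Best-of-N timing: run fn() `repeats` times and keep the fastest run,
  // to reduce the influence of system noise.
  template <class Fn>
  double best_of(int repeats, Fn&& fn)
  {
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r)
    {
      const auto start = std::chrono::steady_clock::now();
      fn();
      const std::chrono::duration<double> elapsed =
          std::chrono::steady_clock::now() - start;
      best = std::min(best, elapsed.count());
    }
    return best;
  }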

gael


On Wed, Aug 1, 2018 at 1:28 PM <Daniel.Vollmer@xxxxxx> wrote:
Hi,

extracting the relevant code is rather difficult, unfortunately.

Attached you will find an extracted subset, which may point to one problematic area involving partial vectorization: for this example, Eigen 3.3 is ~13% slower (unless I'm measuring incorrectly).

If I turn off partial vectorization, the run-time is the same. This is not completely representative (obviously), because in my "complete" runs disabling partial vectorization only reduces the run-time difference between 3.3 and 3.2 but does not eliminate it.
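A sketch for anyone reproducing this, assuming "partial vectorization" maps to Eigen 3.3's documented EIGEN_UNALIGNED_VECTORIZE switch (that mapping is an assumption, not something spelled out in this mail):

  // Assumption: turning off "partial vectorization" means disabling Eigen
  // 3.3's vectorization via unaligned loads/stores, controlled by this
  // documented macro (default 1). Define it before any Eigen header.
  #define EIGEN_UNALIGNED_VECTORIZE 0
  #include <Eigen/Core>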


Best regards

Daniel Vollmer

--------------------------
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
German Aerospace Center
Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany

Daniel Vollmer | AS C²A²S²E
www.DLR.de

________________________________________
From: Marc Glisse [marc.glisse@xxxxxxxx]
Sent: Wednesday, 1 August 2018 12:06
To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)

On Wed, 1 Aug 2018, Daniel.Vollmer@xxxxxx wrote:

> I've attached a document with some performance measurements for
> different compilers, different Eigen versions, and 3 different
> test-cases for our code (tau, cgns, dg) that stress different areas /
> sizes.

Would it be hard to create small testcases that you could share and that
would still show a similar performance pattern? I assume that the easier
it is for the Eigen devs to reproduce, the more likely they are to investigate.

--
Marc Glisse

Attachment: eigen_bench3.cpp
Description: eigen_bench3.cpp

Attachment: eigen_32.S
Description: eigen_32.S

Attachment: eigen_33.S
Description: eigen_33.S


