RE: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)


To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: RE: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)
From: Marc Glisse <marc.glisse@xxxxxxxx>
Date: Mon, 6 Aug 2018 19:19:36 +0200 (CEST)

On Mon, 6 Aug 2018, Daniel.Vollmer@xxxxxx wrote:

I've been trying to understand a bit better what is happening with the performance regression I'm seeing, and at the moment I am under the impression that Eigen-3.3 makes it harder (impossible?) for gcc to recognize when no aliasing is happening.


Nah, it is just gcc being silly.

I've further reduced my original example to essentially the following loop (see eigen_bench3.cpp for a self-contained version).
 using Vec          = Eigen::Matrix<double, 2, 1>;
 Vec sum = Vec::Zero();
 for (int i = 0; i < num; ++i)
 {
   const Vec dirA = sum;
   const Vec dirB = dirA;

   sum += dirA.dot(dirB) * dirA;
 }


Without vectors, the main loop at -O3 starts with

        movdqu  (%rax), %xmm0
        addl    $1, %edx
        movaps  %xmm0, -40(%rsp)
        movsd   -40(%rsp), %xmm1
        movsd   -32(%rsp), %xmm4
        movaps  %xmm0, -24(%rsp)
        movsd   -16(%rsp), %xmm0
        movsd   -24(%rsp), %xmm5

so: read from memory, write to memory and re-read piecewise, and do it a second time just for the sake of it.

The corresponding internal representation at the end of the high-level optimization phase is


  MEM[(struct DenseStorage *)&dirA].m_data = MEM[(const struct DenseStorage &)sum_5(D)].m_data;
  dirA_31 = MEM[(struct plain_array *)&dirA];
  dirA$8_30 = MEM[(struct plain_array *)&dirA + 8B];
  MEM[(struct DenseStorage *)&dirB].m_data = MEM[(const struct DenseStorage &)&dirA].m_data;
  dirB_37 = MEM[(struct plain_array *)&dirB];
  dirB$8_38 = MEM[(struct plain_array *)&dirB + 8B];

This involves some direct mem-to-mem assignments, which is something that gcc handles super badly. If the copy was done piecewise, each element would be an SSA variable and optimizations would work. Even if the copy was done with memcpy there would be code to simplify it. But mem-to-mem...
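
To make the three copy styles concrete, here is a minimal sketch without Eigen. Vec2, the copy_* helpers and run_loop are hypothetical names made up for illustration; they are not Eigen's DenseStorage/plain_array, and whether a given gcc version actually emits the mem-to-mem form for the aggregate copy here would need to be checked in its dumps. All three variants compute the same values; only the way the copy is written differs.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for Eigen::Matrix<double, 2, 1>; not Eigen's real storage type.
struct Vec2 {
    double data[2];
};

// Whole-aggregate assignment: the kind of copy that may show up as a single
// mem-to-mem assignment in the compiler's internal representation.
inline Vec2 copy_aggregate(const Vec2& src) {
    Vec2 dst = src;
    return dst;
}

// Piecewise copy: each element is copied as a separate scalar.
inline Vec2 copy_piecewise(const Vec2& src) {
    Vec2 dst;
    dst.data[0] = src.data[0];
    dst.data[1] = src.data[1];
    return dst;
}

// memcpy-based copy of the whole object.
inline Vec2 copy_memcpy(const Vec2& src) {
    Vec2 dst;
    std::memcpy(&dst, &src, sizeof(Vec2));
    return dst;
}

inline double dot(const Vec2& a, const Vec2& b) {
    return a.data[0] * b.data[0] + a.data[1] * b.data[1];
}

// Same shape as the reduced loop from the report, parameterized on the copy style.
template <typename Copy>
Vec2 run_loop(Vec2 sum, int num, Copy copy) {
    assert(num >= 0);
    for (int i = 0; i < num; ++i) {
        const Vec2 dirA = copy(sum);
        const Vec2 dirB = copy(dirA);
        const double d = dot(dirA, dirB);
        sum.data[0] += d * dirA.data[0];
        sum.data[1] += d * dirA.data[1];
    }
    return sum;
}
```

Compiling each instantiation separately at -O3 and comparing the assembly (or the tree dumps from -fdump-tree-optimized) is one way to see which copy style the optimizer handles well on a particular gcc version.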


I strongly encourage you to report this testcase to gcc's bugzilla.

(it doesn't mean that people can't work around it in eigen somehow, but that will likely not be nice and not catch all cases)


--
Marc Glisse

