Re: [eigen] Non-optimal sse assembly code with gcc


On 22.01.2012 17:32, Benjamin Schindler wrote:

I just had a close look at the assembly generated by the following

bool particleCheckSpheric(Eigen::AlignedVector3<float> pos1,
                          Eigen::AlignedVector3<float> pos2, float particleSize)
{
  return particleSize*particleSize > (pos1-pos2).squaredNorm();
}

The assembly I got is the following (compiled on an amd64 machine using
gcc 4.5.3, with -O2 -DNDEBUG):

01: movaps (%rdi), %xmm1
03: mulss %xmm0, %xmm0
04: subps (%rsi), %xmm1
05: mulps %xmm1, %xmm1
06: movaps %xmm1, %xmm2
07: movhlps %xmm1, %xmm2
08: addps %xmm1, %xmm2
09: movaps %xmm2, %xmm1
10: shufps $0x1, %xmm2, %xmm1
11: addss %xmm1, %xmm2
12: ucomiss %xmm2, %xmm0
13: seta %al
14: retq

Notice lines 06 and 09: it seems to me that these copies are unnecessary,
as only the low quadword is actually used. Is this a problem of the
compiler, or is this an Eigen issue?

Side note: I guess that if you activate SSE3, lines 06 to 11 will be replaced by just two haddps %xmm1, %xmm1 instructions.

And I think gcc does everything correctly, as the Eigen source (without SSE3) says:

template<> EIGEN_STRONG_INLINE float predux<Packet4f>(const Packet4f& a)
{
  Packet4f tmp = _mm_add_ps(a, _mm_movehl_ps(a,a));
  return pfirst(_mm_add_ss(tmp, _mm_shuffle_ps(tmp,tmp, 1)));
}

So _mm_movehl_ps(a,a) actually requires the upper half to be copied from a (i.e. xmm1) into the destination register. And I guess it can make a difference: if the upper half of xmm2 happened to contain a denormalized number, the addps instruction might be slower on some hardware (not sure about that, though).


Dipl.-Inf. Christoph Hertzberg
Cartesium 0.051
Universität Bremen
Enrique-Schmidt-Straße 5
28359 Bremen

Tel: (+49) 421-218-64252
