Re: [eigen] Non-optimal sse assembly code with gcc
On 22.01.2012 17:32, Benjamin Schindler wrote:
Hi
I just had a close look at the assembly generated by the following
function:
bool particleCheckSpheric(Eigen::AlignedVector3<float> pos1,
Eigen::AlignedVector3<float> pos2, float particleSize)
{
return particleSize*particleSize > (pos1-pos2).squaredNorm();
}
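For reading the assembly, it may help to see a scalar sketch of what this function computes. This is not Eigen code: Vec4 and particleCheckSphericScalar are hypothetical names, and the sketch assumes that the fourth (padding) lane of the aligned vector is zero, so that summing all four lanes gives the squared norm of the three coordinates.

```cpp
#include <array>

// Hypothetical stand-in for Eigen::AlignedVector3<float>: three coordinates
// plus one padding lane, assumed here to be zero.
using Vec4 = std::array<float, 4>;

// Scalar reference: true if the squared distance between the two positions
// is smaller than particleSize squared.
bool particleCheckSphericScalar(const Vec4& pos1, const Vec4& pos2, float particleSize)
{
    float squaredNorm = 0.0f;
    for (int i = 0; i < 4; ++i) {
        const float d = pos1[i] - pos2[i];  // corresponds to subps
        squaredNorm += d * d;               // mulps, then the horizontal add
    }
    return particleSize * particleSize > squaredNorm;  // mulss, ucomiss, seta
}
```

The comments only map each step loosely onto the instructions listed in the assembly.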
The assembly I got is the following (compiled on an amd64 machine using
gcc 4.5.3, with -O2 -DNDEBUG):
01: movaps (%rdi), %xmm1
03: mulss %xmm0, %xmm0
04: subps (%rsi), %xmm1
05: mulps %xmm1, %xmm1
06: movaps %xmm1, %xmm2
07: movhlps %xmm1, %xmm2
08: addps %xmm1, %xmm2
09: movaps %xmm2, %xmm1
10: shufps $0x1, %xmm2, %xmm1
11: addss %xmm1, %xmm2
12: ucomiss %xmm2, %xmm0
13: seta %al
14: retq
Notice line 6 (and 9): It seems to me that these copies are unnecessary,
as only the low quadword is really used. Is this a problem with the
compiler, or is this an Eigen issue?
Side note: I guess that if you enable SSE3, lines 6 to 11 will be replaced
by just two "haddps %xmm1, %xmm1" instructions.
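To illustrate the side note, here is a minimal sketch of a horizontal sum written with two haddps instructions via the _mm_hadd_ps intrinsic. The function name horizontalSumSse3 is hypothetical, and the target attribute is a GCC/Clang extension assumed here so that the example compiles without passing -msse3 globally; it is a sketch of the idea, not a copy of Eigen's code.

```cpp
#include <pmmintrin.h>  // SSE3 intrinsics (_mm_hadd_ps)

// Horizontal sum of the four floats in a, using two haddps instructions.
// horizontalSumSse3 is a hypothetical name; the target attribute is a
// GCC/Clang extension that enables SSE3 for this function only.
__attribute__((target("sse3")))
float horizontalSumSse3(__m128 a)
{
    __m128 t = _mm_hadd_ps(a, a);  // (a0+a1, a2+a3, a0+a1, a2+a3)
    t = _mm_hadd_ps(t, t);         // every lane now holds a0+a1+a2+a3
    return _mm_cvtss_f32(t);       // extract lane 0
}
```

Because both operands of each haddps are the same register, no extra register copy is needed before the instruction.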
And I think gcc does everything correctly, as the Eigen source (without SSE3)
says:
template<> EIGEN_STRONG_INLINE float predux<Packet4f>(const Packet4f& a)
{
Packet4f tmp = _mm_add_ps(a, _mm_movehl_ps(a,a));
return pfirst(_mm_add_ss(tmp, _mm_shuffle_ps(tmp,tmp, 1)));
}
So _mm_movehl_ps(a,a) actually requires that the upper half be copied
from a (i.e. xmm1). And I guess it can make a difference, because if the
upper half of xmm2 happens to contain a denormalized number, the addps
instruction might be slower on some hardware (not sure about that, though).
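The lane-by-lane behaviour behind this argument can be sketched in a stand-alone function that mirrors the quoted predux. The name horizontalSumSse is hypothetical, and _mm_cvtss_f32 is used here as an assumed equivalent of Eigen's pfirst for extracting the lowest lane.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Stand-alone sketch mirroring the quoted predux<Packet4f>.
// horizontalSumSse is a hypothetical name, not part of Eigen.
float horizontalSumSse(__m128 a)
{
    // _mm_movehl_ps(x, y) returns (y2, y3, x2, x3): the low half comes from
    // the upper half of y, the upper half is kept from x. With x == y == a
    // the result is (a2, a3, a2, a3), so the destination register has to
    // start out as a copy of a.
    __m128 tmp = _mm_add_ps(a, _mm_movehl_ps(a, a));  // (a0+a2, a1+a3, 2*a2, 2*a3)

    // Shuffle immediate 1 moves lane 1 of tmp into lane 0.
    __m128 sum = _mm_add_ss(tmp, _mm_shuffle_ps(tmp, tmp, 1));

    return _mm_cvtss_f32(sum);  // lane 0 = (a0+a2) + (a1+a3)
}
```

Only lane 0 of the final result is read, but the intermediate addps still operates on all four lanes, which is why the contents of the upper half are defined by the intrinsics rather than left arbitrary.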
Christoph
--
----------------------------------------------
Dipl.-Inf. Christoph Hertzberg
Cartesium 0.051
Universität Bremen
Enrique-Schmidt-Straße 5
28359 Bremen
Tel: (+49) 421-218-64252
----------------------------------------------