[eigen] Vectorization of complex



Hi all,

I experimented a bit with vectorizing complex multiplication and am quite surprised by what I found:

First of all, I tried to implement complex multiplication using the SSE4.1 intrinsics _mm_dp_pd and _mm_blend_pd.

I came up with the following implementation in Eigen/src/Core/arch/SSE/Complex.h:

template<> EIGEN_STRONG_INLINE Packet1cd pmul<Packet1cd>(const Packet1cd& a, const Packet1cd& b)
{
  #ifdef EIGEN_VECTORIZE_SSE4_1
  // Sign mask for the high (imaginary) lane: a.v ^ mask = (re(a), -im(a)).
  const __m128d mask = _mm_castsi128_pd(_mm_set_epi32(0x80000000,0x0,0x0,0x0));
  // Low lane:  re(a)*re(b) - im(a)*im(b), high lane: im(a)*re(b) + re(a)*im(b).
  return Packet1cd(_mm_blend_pd(_mm_dp_pd(_mm_xor_pd(a.v, mask), b.v, 0xF1),
                                _mm_dp_pd(vec2d_swizzle1(a.v, 1, 0), b.v, 0xF2),
                                0x02));
  #else
  #ifdef EIGEN_VECTORIZE_SSE3
  return Packet1cd(_mm_addsub_pd(_mm_mul_pd(vec2d_swizzle1(a.v, 0, 0), b.v),
                                 _mm_mul_pd(vec2d_swizzle1(a.v, 1, 1),
                                            vec2d_swizzle1(b.v, 1, 0))));
  #else
  // Sign mask for the low (real) lane.
  const __m128d mask = _mm_castsi128_pd(_mm_set_epi32(0x0,0x0,0x80000000,0x0));
  return Packet1cd(_mm_add_pd(_mm_mul_pd(vec2d_swizzle1(a.v, 0, 0), b.v),
                              _mm_xor_pd(_mm_mul_pd(vec2d_swizzle1(a.v, 1, 1),
                                                    vec2d_swizzle1(b.v, 1, 0)), mask)));
  #endif // SSE3
  #endif // SSE4_1
}
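
To make the lane arithmetic of the three branches easier to compare, here is a plain scalar sketch of what each one computes. This is not Eigen code and uses no intrinsics: the Lanes struct and the mul_*_model function names are hypothetical stand-ins I made up for illustration, with lo playing the role of the real lane and hi the imaginary lane of a Packet1cd.

```cpp
// Hypothetical scalar stand-in for a Packet1cd: lo = real part, hi = imaginary part.
struct Lanes {
  double lo;
  double hi;
};

// Models the SSE4.1 branch: two dot products, each landing in one lane, then a blend.
inline Lanes mul_dp_model(const Lanes& a, const Lanes& b) {
  const Lanes a_conj = {a.lo, -a.hi};                    // xor with the sign mask on the high lane
  const Lanes a_swap = {a.hi, a.lo};                     // vec2d_swizzle1(a.v, 1, 0)
  const double re = a_conj.lo * b.lo + a_conj.hi * b.hi; // dot product written to the low lane
  const double im = a_swap.lo * b.lo + a_swap.hi * b.hi; // dot product written to the high lane
  return {re, im};                                       // blend: low from first, high from second
}

// Models the SSE3 branch: addsub subtracts in the low lane and adds in the high lane.
inline Lanes mul_addsub_model(const Lanes& a, const Lanes& b) {
  const Lanes x = {a.lo * b.lo, a.lo * b.hi};            // (re(a), re(a)) * (re(b), im(b))
  const Lanes y = {a.hi * b.hi, a.hi * b.lo};            // (im(a), im(a)) * (im(b), re(b))
  return {x.lo - y.lo, x.hi + y.hi};
}

// Models the SSE2 branch: flip the sign of the low lane of the second product, then add.
inline Lanes mul_xor_model(const Lanes& a, const Lanes& b) {
  const Lanes x = {a.lo * b.lo, a.lo * b.hi};
  const Lanes y = {-(a.hi * b.hi), a.hi * b.lo};         // xor with the sign mask on the low lane
  return {x.lo + y.lo, x.hi + y.hi};
}
```

All three models evaluate re(a)*re(b) - im(a)*im(b) in the low lane and re(a)*im(b) + im(a)*re(b) in the high lane; they only differ in how the sign flip and the final combination are expressed. The sketch says nothing about instruction cost, which is what the timings below are about.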



I then started testing the code and realized that, unfortunately, my implementation is a bit slower than the SSE2 version. Even more puzzling: the existing SSE3 implementation is ALSO SLOWER than the SSE2 code! Does anybody have an idea why my SSE4_1 code is even slower than the SSE3 code?

By the way, we are talking about a runtime difference of something like 1 percent in my tests. Still, if the SSE3 and SSE4 codes are not really faster than SSE2, I think they should be removed...

I only tested this for complex matrix matrix products of
Eigen::Matrix<std::complex<float>, Eigen::Dynamic, Eigen::Dynamic> and
Eigen::Matrix<std::complex<double>, Eigen::Dynamic, Eigen::Dynamic>, so maybe I missed something and there are cases where the SSE3 code is useful.

Greetings,
David Luitz



