[eigen] Vectorization of complex
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: [eigen] Vectorization of complex
- From: David Luitz <tux008@xxxxxxxxxxxxxx>
- Date: Thu, 20 Jan 2011 23:11:56 +0100
Hi all,
I experimented a bit with vectorizing complex multiplication and am
quite surprised by what I found:
First of all, I tried to implement complex multiplication using the
SSE4.1 intrinsics _mm_dp_pd and _mm_blend_pd.
I came up with the following implementation in
Eigen/src/Core/arch/SSE/Complex.h:
template<> EIGEN_STRONG_INLINE Packet1cd pmul<Packet1cd>(const Packet1cd& a, const Packet1cd& b)
{
#ifdef EIGEN_VECTORIZE_SSE4_1
  // Sign mask for the high double (the imaginary lane).
  const __m128d mask = _mm_castsi128_pd(_mm_set_epi32(0x80000000,0x0,0x0,0x0));
  // First dot product  -> low lane:  a.re*b.re - a.im*b.im
  // Second dot product -> high lane: a.im*b.re + a.re*b.im
  return Packet1cd(_mm_blend_pd(_mm_dp_pd(_mm_xor_pd(a.v, mask), b.v, 0xF1),
                                _mm_dp_pd(vec2d_swizzle1(a.v, 1, 0), b.v, 0xF2),
                                0x02));
#else
#ifdef EIGEN_VECTORIZE_SSE3
  return Packet1cd(_mm_addsub_pd(_mm_mul_pd(vec2d_swizzle1(a.v, 0, 0), b.v),
                                 _mm_mul_pd(vec2d_swizzle1(a.v, 1, 1),
                                            vec2d_swizzle1(b.v, 1, 0))));
#else
  // Sign mask for the low double (the real lane).
  const __m128d mask = _mm_castsi128_pd(_mm_set_epi32(0x0,0x0,0x80000000,0x0));
  return Packet1cd(_mm_add_pd(_mm_mul_pd(vec2d_swizzle1(a.v, 0, 0), b.v),
                              _mm_xor_pd(_mm_mul_pd(vec2d_swizzle1(a.v, 1, 1),
                                                    vec2d_swizzle1(b.v, 1, 0)),
                                         mask)));
#endif // SSE3
#endif // SSE4_1
}
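To make the lane arithmetic of the three branches easier to follow, here is a rough scalar model in plain C++ (no intrinsics, no Eigen). It is only a sketch: the type Lanes and the functions dot_to_lane, mul_dp, mul_addsub and mul_xor_add are hypothetical names made up for illustration, and the model assumes lane 0 holds the real part and lane 1 the imaginary part. It says nothing about speed, it only shows that the three formulations compute the same product.

```cpp
#include <array>

// Two-lane model of one packed complex<double>.
// Assumption: lane 0 = real part, lane 1 = imaginary part.
using Lanes = std::array<double, 2>;

// Models _mm_dp_pd with both products enabled: the sum of the two
// lane products is written to lane `dst`, the other lane is zeroed.
Lanes dot_to_lane(const Lanes& x, const Lanes& y, int dst) {
    Lanes r{0.0, 0.0};
    r[dst] = x[0] * y[0] + x[1] * y[1];
    return r;
}

// SSE4.1-style branch: two dot products followed by a blend.
Lanes mul_dp(const Lanes& a, const Lanes& b) {
    const Lanes a_neg_im{a[0], -a[1]};  // xor with the sign mask on lane 1
    const Lanes a_swapped{a[1], a[0]};  // vec2d_swizzle1(a, 1, 0)
    const Lanes re = dot_to_lane(a_neg_im, b, 0);
    const Lanes im = dot_to_lane(a_swapped, b, 1);
    return {re[0], im[1]};              // blend: lane 0 from re, lane 1 from im
}

// SSE3-style branch: two multiplies, then subtract in lane 0 and add in lane 1.
Lanes mul_addsub(const Lanes& a, const Lanes& b) {
    const Lanes t0{a[0] * b[0], a[0] * b[1]};  // a.re * (b.re, b.im)
    const Lanes t1{a[1] * b[1], a[1] * b[0]};  // a.im * (b.im, b.re)
    return {t0[0] - t1[0], t0[1] + t1[1]};
}

// SSE2-style branch: flip the sign of lane 0 of the second product, then add.
Lanes mul_xor_add(const Lanes& a, const Lanes& b) {
    const Lanes t0{a[0] * b[0], a[0] * b[1]};
    const Lanes t1{-(a[1] * b[1]), a[1] * b[0]};  // xor with the sign mask on lane 0
    return {t0[0] + t1[0], t0[1] + t1[1]};
}
```

For example, with a = 1 + 2i and b = 3 + 4i, each of the three functions returns the lanes (-5, 10), which is the expected product -5 + 10i.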
I then started testing the code and found that, unfortunately, my
implementation is a bit slower than the SSE2 version. Even more
puzzling: the already existing SSE3 implementation is ALSO
SLOWER than the SSE2 code! Does anybody have an idea why my SSE4_1 code
is even slower than the SSE3 code?
By the way, we are talking about something like a 1 percent run-time
difference in my tests, but still, if the SSE3 and SSE4 codes are not
really faster than SSE2, I think they should be removed...
I only tested this for complex matrix-matrix products of
Eigen::Matrix<std::complex<float>, Eigen::Dynamic, Eigen::Dynamic> and
Eigen::Matrix<std::complex<double>, Eigen::Dynamic, Eigen::Dynamic>, so
maybe I missed something and there are cases where the SSE3 code is useful.
Greetings,
David Luitz