Re: [eigen] Issues regarding Quaternion-alignment and const Maps
Here comes the patch.
Its genericity (is that a word?) is based on Gael's suggestions (I just
replaced ei_por with ei_pxor, and used the same mask for both addsub
replacements).
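
To make the ei_por/ei_pxor difference concrete, here is a minimal sketch using raw SSE2 intrinsics instead of Eigen's ei_* wrappers. All function names in it are hypothetical and exist only for illustration. It assumes an x86 target with SSE2 and relies on _mm_addsub_pd subtracting in the low lane and adding in the high lane. XOR with a sign-bit mask flips the sign of one lane, while OR only forces the sign bit on, so OR gives a wrong result whenever that lane is already negative.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics

// Sign-bit mask for lane 0 only: {-0.0, +0.0}.
// Note that _mm_set_pd takes the high lane first.
static inline __m128d sign_mask_lane0() { return _mm_set_pd(0.0, -0.0); }

// SSE2 stand-in for _mm_addsub_pd(a, b): returns {a0 - b0, a1 + b1}.
// XOR with the mask negates b0 and leaves b1 untouched.
static inline __m128d addsub_sse2(__m128d a, __m128d b) {
    return _mm_add_pd(a, _mm_xor_pd(sign_mask_lane0(), b));
}

// Mirrored pattern {a0 + b0, a1 - b1}: same mask, but with a subtraction.
static inline __m128d subadd_sse2(__m128d a, __m128d b) {
    return _mm_sub_pd(a, _mm_xor_pd(sign_mask_lane0(), b));
}

// For comparison only: OR sets the sign bit instead of flipping it,
// so a negative b0 stays negative and the result is wrong.
static inline __m128d addsub_with_or(__m128d a, __m128d b) {
    return _mm_add_pd(a, _mm_or_pd(sign_mask_lane0(), b));
}

// Helper: read lane i of a packet (0 = low, 1 = high).
static inline double lane(__m128d v, int i) {
    double out[2];
    _mm_storeu_pd(out, v);
    return out[i];
}
```

With a = {1, 10} and b = {-2, 3} (low lane first), addsub_sse2 yields {3, 13}, whereas the OR variant yields {-1, 13} because the negative b0 is not flipped. Using one mask with add in one place and sub in the other is what allows a single constant to serve both replacements.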
I was surprised at first that I only get a speedup of about 1.6x
over the non-vectorized version, but then found out that originally
the -msse2 build was actually slower than the version without vectorization.
Anyway, -msse2 and -msse3 now both run faster than just -O2 (on my
Core2 Duo).
On my (rather archaic) Athlon64, the current SSE2 version (the CPU does not
support SSE3) is *slower* than just using -O2 :(
Which, by the way, reminds me of the original topic of this thread ;)
Christoph
Gael Guennebaud wrote:
> here is a variant using as much generic code as possible:
>
> const __m128d mask1 = _mm_castsi128_pd(_mm_set_epi32(0x0,0x0,0x80000000,0x0));
> const __m128d mask2 = _mm_castsi128_pd(_mm_set_epi32(0x80000000,0x0,0x0,0x0));
>
> Quaternion<double> res;
>
> typedef ei_packet_traits<double>::type Packet;
>
> const double* a = _a.coeffs().data();
> Packet b_xy = _b.coeffs().template packet<Aligned>(0);
> Packet b_zw = _b.coeffs().template packet<Aligned>(2);
> Packet a_xx = ei_pset1(a[0]);
> Packet a_yy = ei_pset1(a[1]);
> Packet a_zz = ei_pset1(a[2]);
> Packet a_ww = ei_pset1(a[3]);
> Packet t1, t2;
>
> t1 = ei_padd(ei_pmul(a_ww, b_xy), ei_pmul(a_yy, b_zw));
> t2 = ei_psub(ei_pmul(a_zz, b_xy), ei_pmul(a_xx, b_zw));
>
> #ifdef __SSE3__
> ei_pstore(&res.x(), _mm_addsub_pd(t1, ei_preverse(t2)));
> #else
> ei_pstore(&res.x(), ei_padd(t1, ei_por(mask1,ei_preverse(t2))));
> #endif
>
> t1 = ei_psub(ei_pmul(a_ww, b_zw), ei_pmul(a_yy, b_xy));
> t2 = ei_padd(ei_pmul(a_zz, b_zw), ei_pmul(a_xx, b_xy));
> #ifdef __SSE3__
> ei_pstore(&res.z(), ei_preverse(_mm_addsub_pd(ei_preverse(t1), t2)));
> #else
> ei_pstore(&res.z(), ei_padd(t1, ei_por(mask2,ei_preverse(t2))));
> #endif
>
> return res;
>
> Actually, my recent work on the vectorization of complexes, and this
> code, made me think that it would be a good idea to add ei_paddsub
> and ei_psubadd functions so that we could write generic vectorized
> code for complexes and quaternions (generic in the sense that it would
> work for all vector engines).
>
> Here is how I see it. For instance let's take the example of the
> quaternion multiplication. We could have a generic
>
> template<typename Quat> Quat ei_quatmul(Quat& a, Quat& b);
>
> function calling an ei_quatmul_selector, which would be specialized for
> the following 3 configurations:
>
> 1 - ei_packet_traits<Quat::Scalar>::size == 2 => the above code
> 2 - ei_packet_traits<Quat::Scalar>::size == 4 => the code we already
> have but written in a generic way
> 3 - otherwise => scalar path
>
> And we should make sure that one can specialize this function for a
> given scalar type/vector engine in the case some specific
> optimizations can be done.
>
> gael.
>
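
The three-way dispatch proposed above could be sketched roughly as follows. This is only an illustration of selecting a code path by packet size through partial template specialization. The names packet_traits_sketch and quatmul_selector_sketch are hypothetical stand-ins, not Eigen's ei_packet_traits or any existing selector, and each path just reports which branch was chosen instead of doing the multiplication.

```cpp
#include <cassert>

// Hypothetical stand-in for ei_packet_traits: scalars per packet.
// The sizes assume a 128-bit vector engine such as SSE.
template <typename Scalar> struct packet_traits_sketch { enum { size = 1 }; };
template <> struct packet_traits_sketch<double> { enum { size = 2 }; };
template <> struct packet_traits_sketch<float>  { enum { size = 4 }; };

enum PathTaken { ScalarPath, TwoLanePath, FourLanePath };

// Case 3: any packet size without a dedicated specialization
// falls back to the scalar path.
template <typename Scalar, int PacketSize = packet_traits_sketch<Scalar>::size>
struct quatmul_selector_sketch {
    static PathTaken run() { return ScalarPath; }
};

// Case 1: two scalars per packet (e.g. double), one quaternion spans two packets.
template <typename Scalar>
struct quatmul_selector_sketch<Scalar, 2> {
    static PathTaken run() { return TwoLanePath; }
};

// Case 2: four scalars per packet (e.g. float), one quaternion fits in one packet.
template <typename Scalar>
struct quatmul_selector_sketch<Scalar, 4> {
    static PathTaken run() { return FourLanePath; }
};

// Generic entry point: the packet size picks the implementation at compile time.
template <typename Scalar>
PathTaken quatmul_path_sketch() {
    return quatmul_selector_sketch<Scalar>::run();
}
```

In this sketch, quatmul_path_sketch<double>() selects the two-lane path, quatmul_path_sketch<float>() the four-lane path, and any other scalar type the scalar fallback. A full specialization for one scalar type would take precedence over the size-based ones, which is one way to allow engine-specific optimizations.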
> On Sat, Jul 10, 2010 at 1:11 AM, Christoph Hertzberg
> <chtz@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>> Benoit Jacob wrote:
>>> I have made a patch letting ei_pset1 use _mm_loaddup_pd when we have SSE3:
>>>
>>> template<> EIGEN_STRONG_INLINE Packet2d ei_pset1<double>(const double&
>>> from) {
>>> #ifdef EIGEN_VECTORIZE_SSE3
>>> return _mm_loaddup_pd(&from);
>>> #else
>>> Packet2d res = _mm_set_sd(from);
>>> return ei_vec2d_swizzle1(res, 0, 0);
>>> #endif
>>> }
>>>
>>> But guess what? It's actually not faster (perhaps even a bit slower)
>>> than our ei_vec2d_swizzle1!
>>>
>>> So let's just forget about it.
>>>
>>> Christoph, is _mm_loaddup_pd the only SSE3 intrinsic your code is
>>> using ? If yes, by using ei_pset1 instead of _mm_loaddup_pd, you can
>>> make your code work on SSE2 !
>> I guess the most important SSE3 instruction is _mm_addsub_pd, which
>> subtracts in the low element and adds in the high element. If there is
>> code that negates just one element, this could be replaced.
>>
>> Googling a bit suggests that the SSE way to do it is to XOR with
>> {-0.0, 0.0} (or the other way around). I will try that ...
>>
>> Christoph
--
----------------------------------------------
Dipl.-Inf. Christoph Hertzberg
Cartesium 0.051
Universität Bremen
Enrique-Schmidt-Straße 5
28359 Bremen
Tel: (+49) 421-218-64252
----------------------------------------------
# HG changeset patch
# User Christoph Hertzberg <chtz@xxxxxxxxxxxxxxxxxxxxxxxx>
# Date 1278970247 -7200
# Node ID eeff20a8804b7cdbf34f8be7c80b69993d1f0dc8
# Parent a9c28fb21cba3cdf81b1964d2510afc1682569eb
Implemented SSE optimized double-precision Quaternion multiplication
diff -r a9c28fb21cba -r eeff20a8804b Eigen/src/Geometry/arch/Geometry_SSE.h
--- a/Eigen/src/Geometry/arch/Geometry_SSE.h Sun Jul 11 11:01:17 2010 +0200
+++ b/Eigen/src/Geometry/arch/Geometry_SSE.h Mon Jul 12 23:30:47 2010 +0200
@@ -64,4 +64,58 @@
}
};
+
+
+
+template<class Derived, class OtherDerived>
+struct ei_quat_product<Architecture::SSE, Derived, OtherDerived, double, Aligned>
+{
+ inline static Quaternion<double> run(const QuaternionBase<Derived>& _a, const QuaternionBase<OtherDerived>& _b)
+ {
+ const Packet2d mask = _mm_castsi128_pd(_mm_set_epi32(0x0,0x0,0x80000000,0x0));
+
+ Quaternion<double> res;
+
+ const double* a = _a.coeffs().data();
+ Packet2d b_xy = _b.coeffs().template packet<Aligned>(0);
+ Packet2d b_zw = _b.coeffs().template packet<Aligned>(2);
+ Packet2d a_xx = ei_pset1(a[0]);
+ Packet2d a_yy = ei_pset1(a[1]);
+ Packet2d a_zz = ei_pset1(a[2]);
+ Packet2d a_ww = ei_pset1(a[3]);
+
+ // two temporaries:
+ Packet2d t1, t2;
+
+ /*
+ * t1 = ww*xy + yy*zw
+ * t2 = zz*xy - xx*zw
+ * res.xy = t1 +/- swap(t2)
+ */
+ t1 = ei_padd(ei_pmul(a_ww, b_xy), ei_pmul(a_yy, b_zw));
+ t2 = ei_psub(ei_pmul(a_zz, b_xy), ei_pmul(a_xx, b_zw));
+#ifdef __SSE3__
+ ei_pstore(&res.x(), _mm_addsub_pd(t1, ei_preverse(t2)));
+#else
+ ei_pstore(&res.x(), ei_padd(t1, ei_pxor(mask,ei_preverse(t2))));
+#endif
+
+ /*
+ * t1 = ww*zw - yy*xy
+ * t2 = zz*zw + xx*xy
+ * res.zw = t1 -/+ swap(t2) = swap( swap(t1) +/- t2)
+ */
+ t1 = ei_psub(ei_pmul(a_ww, b_zw), ei_pmul(a_yy, b_xy));
+ t2 = ei_padd(ei_pmul(a_zz, b_zw), ei_pmul(a_xx, b_xy));
+#ifdef __SSE3__
+ ei_pstore(&res.z(), ei_preverse(_mm_addsub_pd(ei_preverse(t1), t2)));
+#else
+ ei_pstore(&res.z(), ei_psub(t1, ei_pxor(mask,ei_preverse(t2))));
+#endif
+
+ return res;
+}
+};
+
+
#endif // EIGEN_GEOMETRY_SSE_H
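
The arithmetic of the patch's non-SSE3 path can be checked outside Eigen with a small stand-alone sketch. It swaps Eigen's ei_* wrappers for raw SSE2 intrinsics and uses unaligned loads and stores, so it is not the patched code itself. The function names are hypothetical, and it assumes an x86 target with SSE2 and quaternions stored as {x, y, z, w}.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 only; no SSE3 required

// Hypothetical stand-alone version of the non-SSE3 branch.
static void quat_mul_sse2(const double a[4], const double b[4], double res[4]) {
    const __m128d mask = _mm_set_pd(0.0, -0.0);   // flips the sign of lane 0 only
    const __m128d b_xy = _mm_loadu_pd(b);         // {bx, by}
    const __m128d b_zw = _mm_loadu_pd(b + 2);     // {bz, bw}
    const __m128d a_xx = _mm_set1_pd(a[0]);
    const __m128d a_yy = _mm_set1_pd(a[1]);
    const __m128d a_zz = _mm_set1_pd(a[2]);
    const __m128d a_ww = _mm_set1_pd(a[3]);

    // t1 = ww*xy + yy*zw, t2 = zz*xy - xx*zw
    __m128d t1 = _mm_add_pd(_mm_mul_pd(a_ww, b_xy), _mm_mul_pd(a_yy, b_zw));
    __m128d t2 = _mm_sub_pd(_mm_mul_pd(a_zz, b_xy), _mm_mul_pd(a_xx, b_zw));
    __m128d t2r = _mm_shuffle_pd(t2, t2, 1);      // swap the two lanes
    // res.x = t1[0] - t2[1], res.y = t1[1] + t2[0]
    _mm_storeu_pd(res, _mm_add_pd(t1, _mm_xor_pd(mask, t2r)));

    // t1 = ww*zw - yy*xy, t2 = zz*zw + xx*xy
    t1 = _mm_sub_pd(_mm_mul_pd(a_ww, b_zw), _mm_mul_pd(a_yy, b_xy));
    t2 = _mm_add_pd(_mm_mul_pd(a_zz, b_zw), _mm_mul_pd(a_xx, b_xy));
    t2r = _mm_shuffle_pd(t2, t2, 1);
    // res.z = t1[0] + t2[1], res.w = t1[1] - t2[0]
    _mm_storeu_pd(res + 2, _mm_sub_pd(t1, _mm_xor_pd(mask, t2r)));
}

// Textbook Hamilton product in {x, y, z, w} order, used as the reference.
static void quat_mul_scalar(const double a[4], const double b[4], double r[4]) {
    r[0] = a[3] * b[0] + a[0] * b[3] + a[1] * b[2] - a[2] * b[1];
    r[1] = a[3] * b[1] + a[1] * b[3] + a[2] * b[0] - a[0] * b[2];
    r[2] = a[3] * b[2] + a[2] * b[3] + a[0] * b[1] - a[1] * b[0];
    r[3] = a[3] * b[3] - a[0] * b[0] - a[1] * b[1] - a[2] * b[2];
}
```

For a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, both functions give {24, 48, 48, -6}. Small integer inputs keep every intermediate value exact, so the two results can be compared with ==.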