Re: [eigen] Issues regarding Quaternion-alignment and const Maps



Here comes the patch.

Its genericity (is that a word?) is based on Gael's suggestions (I just
replaced ei_por with ei_pxor, and used the same mask for both addsub
replacements).
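
To illustrate why a single mask is enough for both addsub replacements, here
is a minimal portable sketch. The Pair type and all helper names are
hypothetical stand-ins, not Eigen API: lane[0] models the low element of a
Packet2d, and negate_low models ei_pxor with a mask that holds only the sign
bit of the low element.

```cpp
// Two-lane stand-in for an SSE2 packet of doubles; lane[0] is the low element.
// (Hypothetical helper type, not part of Eigen.)
struct Pair {
  double lane[2];
};

// Reference semantics of _mm_addsub_pd(a, b):
// subtract in the low lane, add in the high lane.
inline Pair addsub_reference(const Pair& a, const Pair& b) {
  Pair r;
  r.lane[0] = a.lane[0] - b.lane[0];
  r.lane[1] = a.lane[1] + b.lane[1];
  return r;
}

// Models ei_pxor(mask, b) for a mask holding only the low lane's sign bit:
// the low lane changes sign, the high lane is untouched.
inline Pair negate_low(const Pair& b) {
  Pair r;
  r.lane[0] = -b.lane[0];
  r.lane[1] = b.lane[1];
  return r;
}

// Fallback used for the xy half: a + xor(mask, b) behaves like addsub(a, b).
inline Pair add_masked(const Pair& a, const Pair& b) {
  const Pair m = negate_low(b);
  Pair r;
  r.lane[0] = a.lane[0] + m.lane[0];
  r.lane[1] = a.lane[1] + m.lane[1];
  return r;
}

// Fallback used for the zw half: a - xor(mask, b) yields the mirrored pattern
// (add in the low lane, subtract in the high lane) with the very same mask.
inline Pair sub_masked(const Pair& a, const Pair& b) {
  const Pair m = negate_low(b);
  Pair r;
  r.lane[0] = a.lane[0] - m.lane[0];
  r.lane[1] = a.lane[1] - m.lane[1];
  return r;
}
```

Switching between ei_padd and ei_psub on the masked operand selects which
lane is effectively subtracted, so no second mask is needed.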

I was surprised at first that I only get a speedup of about 1.6x
over the non-vectorized version, but then found out that originally
-msse2 was actually slower than the version without vectorization.
Anyway, now -msse2 and -msse3 both run faster than plain -O2 (on my
Core2 Duo).
On my (rather archaic) Athlon64, the current SSE2 version (the CPU does
not support SSE3) is *slower* than just using -O2  :(
Which, by the way, reminds me of the original topic of this thread ;)

Christoph


Gael Guennebaud wrote:
> Here is a variant that uses generic code as much as possible:
> 
> const __m128d mask1  = _mm_castsi128_pd(_mm_set_epi32(0x0,0x0,0x80000000,0x0));
> const __m128d mask2 = _mm_castsi128_pd(_mm_set_epi32(0x80000000,0x0,0x0,0x0));
> 
> Quaternion<double> res;
> 
> typedef ei_packet_traits<double>::type Packet;
> 
> const double* a = _a.coeffs().data();
> Packet b_xy = _b.coeffs().template packet<Aligned>(0);
> Packet b_zw = _b.coeffs().template packet<Aligned>(2);
> Packet a_xx = ei_pset1(a[0]);
> Packet a_yy = ei_pset1(a[1]);
> Packet a_zz = ei_pset1(a[2]);
> Packet a_ww = ei_pset1(a[3]);
> Packet t1, t2;
> 
> t1 = ei_padd(ei_pmul(a_ww, b_xy), ei_pmul(a_yy, b_zw));
> t2 = ei_psub(ei_pmul(a_zz, b_xy), ei_pmul(a_xx, b_zw));
> 
> #ifdef __SSE3__
> ei_pstore(&res.x(), _mm_addsub_pd(t1, ei_preverse(t2)));
> #else
> ei_pstore(&res.x(), ei_padd(t1, ei_por(mask1,ei_preverse(t2))));
> #endif
> 
> t1 = ei_psub(ei_pmul(a_ww, b_zw), ei_pmul(a_yy, b_xy));
> t2 = ei_padd(ei_pmul(a_zz, b_zw), ei_pmul(a_xx, b_xy));
> #ifdef __SSE3__
> ei_pstore(&res.z(), ei_preverse(_mm_addsub_pd(ei_preverse(t1), t2)));
> #else
> ei_pstore(&res.z(), ei_padd(t1, ei_por(mask2,ei_preverse(t2))));
> #endif
> 
> return res;
> 
> Actually, my recent work on the vectorization of complexes, and this
> code, made me think that it would be a good idea to add ei_paddsub
> and ei_psubadd functions so that we could write generic vectorized
> code for complex numbers and quaternions (generic in the sense that it
> would work for all vector engines).
> 
> Here is how I see it. For instance, let's take the example of the
> quaternion multiplication. We could have a generic
> 
> template<typename Quat> Quat ei_quatmul(Quat& a, Quat& b);
> 
> function calling an ei_quatmul_selector, which would be specialized for
> the following 3 configurations:
> 
> 1 - ei_packet_traits<Quat::Scalar>::size == 2 => the above code
> 2 - ei_packet_traits<Quat::Scalar>::size == 4 => the code we already
> have but written in a generic way
> 3 - otherwise => scalar path
> 
> And we should make sure that one can specialize this function for a
> given scalar type/vector engine in case some specific
> optimizations can be done.
> 
> gael.
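
The selector idea proposed above could be sketched roughly as follows. This
is only an illustration under stated assumptions: packet_size is a
hypothetical stand-in for ei_packet_traits<Scalar>::size, Quat is a minimal
placeholder type, and both packet specializations simply fall back to the
scalar kernel where real vectorized code would go.

```cpp
// Hypothetical stand-in for ei_packet_traits<Scalar>::size.
template<typename Scalar> struct packet_size         { enum { value = 1 }; };
template<>                struct packet_size<double> { enum { value = 2 }; };
template<>                struct packet_size<float>  { enum { value = 4 }; };

// Minimal quaternion type for this sketch, coefficients in x, y, z, w order.
template<typename T> struct Quat {
  typedef T Scalar;
  T x, y, z, w;
};

// Tag reporting which configuration the selector picked.
enum Path { ScalarPath, Packet2Path, Packet4Path };

// Plain scalar Hamilton product, shared by all paths in this sketch.
template<typename Q> Q quatmul_scalar(const Q& a, const Q& b) {
  Q r;
  r.x = a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y;
  r.y = a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x;
  r.z = a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w;
  r.w = a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z;
  return r;
}

// Primary template: the "otherwise => scalar path" case.
template<typename Q, int PacketSize> struct quatmul_selector {
  static Path path() { return ScalarPath; }
  static Q run(const Q& a, const Q& b) { return quatmul_scalar(a, b); }
};

// Packet size 2 (e.g. double with SSE2): two-lane kernel would go here.
template<typename Q> struct quatmul_selector<Q, 2> {
  static Path path() { return Packet2Path; }
  static Q run(const Q& a, const Q& b) { return quatmul_scalar(a, b); }  // placeholder
};

// Packet size 4 (e.g. float with SSE): four-lane kernel would go here.
template<typename Q> struct quatmul_selector<Q, 4> {
  static Path path() { return Packet4Path; }
  static Q run(const Q& a, const Q& b) { return quatmul_scalar(a, b); }  // placeholder
};

// Generic entry point: dispatches on the packet size of the scalar type.
template<typename Q> Q quatmul(const Q& a, const Q& b) {
  return quatmul_selector<Q, packet_size<typename Q::Scalar>::value>::run(a, b);
}

// Helper for inspecting the dispatch decision.
template<typename Q> Path selected_path() {
  return quatmul_selector<Q, packet_size<typename Q::Scalar>::value>::path();
}
```

A user could still add a full specialization of quatmul_selector for one
particular quaternion type to plug in an engine-specific optimization.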
> 
> On Sat, Jul 10, 2010 at 1:11 AM, Christoph Hertzberg
> <chtz@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>> Benoit Jacob wrote:
>>> I have made a patch letting ei_pset1 use _mm_loaddup_pd when we have SSE3:
>>>
>>> template<> EIGEN_STRONG_INLINE Packet2d ei_pset1<double>(const double&
>>>  from) {
>>> #ifdef EIGEN_VECTORIZE_SSE3
>>>  return _mm_loaddup_pd(&from);
>>> #else
>>>  Packet2d res = _mm_set_sd(from);
>>>  return ei_vec2d_swizzle1(res, 0, 0);
>>> #endif
>>> }
>>>
>>> But guess what? It's actually not faster (perhaps even a bit slower)
>>> than our ei_vec2d_swizzle1!
>>>
>>> So let's just forget about it.
>>>
>>> Christoph, is _mm_loaddup_pd the only SSE3 intrinsic your code is
>>> using? If yes, by using ei_pset1 instead of _mm_loaddup_pd, you can
>>> make your code work on SSE2!
>> I guess the most important SSE3 instruction is _mm_addsub_pd, which
>> subtracts in the first (low) element and adds in the second (high) one.
>> If there is code which negates just one element, this could be replaced.
>>
>> Googling a bit suggests that the SSE way to do it is to XOR with
>> {-0.0, 0.0} (or the other way around). I will try that ...
>>
>> Christoph
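
The XOR trick mentioned above can be checked on a single double with a
small portable sketch. It assumes IEEE-754 binary64 doubles; the helper
names are hypothetical and only model what a packet-wide XOR or OR does to
one lane.

```cpp
#include <cstdint>
#include <cstring>

// Bit pattern of -0.0: only the sign bit is set (assumes IEEE-754 binary64).
const std::uint64_t kSignBit = 0x8000000000000000ULL;

// Reinterpret a double as its raw 64-bit pattern.
inline std::uint64_t bits_of(double v) {
  std::uint64_t b;
  std::memcpy(&b, &v, sizeof b);
  return b;
}

// Reinterpret a raw 64-bit pattern as a double.
inline double from_bits(std::uint64_t b) {
  double v;
  std::memcpy(&v, &b, sizeof v);
  return v;
}

// XOR with -0.0 toggles the sign bit, so it negates any input.
inline double flip_sign_xor(double v) {
  return from_bits(bits_of(v) ^ kSignBit);
}

// OR with -0.0 only sets the sign bit, so negative inputs stay negative.
inline double set_sign_or(double v) {
  return from_bits(bits_of(v) | kSignBit);
}
```

XOR with 0.0 (all bits zero) leaves a lane unchanged, which is why a
{-0.0, 0.0} mask negates exactly one element of the packet.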


-- 
----------------------------------------------
Dipl.-Inf. Christoph Hertzberg
Cartesium 0.051
Universität Bremen
Enrique-Schmidt-Straße 5
28359 Bremen

Tel: (+49) 421-218-64252
----------------------------------------------
# HG changeset patch
# User Christoph Hertzberg <chtz@xxxxxxxxxxxxxxxxxxxxxxxx>
# Date 1278970247 -7200
# Node ID eeff20a8804b7cdbf34f8be7c80b69993d1f0dc8
# Parent  a9c28fb21cba3cdf81b1964d2510afc1682569eb
Implemented SSE optimized double-precision Quaternion multiplication

diff -r a9c28fb21cba -r eeff20a8804b Eigen/src/Geometry/arch/Geometry_SSE.h
--- a/Eigen/src/Geometry/arch/Geometry_SSE.h	Sun Jul 11 11:01:17 2010 +0200
+++ b/Eigen/src/Geometry/arch/Geometry_SSE.h	Mon Jul 12 23:30:47 2010 +0200
@@ -64,4 +64,58 @@
   }
 };
 
+
+
+
+template<class Derived, class OtherDerived>
+struct ei_quat_product<Architecture::SSE, Derived, OtherDerived, double, Aligned>
+{
+  inline static Quaternion<double> run(const QuaternionBase<Derived>& _a, const QuaternionBase<OtherDerived>& _b)
+  {
+  const Packet2d mask = _mm_castsi128_pd(_mm_set_epi32(0x0,0x0,0x80000000,0x0));
+
+  Quaternion<double> res;
+
+  const double* a = _a.coeffs().data();
+  Packet2d b_xy = _b.coeffs().template packet<Aligned>(0);
+  Packet2d b_zw = _b.coeffs().template packet<Aligned>(2);
+  Packet2d a_xx = ei_pset1(a[0]);
+  Packet2d a_yy = ei_pset1(a[1]);
+  Packet2d a_zz = ei_pset1(a[2]);
+  Packet2d a_ww = ei_pset1(a[3]);
+
+  // two temporaries:
+  Packet2d t1, t2;
+
+  /*
+   * t1 = ww*xy + yy*zw
+   * t2 = zz*xy - xx*zw
+   * res.xy = t1 +/- swap(t2)
+   */
+  t1 = ei_padd(ei_pmul(a_ww, b_xy), ei_pmul(a_yy, b_zw));
+  t2 = ei_psub(ei_pmul(a_zz, b_xy), ei_pmul(a_xx, b_zw));
+#ifdef __SSE3__
+  ei_pstore(&res.x(), _mm_addsub_pd(t1, ei_preverse(t2)));
+#else
+  ei_pstore(&res.x(), ei_padd(t1, ei_pxor(mask,ei_preverse(t2))));
+#endif
+  
+  /*
+   * t1 = ww*zw - yy*xy
+   * t2 = zz*zw + xx*xy
+   * res.zw = t1 -/+ swap(t2) = swap( swap(t1) +/- t2)
+   */
+  t1 = ei_psub(ei_pmul(a_ww, b_zw), ei_pmul(a_yy, b_xy));
+  t2 = ei_padd(ei_pmul(a_zz, b_zw), ei_pmul(a_xx, b_xy));
+#ifdef __SSE3__
+  ei_pstore(&res.z(), ei_preverse(_mm_addsub_pd(ei_preverse(t1), t2)));
+#else
+  ei_pstore(&res.z(), ei_psub(t1, ei_pxor(mask,ei_preverse(t2))));
+#endif
+
+  return res;
+}
+};
+
+
 #endif // EIGEN_GEOMETRY_SSE_H
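
As a sanity check of the t1/t2 decomposition used in the patch above, the
following scalar sketch replays the same steps lane by lane and can be
compared against the textbook Hamilton product. Q4 and both function names
are hypothetical; index 0 of each array models the low lane of a packet.

```cpp
// Plain quaternion with coefficients in x, y, z, w order (hypothetical type).
struct Q4 {
  double x, y, z, w;
};

// Textbook Hamilton product a * b.
inline Q4 quatmul_textbook(const Q4& a, const Q4& b) {
  Q4 r;
  r.x = a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y;
  r.y = a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x;
  r.z = a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w;
  r.w = a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z;
  return r;
}

// Scalar replay of the packet-based steps, two lanes at a time.
inline Q4 quatmul_decomposed(const Q4& a, const Q4& b) {
  Q4 r;

  // t1 = ww*xy + yy*zw, t2 = zz*xy - xx*zw
  const double t1[2] = { a.w * b.x + a.y * b.z, a.w * b.y + a.y * b.w };
  const double t2[2] = { a.z * b.x - a.x * b.z, a.z * b.y - a.x * b.w };
  // res.xy = addsub(t1, reverse(t2)): low lane subtracts, high lane adds.
  r.x = t1[0] - t2[1];
  r.y = t1[1] + t2[0];

  // u1 = ww*zw - yy*xy, u2 = zz*zw + xx*xy
  const double u1[2] = { a.w * b.z - a.y * b.x, a.w * b.w - a.y * b.y };
  const double u2[2] = { a.z * b.z + a.x * b.x, a.z * b.w + a.x * b.y };
  // res.zw uses the mirrored pattern: low lane adds, high lane subtracts.
  r.z = u1[0] + u2[1];
  r.w = u1[1] - u2[0];

  return r;
}
```

For small integer-valued inputs both functions are exact, so their results
can be compared with ==.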

