Re: [eigen] Issues regarding Quaternion-alignment and const Maps


Here is a variant that uses as much generic code as possible:

const __m128d mask1  = _mm_castsi128_pd(_mm_set_epi32(0x0,0x0,0x80000000,0x0));
const __m128d mask2 = _mm_castsi128_pd(_mm_set_epi32(0x80000000,0x0,0x0,0x0));

Quaternion<double> res;

typedef ei_packet_traits<double>::type Packet;

const double* a = _a.coeffs().data();
Packet b_xy = _b.coeffs().template packet<Aligned>(0);
Packet b_zw = _b.coeffs().template packet<Aligned>(2);
Packet a_xx = ei_pset1(a[0]);
Packet a_yy = ei_pset1(a[1]);
Packet a_zz = ei_pset1(a[2]);
Packet a_ww = ei_pset1(a[3]);
Packet t1, t2;

t1 = ei_padd(ei_pmul(a_ww, b_xy), ei_pmul(a_yy, b_zw));
t2 = ei_psub(ei_pmul(a_zz, b_xy), ei_pmul(a_xx, b_zw));

// res.xy = (t1[0] - t2[1], t1[1] + t2[0])
#ifdef __SSE3__
ei_pstore(&res.x(), _mm_addsub_pd(t1, ei_preverse(t2)));
#else
ei_pstore(&res.x(), ei_padd(t1, ei_pxor(mask1,ei_preverse(t2))));
#endif

t1 = ei_psub(ei_pmul(a_ww, b_zw), ei_pmul(a_yy, b_xy));
t2 = ei_padd(ei_pmul(a_zz, b_zw), ei_pmul(a_xx, b_xy));
// res.zw = (t1[0] + t2[1], t1[1] - t2[0])
#ifdef __SSE3__
ei_pstore(&res.z(), ei_preverse(_mm_addsub_pd(ei_preverse(t1), t2)));
#else
ei_pstore(&res.z(), ei_padd(t1, ei_pxor(mask2,ei_preverse(t2))));
#endif

return res;
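
For anyone who wants to check the packet code by hand, here is a minimal scalar sketch of the same computation. It is only an illustration: the names Quat4 and quatmul_reference are made up for this example and are not part of Eigen. Coefficients are assumed to be stored as (x, y, z, w), and each two-element array plays the role of one Packet of two doubles.

```cpp
#include <array>

// Hypothetical helper type: coefficients stored as (x, y, z, w).
using Quat4 = std::array<double, 4>;

// Scalar emulation of the two-lane packet computation. Each step mirrors
// one line of the vectorized code, with t1/t2 as two-element "packets".
inline Quat4 quatmul_reference(const Quat4& a, const Quat4& b)
{
    const double ax = a[0], ay = a[1], az = a[2], aw = a[3];
    const double bx = b[0], by = b[1], bz = b[2], bw = b[3];
    Quat4 r;

    // First half: t1 = a_ww*b_xy + a_yy*b_zw, t2 = a_zz*b_xy - a_xx*b_zw
    double t1[2] = { aw * bx + ay * bz, aw * by + ay * bw };
    double t2[2] = { az * bx - ax * bz, az * by - ax * bw };
    // res.xy = t1 + (-reverse(t2)[0], +reverse(t2)[1])
    r[0] = t1[0] - t2[1];
    r[1] = t1[1] + t2[0];

    // Second half: t1 = a_ww*b_zw - a_yy*b_xy, t2 = a_zz*b_zw + a_xx*b_xy
    t1[0] = aw * bz - ay * bx;  t1[1] = aw * bw - ay * by;
    t2[0] = az * bz + ax * bx;  t2[1] = az * bw + ax * by;
    // res.zw = t1 + (+reverse(t2)[0], -reverse(t2)[1])
    r[2] = t1[0] + t2[1];
    r[3] = t1[1] - t2[0];

    return r;
}
```

The sign pattern shows why the two halves need different masks: the xy half negates the first element of the reversed t2, the zw half negates the second.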

Actually, my recent work on the vectorization of complexes, together
with this code, made me think that it would be a good idea to add
ei_paddsub and ei_psubadd functions, so that we could write generic
vectorized code for complexes and quaternions (generic in the sense
that it would work for all vector engines).
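
To make the idea concrete, here is a rough sketch of what such a pair could look like for two doubles on plain SSE2. The names sketch_paddsub and sketch_psubadd are hypothetical, chosen for this example only, and the lane convention (subtract in lane 0, add in lane 1 for paddsub) is an assumption copied from _mm_addsub_pd. The sign flip is done by XORing with a packet that holds -0.0 in the lane to negate.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helpers, not part of Eigen's API.
// sketch_paddsub: lane 0 = a0 - b0, lane 1 = a1 + b1 (like _mm_addsub_pd)
inline __m128d sketch_paddsub(__m128d a, __m128d b)
{
    // _mm_set_pd takes (lane 1, lane 0), so -0.0 ends up in lane 0.
    // XOR with -0.0 flips only the sign bit of that lane.
    const __m128d sign_lane0 = _mm_set_pd(0.0, -0.0);
    return _mm_add_pd(a, _mm_xor_pd(b, sign_lane0));
}

// sketch_psubadd: lane 0 = a0 + b0, lane 1 = a1 - b1
inline __m128d sketch_psubadd(__m128d a, __m128d b)
{
    const __m128d sign_lane1 = _mm_set_pd(-0.0, 0.0);
    return _mm_add_pd(a, _mm_xor_pd(b, sign_lane1));
}
```

An SSE3 build could map sketch_paddsub directly to _mm_addsub_pd, and other vector engines would supply their own versions behind the same two names.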

Here is how I see it. Take quaternion multiplication as an example.
We could have a generic

template<typename Quat> Quat ei_quatmul(const Quat& a, const Quat& b);

function calling an ei_quatmul_selector, which would be specialized
for the following three configurations:

1 - ei_packet_traits<Quat::Scalar>::size == 2 => the above code
2 - ei_packet_traits<Quat::Scalar>::size == 4 => the code we already
have but written in a generic way
3 - otherwise => scalar path

And we should make sure that one can specialize this function for a
given scalar type/vector engine in case some specific optimizations
are possible.
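
The dispatch described above could be sketched roughly as follows. Everything here is hypothetical and only meant to show the selection mechanism: sketch_packet_traits stands in for ei_packet_traits (with SSE-like sizes assumed, 4 for float and 2 for double), and the selectors return a label instead of doing the actual multiplication.

```cpp
#include <string>

// Hypothetical stand-in for ei_packet_traits: only the packet size matters.
template<typename Scalar> struct sketch_packet_traits { enum { size = 1 }; };
template<> struct sketch_packet_traits<float>  { enum { size = 4 }; };
template<> struct sketch_packet_traits<double> { enum { size = 2 }; };

// Case 3: primary template, scalar fallback.
template<typename Scalar, int PacketSize = sketch_packet_traits<Scalar>::size>
struct sketch_quatmul_selector {
    static const char* path() { return "scalar"; }
};

// Case 1: two coefficients per packet (e.g. double on SSE).
template<typename Scalar>
struct sketch_quatmul_selector<Scalar, 2> {
    static const char* path() { return "packet of 2"; }
};

// Case 2: four coefficients per packet (e.g. float on SSE).
template<typename Scalar>
struct sketch_quatmul_selector<Scalar, 4> {
    static const char* path() { return "packet of 4"; }
};

// A full specialization for one scalar type still takes precedence,
// which leaves room for type-specific or engine-specific optimizations.
template<>
struct sketch_quatmul_selector<long double, 1> {
    static const char* path() { return "custom"; }
};

// Entry point: reports which path would be selected for Scalar.
template<typename Scalar>
std::string sketch_quatmul_path()
{
    return sketch_quatmul_selector<Scalar>::path();
}
```

In a real implementation each selector would expose a run(a, b) function holding the corresponding code path, and ei_quatmul would simply forward to it.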


On Sat, Jul 10, 2010 at 1:11 AM, Christoph Hertzberg
<chtz@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> Benoit Jacob wrote:
>> I have made a patch letting ei_pset1 use _mm_loaddup_pd when we have SSE3:
>> template<> EIGEN_STRONG_INLINE Packet2d ei_pset1<double>(const double&
>>  from) {
>> #ifdef __SSE3__
>>  return _mm_loaddup_pd(&from);
>> #else
>>  Packet2d res = _mm_set_sd(from);
>>  return ei_vec2d_swizzle1(res, 0, 0);
>> #endif
>> }
>> But guess what? It's actually not faster (perhaps even a bit slower)
>> than our ei_vec2d_swizzle1!
>> So let's just forget about it.
>> Christoph, is _mm_loaddup_pd the only SSE3 intrinsic your code is
>> using? If yes, by using ei_pset1 instead of _mm_loaddup_pd, you can
>> make your code work on SSE2!
> I guess the most important SSE3 instruction is _mm_addsub_pd which adds the
> first and subtracts the second element. If there is a code which negates
> just one element, this could be replaced.
> Googling a bit suggests that the SSE way to do it is to XOR with
> {-0.0, 0.0} (or the other way around). I will try that ...
> Christoph
> --
> ----------------------------------------------
> Dipl.-Inf. Christoph Hertzberg
> Cartesium 0.051
> Universität Bremen
> Enrique-Schmidt-Straße 5
> 28359 Bremen
> Tel: (+49) 421-218-64252
> ----------------------------------------------
