Re: [eigen] Issues regarding Quaternion-alignment and const Maps



(Letting Gael handle this, since he has now looked at this topic more closely than I have.)

2010/7/12 Christoph Hertzberg <chtz@xxxxxxxxxxxxxxxxxxxxxxxx>:
> Here comes the patch.
>
> Genericity (is that a word?) is based on Gael's suggestions (I just
> replaced ei_por with ei_pxor, and used the same mask for both addsub
> replacements).
>
> I was surprised at first that I only get a speedup of about 1.6x
> over the non-vectorized version, but then found out that originally
> -msse2 was actually slower than the version without vectorization.
> Anyway, now -msse2 and -msse3 both run faster than just -O2 (on my
> Core2 Duo).
> On my (rather archaic) Athlon64 the current sse2 version (it does not
> support sse3) is *slower* than just using -O2  :(
> Which, by the way, reminds me of the original topic of this thread ;)
>
> Christoph
>
>
> Gael Guennebaud schrieb:
>> Here is a variant using as much generic code as possible:
>>
>> const __m128d mask1  = _mm_castsi128_pd(_mm_set_epi32(0x0,0x0,0x80000000,0x0));
>> const __m128d mask2 = _mm_castsi128_pd(_mm_set_epi32(0x80000000,0x0,0x0,0x0));
>>
>> Quaternion<double> res;
>>
>> typedef ei_packet_traits<double>::type Packet;
>>
>> const double* a = _a.coeffs().data();
>> Packet b_xy = _b.coeffs().template packet<Aligned>(0);
>> Packet b_zw = _b.coeffs().template packet<Aligned>(2);
>> Packet a_xx = ei_pset1(a[0]);
>> Packet a_yy = ei_pset1(a[1]);
>> Packet a_zz = ei_pset1(a[2]);
>> Packet a_ww = ei_pset1(a[3]);
>> Packet t1, t2;
>>
>> t1 = ei_padd(ei_pmul(a_ww, b_xy), ei_pmul(a_yy, b_zw));
>> t2 = ei_psub(ei_pmul(a_zz, b_xy), ei_pmul(a_xx, b_zw));
>>
>> #ifdef __SSE3__
>> ei_pstore(&res.x(), _mm_addsub_pd(t1, ei_preverse(t2)));
>> #else
>> ei_pstore(&res.x(), ei_padd(t1, ei_por(mask1,ei_preverse(t2))));
>> #endif
>>
>> t1 = ei_psub(ei_pmul(a_ww, b_zw), ei_pmul(a_yy, b_xy));
>> t2 = ei_padd(ei_pmul(a_zz, b_zw), ei_pmul(a_xx, b_xy));
>> #ifdef __SSE3__
>> ei_pstore(&res.z(), ei_preverse(_mm_addsub_pd(ei_preverse(t1), t2)));
>> #else
>> ei_pstore(&res.z(), ei_padd(t1, ei_por(mask2,ei_preverse(t2))));
>> #endif
>>
>> return res;
>>
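To make the lane shuffling in the code above easier to follow, here is a minimal portable sketch that emulates the two-lane packets with plain arrays and checks the result against the textbook Hamilton product. The names P2, Quat, paddsub, quatmul_two_lane and quatmul_scalar are hypothetical stand-ins for illustration, not Eigen API; paddsub only assumes the lane semantics of _mm_addsub_pd (subtract in lane 0, add in lane 1).

```cpp
#include <array>
#include <cassert>

// Hypothetical two-lane packet standing in for Packet2d; not an Eigen type.
using P2 = std::array<double, 2>;
using Quat = std::array<double, 4>;  // coefficient order x, y, z, w

inline P2 pset1(double s)     { return P2{{s, s}}; }
inline P2 padd(P2 a, P2 b)    { return P2{{a[0] + b[0], a[1] + b[1]}}; }
inline P2 psub(P2 a, P2 b)    { return P2{{a[0] - b[0], a[1] - b[1]}}; }
inline P2 pmul(P2 a, P2 b)    { return P2{{a[0] * b[0], a[1] * b[1]}}; }
inline P2 preverse(P2 a)      { return P2{{a[1], a[0]}}; }
// Lane semantics of _mm_addsub_pd: subtract in lane 0, add in lane 1.
inline P2 paddsub(P2 a, P2 b) { return P2{{a[0] - b[0], a[1] + b[1]}}; }

// Same sequence of packet operations as the SSE3 branch of the quoted code.
Quat quatmul_two_lane(const Quat& a, const Quat& b) {
  const P2 b_xy{{b[0], b[1]}}, b_zw{{b[2], b[3]}};
  const P2 a_xx = pset1(a[0]), a_yy = pset1(a[1]);
  const P2 a_zz = pset1(a[2]), a_ww = pset1(a[3]);

  P2 t1 = padd(pmul(a_ww, b_xy), pmul(a_yy, b_zw));
  P2 t2 = psub(pmul(a_zz, b_xy), pmul(a_xx, b_zw));
  const P2 xy = paddsub(t1, preverse(t2));

  t1 = psub(pmul(a_ww, b_zw), pmul(a_yy, b_xy));
  t2 = padd(pmul(a_zz, b_zw), pmul(a_xx, b_xy));
  const P2 zw = preverse(paddsub(preverse(t1), t2));

  return Quat{{xy[0], xy[1], zw[0], zw[1]}};
}

// Textbook Hamilton product, used as the reference.
Quat quatmul_scalar(const Quat& a, const Quat& b) {
  return Quat{{
    a[3]*b[0] + a[0]*b[3] + a[1]*b[2] - a[2]*b[1],
    a[3]*b[1] - a[0]*b[2] + a[1]*b[3] + a[2]*b[0],
    a[3]*b[2] + a[0]*b[1] - a[1]*b[0] + a[2]*b[3],
    a[3]*b[3] - a[0]*b[0] - a[1]*b[1] - a[2]*b[2]}};
}
```

With small integer-valued inputs both functions should agree exactly, which makes the emulation a convenient cross-check before touching the intrinsics.
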
>> Actually, my recent work on the vectorization of complexes, and this
>> code, made me think that it would be a good idea to add ei_paddsub
>> and ei_psubadd functions such that we could write generic vectorized
>> code for complex numbers and quaternions (generic in the sense that it
>> would work for all vector engines).
>>
>> Here is how I see it. For instance let's take the example of the
>> quaternion multiplication. We could have a generic
>>
>> template<typename Quat> Quat ei_quatmul(Quat& a, Quat& b);
>>
>> function calling an ei_quatmul_selector, which would be specialized for
>> the following 3 configurations:
>>
>> 1 - ei_packet_traits<Quat::Scalar>::size == 2 => the above code
>> 2 - ei_packet_traits<Quat::Scalar>::size == 4 => the code we already
>> have but written in a generic way
>> 3 - otherwise => scalar path
>>
>> And we should make sure that one can specialize this function for a
>> given scalar type/vector engine in the case some specific
>> optimizations can be done.
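
The dispatch described above could look roughly like the following sketch. Everything here is a hypothetical stand-in: packet_traits mimics ei_packet_traits with assumed sizes (2 for double, 4 for float, 1 otherwise), and all three selector paths fall back to the same scalar product as a placeholder, since only the selection mechanism is being illustrated.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for ei_packet_traits; the sizes are assumptions.
template<typename Scalar> struct packet_traits         { enum { size = 1 }; };
template<>                struct packet_traits<float>  { enum { size = 4 }; };
template<>                struct packet_traits<double> { enum { size = 2 }; };

template<typename Scalar> using Quat = std::array<Scalar, 4>;  // x, y, z, w

enum class QuatMulPath { Scalar, TwoLane, FourLane };

// Scalar Hamilton product; used as the placeholder body for every path.
template<typename Scalar>
Quat<Scalar> hamilton(const Quat<Scalar>& a, const Quat<Scalar>& b) {
  return Quat<Scalar>{{
    a[3]*b[0] + a[0]*b[3] + a[1]*b[2] - a[2]*b[1],
    a[3]*b[1] - a[0]*b[2] + a[1]*b[3] + a[2]*b[0],
    a[3]*b[2] + a[0]*b[1] - a[1]*b[0] + a[2]*b[3],
    a[3]*b[3] - a[0]*b[0] - a[1]*b[1] - a[2]*b[2]}};
}

// Case 3: no usable packet size, scalar path.
template<typename Scalar, int PacketSize = packet_traits<Scalar>::size>
struct quatmul_selector {
  static constexpr QuatMulPath path = QuatMulPath::Scalar;
  static Quat<Scalar> run(const Quat<Scalar>& a, const Quat<Scalar>& b) {
    return hamilton(a, b);
  }
};

// Case 1: packet size 2 (the two-packet code would go here).
template<typename Scalar>
struct quatmul_selector<Scalar, 2> {
  static constexpr QuatMulPath path = QuatMulPath::TwoLane;
  static Quat<Scalar> run(const Quat<Scalar>& a, const Quat<Scalar>& b) {
    return hamilton(a, b);  // placeholder for the vectorized body
  }
};

// Case 2: packet size 4 (the single-packet code would go here).
template<typename Scalar>
struct quatmul_selector<Scalar, 4> {
  static constexpr QuatMulPath path = QuatMulPath::FourLane;
  static Quat<Scalar> run(const Quat<Scalar>& a, const Quat<Scalar>& b) {
    return hamilton(a, b);  // placeholder for the vectorized body
  }
};

// Generic entry point: the selector is picked from the scalar type alone.
template<typename Scalar>
Quat<Scalar> quatmul(const Quat<Scalar>& a, const Quat<Scalar>& b) {
  return quatmul_selector<Scalar>::run(a, b);
}
```

A full specialization such as quatmul_selector<double, 2> would then be the hook for optimizations tied to one scalar type or vector engine.
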
>>
>> gael.
>>
>> On Sat, Jul 10, 2010 at 1:11 AM, Christoph Hertzberg
>> <chtz@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>> Benoit Jacob wrote:
>>>> I have made a patch letting ei_pset1 use _mm_loaddup_pd when we have SSE3:
>>>>
>>>> template<> EIGEN_STRONG_INLINE Packet2d ei_pset1<double>(const double& from) {
>>>> #ifdef EIGEN_VECTORIZE_SSE3
>>>>  return _mm_loaddup_pd(&from);
>>>> #else
>>>>  Packet2d res = _mm_set_sd(from);
>>>>  return ei_vec2d_swizzle1(res, 0, 0);
>>>> #endif
>>>> }
>>>>
>>>> But guess what? It's actually not faster (perhaps even a bit slower)
>>>> than our ei_vec2d_swizzle1!
>>>>
>>>> So let's just forget about it.
>>>>
>>>> Christoph, is _mm_loaddup_pd the only SSE3 intrinsic your code is
>>>> using? If yes, by using ei_pset1 instead of _mm_loaddup_pd, you can
>>>> make your code work on SSE2!
>>> I guess the most important SSE3 instruction is _mm_addsub_pd, which
>>> subtracts in the low element and adds in the high element. If there is
>>> code that negates just one element, it could replace this instruction.
>>>
>>> Googling a bit suggests that the SSE way to do it is to XOR with
>>> {-0.0, 0.0} (or the other way around). I will try that ...
>>>
>>> Christoph
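
As a small illustration of the XOR trick mentioned above, here is a portable sketch that works on the raw bits of a double instead of SSE registers. It assumes IEEE 754 binary64 doubles; bits_of, from_bits, xor_sign, or_sign, Pair and addsub_via_xor are hypothetical helper names. It also shows why OR is not a substitute for XOR here: OR forces the sign bit on, so it only negates values that were non-negative to begin with.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a double as its 64-bit pattern and back (assumes IEEE 754 binary64).
inline std::uint64_t bits_of(double d) {
  std::uint64_t u;
  std::memcpy(&u, &d, sizeof u);
  return u;
}
inline double from_bits(std::uint64_t u) {
  double d;
  std::memcpy(&d, &u, sizeof d);
  return d;
}

// -0.0 has only the sign bit set, so XOR with it flips the sign...
inline double xor_sign(double v) { return from_bits(bits_of(v) ^ bits_of(-0.0)); }
// ...whereas OR with it sets the sign, which differs for negative inputs.
inline double or_sign(double v)  { return from_bits(bits_of(v) | bits_of(-0.0)); }

struct Pair { double lo, hi; };

// Emulates the addsub lane pattern as a + (b XOR {-0.0, 0.0}):
// the low lane becomes a.lo - b.lo, the high lane a.hi + b.hi.
inline Pair addsub_via_xor(Pair a, Pair b) {
  return Pair{a.lo + xor_sign(b.lo), a.hi + b.hi};
}
```

With SSE2 the same idea would be a single XOR of the packet against a constant mask followed by a plain add.
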
>
>
> --
> ----------------------------------------------
> Dipl.-Inf. Christoph Hertzberg
> Cartesium 0.051
> Universität Bremen
> Enrique-Schmidt-Straße 5
> 28359 Bremen
>
> Tel: (+49) 421-218-64252
> ----------------------------------------------
>


