Re: [eigen] Issues regarding Quaternion-alignment and const Maps



Hi,

Thanks for the patch, which I've just applied together with a small
benchmark in bench/quatmul.cpp that benchmarks float and double at the
same time, both with and without vectorization.

Regarding the original topic, it seems we agreed upon adding an Options
template parameter to Quaternion and the other related geometric classes.
About the constness issue: Benoit, what do you think about the proposal
that emerged?

gael

On Tue, Jul 13, 2010 at 12:23 AM, Christoph Hertzberg
<chtz@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> Here comes the patch,
>
> genericity (is that a word?) is based on Gael's suggestions (I just
> replaced ei_por by ei_pxor, and used the same mask for both addsub
> replacements).
>
> I was surprised at first that I only got a speedup of about 1.6x
> over the non-vectorized version, but then found out that originally
> -msse2 was actually slower than the version without vectorization.
> Anyway, now -msse2 and -msse3 both run faster than just -O2 (on my
> Core2 Duo).
> On my (rather archaic) Athlon64 (which does not support SSE3), the
> current SSE2 version is *slower* than just using -O2 :(
> Which, by the way, reminds me of the original topic of this thread ;)
>
> Christoph
>
>
> Gael Guennebaud schrieb:
>> here is a variant using as much generic code as possible:
>>
>> const __m128d mask1  = _mm_castsi128_pd(_mm_set_epi32(0x0,0x0,0x80000000,0x0));
>> const __m128d mask2 = _mm_castsi128_pd(_mm_set_epi32(0x80000000,0x0,0x0,0x0));
>>
>> Quaternion<double> res;
>>
>> typedef ei_packet_traits<double>::type Packet;
>>
>> const double* a = _a.coeffs().data();
>> Packet b_xy = _b.coeffs().template packet<Aligned>(0);
>> Packet b_zw = _b.coeffs().template packet<Aligned>(2);
>> Packet a_xx = ei_pset1(a[0]);
>> Packet a_yy = ei_pset1(a[1]);
>> Packet a_zz = ei_pset1(a[2]);
>> Packet a_ww = ei_pset1(a[3]);
>> Packet t1, t2;
>>
>> t1 = ei_padd(ei_pmul(a_ww, b_xy), ei_pmul(a_yy, b_zw));
>> t2 = ei_psub(ei_pmul(a_zz, b_xy), ei_pmul(a_xx, b_zw));
>>
>> #ifdef __SSE3__
>> ei_pstore(&res.x(), _mm_addsub_pd(t1, ei_preverse(t2)));
>> #else
>> ei_pstore(&res.x(), ei_padd(t1, ei_por(mask1,ei_preverse(t2))));
>> #endif
>>
>> t1 = ei_psub(ei_pmul(a_ww, b_zw), ei_pmul(a_yy, b_xy));
>> t2 = ei_padd(ei_pmul(a_zz, b_zw), ei_pmul(a_xx, b_xy));
>> #ifdef __SSE3__
>> ei_pstore(&res.z(), ei_preverse(_mm_addsub_pd(ei_preverse(t1), t2)));
>> #else
>> ei_pstore(&res.z(), ei_padd(t1, ei_por(mask2,ei_preverse(t2))));
>> #endif
>>
>> return res;
>>
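For reference, a scalar sketch of the Hamilton product that the vectorized
snippet above computes, with coefficients stored in Eigen's (x, y, z, w)
order (quat_mul_scalar is just a hypothetical helper name, not an Eigen
function):

// res = a * b, coefficients stored as (x, y, z, w)
template<typename Scalar>
void quat_mul_scalar(const Scalar a[4], const Scalar b[4], Scalar res[4])
{
  res[0] = a[3]*b[0] + a[0]*b[3] + a[1]*b[2] - a[2]*b[1]; // x
  res[1] = a[3]*b[1] - a[0]*b[2] + a[1]*b[3] + a[2]*b[0]; // y
  res[2] = a[3]*b[2] + a[0]*b[1] - a[1]*b[0] + a[2]*b[3]; // z
  res[3] = a[3]*b[3] - a[0]*b[0] - a[1]*b[1] - a[2]*b[2]; // w
}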
>> Actually, my recent work on the vectorization of complexes, and this
>> code, made me think that it would be a good idea to add ei_paddsub
>> and ei_psubadd functions such that we could write generic vectorized
>> code for complexes and quaternions (generic in the sense that it would
>> work for every vector engine).
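As a rough idea of what such a wrapper could look like for Packet2d
(ei_paddsub is only the proposed name here, not an existing Eigen
function): subtract in the low lane and add in the high lane, matching
_mm_addsub_pd, with an SSE2 fallback that flips the sign bit of the low
lane of b before adding:

#include <emmintrin.h>
#ifdef __SSE3__
#include <pmmintrin.h>
#endif

typedef __m128d Packet2d;

inline Packet2d ei_paddsub(const Packet2d& a, const Packet2d& b)
{
#ifdef __SSE3__
  return _mm_addsub_pd(a, b);                      // (a0-b0, a1+b1)
#else
  // SSE2 fallback: XOR the sign bit of b's low lane, then add.
  const Packet2d mask = _mm_castsi128_pd(
      _mm_set_epi32(0x0, 0x0, 0x80000000, 0x0));   // sign bit of lane 0
  return _mm_add_pd(a, _mm_xor_pd(b, mask));       // (a0-b0, a1+b1)
#endif
}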
>>
>> Here is how I see it. For instance, let's take the example of
>> quaternion multiplication. We could have a generic
>>
>> template<typename Quat> Quat ei_quatmul(Quat& a, Quat& b);
>>
>> function calling an ei_quatmul_selector, which would be specialized for
>> the following three configurations:
>>
>> 1 - ei_packet_traits<Quat::Scalar>::size == 2 => the above code
>> 2 - ei_packet_traits<Quat::Scalar>::size == 4 => the code we already
>> have but written in a generic way
>> 3 - otherwise => scalar path
>>
>> And we should make sure that one can specialize this function for a
>> given scalar type/vector engine, in case some specific
>> optimizations can be done.
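A rough sketch of how that dispatch could be structured, based only on the
description above (names and details are indicative): the primary template
provides the scalar fallback, and the vectorized paths would come in as
specializations on the packet size.

template<typename Quat,
         int PacketSize = ei_packet_traits<typename Quat::Scalar>::size>
struct ei_quatmul_selector
{
  // generic scalar fallback (plain Hamilton product)
  static Quat run(const Quat& a, const Quat& b)
  {
    return Quat(a.w()*b.w() - a.x()*b.x() - a.y()*b.y() - a.z()*b.z(),
                a.w()*b.x() + a.x()*b.w() + a.y()*b.z() - a.z()*b.y(),
                a.w()*b.y() - a.x()*b.z() + a.y()*b.w() + a.z()*b.x(),
                a.w()*b.z() + a.x()*b.y() - a.y()*b.x() + a.z()*b.w());
  }
};

// Specializations would provide the vectorized paths:
//   ei_quatmul_selector<Quat,2>  -> the double/SSE code above
//   ei_quatmul_selector<Quat,4>  -> the existing float/SSE code

template<typename Quat>
Quat ei_quatmul(const Quat& a, const Quat& b)
{
  return ei_quatmul_selector<Quat>::run(a, b);
}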
>>
>> gael.
>>
>> On Sat, Jul 10, 2010 at 1:11 AM, Christoph Hertzberg
>> <chtz@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>> Benoit Jacob wrote:
>>>> I have made a patch letting ei_pset1 use _mm_loaddup_pd when we have SSE3:
>>>>
>>>> template<> EIGEN_STRONG_INLINE Packet2d ei_pset1<double>(const double& from) {
>>>> #ifdef EIGEN_VECTORIZE_SSE3
>>>>  return _mm_loaddup_pd(&from);
>>>> #else
>>>>  Packet2d res = _mm_set_sd(from);
>>>>  return ei_vec2d_swizzle1(res, 0, 0);
>>>> #endif
>>>> }
>>>>
>>>> But guess what? It's actually not faster (perhaps even a bit slower)
>>>> than our ei_vec2d_swizzle1!
>>>>
>>>> So let's just forget about it.
>>>>
>>>> Christoph, is _mm_loaddup_pd the only SSE3 intrinsic your code is
>>>> using? If yes, by using ei_pset1 instead of _mm_loaddup_pd, you can
>>>> make your code work on SSE2!
>>> I guess the most important SSE3 instruction is _mm_addsub_pd, which
>>> subtracts the first element and adds the second. If there is code which
>>> negates just one element, this could be replaced.
>>>
>>> Googling a bit suggests that the SSE2 way to do it is to XOR with
>>> {-0.0, 0.0} (or the other way around). I will try that...
>>>
>>> Christoph
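For illustration, that XOR trick could look like this in plain SSE2
(negate_low_lane is just a hypothetical helper name); it is exactly the
building block needed to emulate _mm_addsub_pd:

#include <emmintrin.h>

// Flip only the sign bit of the low lane: (v0, v1) -> (-v0, v1), so that
// _mm_add_pd(a, negate_low_lane(b)) matches _mm_addsub_pd(a, b).
inline __m128d negate_low_lane(__m128d v)
{
  const __m128d mask = _mm_set_pd(0.0, -0.0); // -0.0 in lane 0 only
  return _mm_xor_pd(v, mask);
}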
>
>
> --
> ----------------------------------------------
> Dipl.-Inf. Christoph Hertzberg
> Cartesium 0.051
> Universität Bremen
> Enrique-Schmidt-Straße 5
> 28359 Bremen
>
> Tel: (+49) 421-218-64252
> ----------------------------------------------
>


