Re: [eigen] Issues regarding Quaternion-alignment and const Maps


2010/7/13 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
> Hi,
> thanks for the patch that I've just applied together with a small
> bench in bench/quatmul.cpp which bench at the same time float/double
> and with/without vectorization


> regarding the original topic it seems we agreed upon adding an Options
> template parameter to Quaternion and other related geometric classes.

OK, indeed I agree that's a good idea.

> About the constness issue, Benoit what do you think about the proposal
> that emerged ?

I don't think that we have to, or that it's OK to, accept instantiating
twice as much code to be const-correct (except if it's ONLY for the 3.0
release, see below). The issue is not only about build times; it is
also about the resulting executable size.

We should be able to be const-correct without instantiating twice as
much code. Sure, we _have_ to instantiate twice as many templated types
at some level, but the trick is that we should be able to make that
level contain very little code.

For example, what do you think about having Block<T> just inherit
Block<const T>? Then, if we have some Expression class, there are two
cases:

 * if Expression is an rvalue expression, we could make sure to only
ever instantiate Expression<Block<const T> >, never
Expression<Block<T> >. This could be done by introducing a template
helper ei_rvalue_base<T> that would return just T in the generic case,
and that would be specialized so that ei_rvalue_base<Block<T> >
returns ei_rvalue_base<Block<ei_const<T> > > (this is not proper C++,
just pseudo-code).

 * if Expression is an lvalue expression, we basically fall into the
discussion we recently had about unifying the code for direct-access
expressions. By the way, at that time it escaped me that plain
matrices, themselves, should be unified with the other direct-access
expressions! Anyway, that issue is more far-reaching.
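To make the rvalue case concrete, here is a minimal compilable sketch of that idea. The names mirror the pseudo-code above but the bodies are placeholders, not actual Eigen code: the non-const Block merely derives from its const counterpart, and the rewriting helper collapses rvalue expression types onto the const form, so the heavy expression code is only ever instantiated once.

```cpp
// Sketch only: hypothetical skeletons, not Eigen's real classes.
#include <type_traits>

template <typename T> struct Block;             // primary declaration

// All the heavy, read-only expression code lives in the const variant.
template <typename T> struct Block<const T> { /* read-only code */ };

// The non-const variant only adds the thin write-access layer on top.
template <typename T> struct Block : Block<const T> { /* writes */ };

// Generic case: leave the type unchanged.
template <typename T> struct ei_rvalue_base { typedef T type; };
// Specialization: an rvalue Block<T> collapses onto Block<const T>.
template <typename T> struct ei_rvalue_base<Block<T> > {
  typedef Block<const T> type;
};

// Block<double> used as an rvalue becomes Block<const double>, so
// Expression<Block<double> > never needs to be instantiated.
static_assert(std::is_same<ei_rvalue_base<Block<double> >::type,
                           Block<const double> >::value,
              "rvalue Blocks collapse onto the const variant");
```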

For Eigen 3.0, OK, I agree to be const-correct at the cost of some
code redundancy (as a full solution to the above problems would take a
long time to develop and by being strict in 3.0 we ease future
changes), but it will then become an even bigger priority for 3.1 to
fix all that.


> gael
> On Tue, Jul 13, 2010 at 12:23 AM, Christoph Hertzberg
> <chtz@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>> Here comes the patch,
>> genericity (is that a word?) is based on Gael's suggestions (I just
>> replaced ei_por by ei_pxor, and used the same mask for both addsub
>> replacements).
>> I was surprised at first that I only got a speedup of about ~1.6x
>> over the non-vectorized version, but then found out that originally
>> -msse2 was actually slower than the version without vectorization.
>> Anyway, now -msse2 and -msse3 both run faster than just -O2 (on my
>> Core2 Duo).
>> On my (rather archaic) Athlon64 the current SSE2 version (it does not
>> support SSE3) is *slower* than just using -O2 :(
>> Which, by the way, reminds me of the original topic of this thread ;)
>> Christoph
>> Gael Guennebaud schrieb:
>>> here is a variant using as much generic code as possible:
>>> const __m128d mask1  = _mm_castsi128_pd(_mm_set_epi32(0x0,0x0,0x80000000,0x0));
>>> const __m128d mask2 = _mm_castsi128_pd(_mm_set_epi32(0x80000000,0x0,0x0,0x0));
>>> Quaternion<double> res;
>>> typedef ei_packet_traits<double>::type Packet;
>>> const double* a = _a.coeffs().data();
>>> Packet b_xy = _b.coeffs().template packet<Aligned>(0);
>>> Packet b_zw = _b.coeffs().template packet<Aligned>(2);
>>> Packet a_xx = ei_pset1(a[0]);
>>> Packet a_yy = ei_pset1(a[1]);
>>> Packet a_zz = ei_pset1(a[2]);
>>> Packet a_ww = ei_pset1(a[3]);
>>> Packet t1, t2;
>>> t1 = ei_padd(ei_pmul(a_ww, b_xy), ei_pmul(a_yy, b_zw));
>>> t2 = ei_psub(ei_pmul(a_zz, b_xy), ei_pmul(a_xx, b_zw));
>>> #ifdef __SSE3__
>>> ei_pstore(&res.x(), _mm_addsub_pd(t1, ei_preverse(t2)));
>>> #else
>>> ei_pstore(&res.x(), ei_padd(t1, ei_por(mask1,ei_preverse(t2))));
>>> #endif
>>> t1 = ei_psub(ei_pmul(a_ww, b_zw), ei_pmul(a_yy, b_xy));
>>> t2 = ei_padd(ei_pmul(a_zz, b_zw), ei_pmul(a_xx, b_xy));
>>> #ifdef __SSE3__
>>> ei_pstore(&res.z(), ei_preverse(_mm_addsub_pd(ei_preverse(t1), t2)));
>>> #else
>>> ei_pstore(&res.z(), ei_padd(t1, ei_por(mask2,ei_preverse(t2))));
>>> #endif
>>> return res;
>>> Actually, my recent work on the vectorization of complexes, and this
>>> code, made me think that it would be a good idea to add ei_paddsub
>>> and ei_psubadd functions so that we could write generic vectorized
>>> code for complexes and quaternions (generic in the sense that it
>>> would work for all vector engines).
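[A sketch of what such an ei_paddsub could look like for Packet2d, for reference. The name follows the proposal above; the implementation is an assumption, not Eigen code. On SSE3 it maps directly onto _mm_addsub_pd; the SSE2 fallback flips the sign bit of the lane to subtract by XORing with -0.0 before a plain add.]

```cpp
#include <emmintrin.h>   // SSE2
#ifdef __SSE3__
#include <pmmintrin.h>   // SSE3: _mm_addsub_pd
#endif

// Hypothetical generic add/sub: returns {a0 - b0, a1 + b1}.
static inline __m128d ei_paddsub(__m128d a, __m128d b) {
#ifdef __SSE3__
  return _mm_addsub_pd(a, b);
#else
  // _mm_set_pd takes (high, low): sign bit set on the low lane only,
  // so the XOR negates b's first element and leaves the second intact.
  const __m128d mask = _mm_set_pd(0.0, -0.0);
  return _mm_add_pd(a, _mm_xor_pd(b, mask));
#endif
}
```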
>>> Here is how I see it. Let's take, for instance, the example of
>>> quaternion multiplication. We could have a generic
>>> template<typename Quat> Quat ei_quatmul(const Quat& a, const Quat& b);
>>> function calling an ei_quatmul_selector which would be specialized
>>> for the following three configurations:
>>> 1 - ei_packet_traits<Quat::Scalar>::size == 2 => the above code
>>> 2 - ei_packet_traits<Quat::Scalar>::size == 4 => the code we already
>>> have, but written in a generic way
>>> 3 - otherwise => the scalar path
>>> And we should make sure that one can specialize this function for a
>>> given scalar type/vector engine in case some specific
>>> optimizations can be done.
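[A self-contained sketch of that dispatch, for reference. The traits values and the int return codes are stand-ins so the selector's behavior can be checked in isolation; the real paths would of course return a quaternion.]

```cpp
// Toy packet traits: double packs 2-wide, float 4-wide, rest scalar.
template <typename Scalar> struct ei_packet_traits { enum { size = 1 }; };
template <> struct ei_packet_traits<double> { enum { size = 2 }; };
template <> struct ei_packet_traits<float>  { enum { size = 4 }; };

// Selector specialized on the packet size; run() returns a path id
// (0 = scalar fallback, 2 = 2-wide path, 4 = 4-wide path).
template <typename Quat, int PacketSize> struct ei_quatmul_selector {
  static int run() { return 0; }   // scalar path
};
template <typename Quat> struct ei_quatmul_selector<Quat, 2> {
  static int run() { return 2; }   // e.g. the SSE double code above
};
template <typename Quat> struct ei_quatmul_selector<Quat, 4> {
  static int run() { return 4; }   // e.g. the existing SSE float code
};

// Toy quaternion types carrying only their Scalar typedef.
struct QuatF { typedef float  Scalar; };
struct QuatD { typedef double Scalar; };
struct QuatI { typedef int    Scalar; };

// The generic entry point dispatches on the scalar's packet size.
template <typename Quat> int ei_quatmul_path() {
  return ei_quatmul_selector<
      Quat, ei_packet_traits<typename Quat::Scalar>::size>::run();
}
```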
>>> gael.
>>> On Sat, Jul 10, 2010 at 1:11 AM, Christoph Hertzberg
>>> <chtz@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>> Benoit Jacob wrote:
>>>>> I have made a patch letting ei_pset1 use _mm_loaddup_pd when we have SSE3:
>>>>> template<> EIGEN_STRONG_INLINE Packet2d ei_pset1<double>(const double&
>>>>>  from) {
>>>>> #ifdef __SSE3__
>>>>>  return _mm_loaddup_pd(&from);
>>>>> #else
>>>>>  Packet2d res = _mm_set_sd(from);
>>>>>  return ei_vec2d_swizzle1(res, 0, 0);
>>>>> #endif
>>>>> }
>>>>> But guess what? It's actually not faster (perhaps even a bit slower)
>>>>> than our ei_vec2d_swizzle1!
>>>>> So let's just forget about it.
>>>>> Christoph, is _mm_loaddup_pd the only SSE3 intrinsic your code is
>>>>> using? If yes, by using ei_pset1 instead of _mm_loaddup_pd, you can
>>>>> make your code work on SSE2!
>>>> I guess the most important SSE3 instruction is _mm_addsub_pd, which
>>>> subtracts the first element and adds the second. If there is code that
>>>> negates just one element, it could be replaced.
>>>> Googling a bit suggests that the SSE2 way to do it is to XOR with
>>>> {-0.0, 0.0} (or the other way around). I will try that ...
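[A quick check of that trick, for reference: XORing a Packet2d with {-0.0, 0.0} flips the sign of exactly one element, since -0.0 carries just the IEEE-754 sign bit. Plain intrinsics are used here rather than Eigen's wrappers.]

```cpp
#include <emmintrin.h>  // SSE2

// Negate only the first (low) lane of v via a sign-bit XOR.
static inline __m128d negate_first(__m128d v) {
  // _mm_set_pd takes (high, low): sign bit set on the low lane only.
  const __m128d signbit_lo = _mm_set_pd(0.0, -0.0);
  return _mm_xor_pd(v, signbit_lo);
}
```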
>>>> Christoph
>> --
>> ----------------------------------------------
>> Dipl.-Inf. Christoph Hertzberg
>> Cartesium 0.051
>> Universität Bremen
>> Enrique-Schmidt-Straße 5
>> 28359 Bremen
>> Tel: (+49) 421-218-64252
>> ----------------------------------------------
