Re: [eigen] Issues regarding Quaternion-alignment and const Maps

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]


2010/7/13 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
> ok, that could probably do the trick, though that looks a bit overkill
> for not much gain (I guess).

I agree it's very complicated hence stuff for 3.1, but I would expect
the gains to be very significant; anyway we won't know until we've
tried ;-)

> On the other hand we could could use a
> similar mechanism to unify all expressions (const and non const)
> having direct access (Map, Block, Matrix, Transpose, Real, Imag, ???).
> Then the question is do we have a writable expression which does not
> have direct access ? Currently, Real and Imag do not, but now they
> could since we support inner-strides. And even there was one, or if we
> add some in the future, maybe that's no big deal to have some code
> duplications for them...

Yes, I agree, Real and Imag seem the only ones and they don't matter
too much, what matters is to unify plain objects, blocks, maps and
Transpose.

> This also raise the question of a mechanism to get the return type of
> an expression. It must be API stable and uniform. Currently for some
> expressions you have to use a helper class, and for some you just
> write Expr<ArgumentType>. With your proposal and/or the above variant,
> this is going to even more complicated.

I agree, we need something in this area.

I wanted to repeat that I completely agree with your discussion with
Hauke above: it's the first priority for 3.0 to enforce const
correctness; the rest is internal details that can be fixed later (we
just need to agree on a supported way of declaring named Map/Block
object, but that's all).

Benoit

>
> gael
>
> On Tue, Jul 13, 2010 at 4:32 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>> 2010/7/13 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>>> Hi,
>>>
>>> thanks for the patch that I've just applied together with a small
>>> bench in bench/quatmul.cpp which bench at the same time float/double
>>> and with/without vectorization
>>
>> Great!
>>
>>>
>>> regarding the original topic it seems we agreed upon adding an Options
>>> template parameter to Quaternion and other related geometric classes.
>>
>> OK, Indeed I agree that's a good idea.
>>
>>> About the constness issue, Benoit what do you think about the proposal
>>> that emerged ?
>>
>> I don't think that we have to, or that it's OK to accept instantiating
>> twice more code to be const-correct (except if it's ONLY for the 3.0
>> release, see below). The issue is not only about build times, it also
>> is about resulting executable size.
>>
>> We  should be able to be const-correct without instantiating twice
>> more code. Sure we _have_ to instantiate twice more templated types at
>> some level, but the trick is that we should be able to make that level
>> contain very little code.
>>
>> For example, what do you think about having Block<T> just inherit
>> Block<const T> ? Then, if we have some Expression class, there are two
>> cases:
>>
>>  * if Expression is a rvalue expression, we could make sure to only
>> ever instantiate Expression<Block<const T> >, never
>> Expression<Block<T> >. This could be done by introducing a template
>> helper ei_rvalue_base<T> that would return just T in the generic case,
>> and that would be specialized so that ei_rvalue_base<Block<T> >
>> returns ei_rvalue_base<Block<ei_const<T> >   (this is no proper C++,
>> just pseudo code).
>>
>>  * if Expression is a lvalue expression, we basically fall into the
>> discussion we recently had about unifying the code for direct-access
>> expressions. By the way, at that time it escaped to me that plain
>> matrices, themselves, should be unified with other direct-access
>> expressions! Anyway, this issue is more far-reaching.
>>
>> For Eigen 3.0, OK, I agree to be const-correct at the cost of some
>> code redundancy (as a full solution to the above problems would take a
>> long time to develop and by being strict in 3.0 we ease future
>> changes), but it will then become an even bigger priority for 3.1 to
>> fix all that.
>>
>> Cheers,
>> Benoit
>>
>>
>>>
>>> gael
>>>
>>> On Tue, Jul 13, 2010 at 12:23 AM, Christoph Hertzberg
>>> <chtz@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>> Here comes the patch,
>>>>
>>>> genericity (is that a word?) is based on Gael's suggestions (I just
>>>> replaced ei_por by ei_pxor, and used the same mask for both addsub
>>>> replacements).
>>>>
>>>> I was surprised at first that I only get a speedup of about ~1.6x
>>>> against non-vectorized version, but then found out that originally
>>>> -msse2 was actually slower than the version without vectorization.
>>>> Anyways, now -msse2 and -msse3 both run faster than just -O2 (on my
>>>> Core2 Duo).
>>>> On my (rather archaic) Athlon64 the current sse2 version (it does not
>>>> support sse3) is *slower* than just using -O2  :(
>>>> Which, by the way, reminds me of the original topic of this thread ;)
>>>>
>>>> Christoph
>>>>
>>>>
>>>> Gael Guennebaud schrieb:
>>>>> here a variant using at most as possible generic code:
>>>>>
>>>>> const __m128d mask1  = _mm_castsi128_pd(_mm_set_epi32(0x0,0x0,0x80000000,0x0));
>>>>> const __m128d mask2 = _mm_castsi128_pd(_mm_set_epi32(0x80000000,0x0,0x0,0x0));
>>>>>
>>>>> Quaternion<double> res;
>>>>>
>>>>> typedef ei_packet_traits<double>::type Packet;
>>>>>
>>>>> const double* a = _a.coeffs().data();
>>>>> Packet b_xy = _b.coeffs().template packet<Aligned>(0);
>>>>> Packet b_zw = _b.coeffs().template packet<Aligned>(2);
>>>>> Packet a_xx = ei_pset1(a[0]);
>>>>> Packet a_yy = ei_pset1(a[1]);
>>>>> Packet a_zz = ei_pset1(a[2]);
>>>>> Packet a_ww = ei_pset1(a[3]);
>>>>> Packet t1, t2;
>>>>>
>>>>> t1 = ei_padd(ei_pmul(a_ww, b_xy), ei_pmul(a_yy, b_zw));
>>>>> t2 = ei_psub(ei_pmul(a_zz, b_xy), ei_pmul(a_xx, b_zw));
>>>>>
>>>>> #ifdef __SSE3__
>>>>> ei_pstore(&res.x(), _mm_addsub_pd(t1, ei_preverse(t2)));
>>>>> #else
>>>>> ei_pstore(&res.x(), ei_padd(t1, ei_por(mask1,ei_preverse(t2))));
>>>>> #endif
>>>>>
>>>>> t1 = ei_psub(ei_pmul(a_ww, b_zw), ei_pmul(a_yy, b_xy));
>>>>> t2 = ei_padd(ei_pmul(a_zz, b_zw), ei_pmul(a_xx, b_xy));
>>>>> #ifdef __SSE3__
>>>>> ei_pstore(&res.z(), ei_preverse(_mm_addsub_pd(ei_preverse(t1), t2)));
>>>>> #else
>>>>> ei_pstore(&res.z(), ei_padd(t1, ei_por(mask2,ei_preverse(t2))));
>>>>> #endif
>>>>>
>>>>> return res;
>>>>>
>>>>> Actually, my recent work on the vectorization of complexes, and this
>>>>> code, let me thought that it would be a good idea to add ei_paddsub
>>>>> and ei_psubadd functions such that we could write generic vectorized
>>>>> code for complex and quaternions (generic in the sense it would work
>>>>> for all vector engine).
>>>>>
>>>>> Here is how I see it. For instance let's take the example of the
>>>>> quaternion multiplication. We could have a generic
>>>>>
>>>>> template<typename Quat> Quat ei_quatmul(Quat& a, Quat& b);
>>>>>
>>>>> function calling a ei_quatmul_selector which would be specialized for
>>>>> the 3 following configurations:
>>>>>
>>>>> 1 - ei_packet_traits<Quat::Scalar>::size == 2 => the above code
>>>>> 2 - ei_packet_traits<Quat::Scalar>::size == 4 => the code we already
>>>>> have but written in a generic way
>>>>> 3 - otherwise => scalar path
>>>>>
>>>>> And we should make sure that one can specialize this function for a
>>>>> given scalar type/vector engine in the case some specific
>>>>> optimizations can be done.
>>>>>
>>>>> gael.
>>>>>
>>>>> On Sat, Jul 10, 2010 at 1:11 AM, Christoph Hertzberg
>>>>> <chtz@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>> Benoit Jacob wrote:
>>>>>>> I have made a patch letting ei_pset1 use _mm_loaddup_pd when we have SSE3:
>>>>>>>
>>>>>>> template<> EIGEN_STRONG_INLINE Packet2d ei_pset1<double>(const double&
>>>>>>>  from) {
>>>>>>> #ifdef EIGEN_VECTORIZE_SSE3
>>>>>>>  return _mm_loaddup_pd(&from);
>>>>>>> #else
>>>>>>>  Packet2d res = _mm_set_sd(from);
>>>>>>>  return ei_vec2d_swizzle1(res, 0, 0);
>>>>>>> #endif
>>>>>>> }
>>>>>>>
>>>>>>> But guess what? It's actually not faster (perhaps even a bit slower)
>>>>>>> than our ei_vec2d_swizzle1!
>>>>>>>
>>>>>>> So let's just forget about it.
>>>>>>>
>>>>>>> Christoph, is  _mm_loaddup_pd the only SSE3 intrinsic your code is
>>>>>>> using ? If yes, by using ei_pset1 instead of _mm_loaddup_pd, you can
>>>>>>> make your code work on SSE2 !
>>>>>> I guess the most important SSE3 instruction is _mm_addsub_pd which adds the
>>>>>> first and subtracts the second element. If there is a code which negates
>>>>>> just one element, this could be replaced.
>>>>>>
>>>>>> Googleing a bit implies that the SSE-way to do it is to XOR with
>>>>>> {-0.0, 0.0} (or the other way around). I will try that ...
>>>>>>
>>>>>> Christoph
>>>>
>>>>
>>>> --
>>>> ----------------------------------------------
>>>> Dipl.-Inf. Christoph Hertzberg
>>>> Cartesium 0.051
>>>> Universität Bremen
>>>> Enrique-Schmidt-Straße 5
>>>> 28359 Bremen
>>>>
>>>> Tel: (+49) 421-218-64252
>>>> ----------------------------------------------
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>



Mail converted by MHonArc 2.6.19+ http://listengine.tuxfamily.org/