Re: [eigen] SSE questions



2010/2/2 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
> On Tue, Feb 2, 2010 at 8:10 PM, Radu Bogdan Rusu <rusu@xxxxxxxxxxxxxxxx> wrote:
>> Benoit,
>>
>> Thanks for the fast reply.
>>
>> Benoit Jacob wrote:
>>>
>>> 2010/2/1 Radu Bogdan Rusu <rusu@xxxxxxxxxxxxxxxx>:
>>>>
>>>> Hi all,
>>>>
>>>> I have a few questions regarding the use of SSE instructions in the Eigen
>>>> 2.x branch (2.0.11 to be more exact). I've looked at the generated
>>>> assembly
>>>> for some of them, but I just want to double check this with the Eigen
>>>> developers.
>>>>
>>>> 1) Why isn't a Vector4f constructor converted into an _mm_set_ps on an
>>>> SSE
>>>> platform? Looking through Core/arch/SSE, I did not find any reference to
>>>> _mm_set_ps.
>>>
>>> Good question. For now, the Vector4f constructor taking 4 coordinates
>>> indeed copies them without SSE; _mm_set_ps is exactly what we need
>>> here. I understand that it could give a real improvement when the
>>> Vector4f thus constructed is used right away in an expression. Patches
>>> welcome :)
>>
>> To generate a clean patch, I would have to dive fairly deep into the
>> Eigen code, but I suppose something along the lines of:
>>
>> template<> EIGEN_STRONG_INLINE __m128
>> ei_pset<__m128>(const float& a, const float& b, const float& c, const float& d)
>> { return _mm_set_ps(d, c, b, a); }
>>
>> in SSE/PacketMath.h?
>>
>> I assume there is some other stuff that you would need to add to make
>> it work (AltiVec support?), plus a generic fallback in
>> GenericPacketMath.h:
>>
>> template<typename Packet, typename Scalar> inline typename
>> ei_packet_traits<Scalar>::type ei_pset(const Scalar& a, const Scalar& b,
>>   const Scalar& c, const Scalar& d) { ... }
>>
>> I will try to see if that works later today.
>
> Adding an ei_pset() function is the first step. The second is to make
> Eigen use it, i.e., you have to adapt the respective Matrix ctors to
> call it when SSE is enabled. To this end, you have to mimic what is
> done for the Quaternion product in Eigen/src/Geometry/Quaternion.h
> (lines 347-373) and Eigen/src/Geometry/arch/.
>
> Also, I did a quick benchmark to see whether this is worth the effort,
> and I confirm it definitely is!
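>
> Roughly, I would expect something along these lines (an untested sketch
> using Eigen 2.0-style names; the ctor hook at the end is hypothetical,
> the real dispatch should mimic Geometry/arch/):
>
>   // GenericPacketMath.h: scalar fallback, just loads the 4 coefficients
>   template<typename Scalar> inline typename ei_packet_traits<Scalar>::type
>   ei_pset(const Scalar& a, const Scalar& b, const Scalar& c, const Scalar& d)
>   { const Scalar v[4] = { a, b, c, d }; return ei_ploadu(v); }
>
>   // SSE/PacketMath.h: one _mm_set_ps instead of four scalar stores
>   template<> EIGEN_STRONG_INLINE __m128
>   ei_pset<float>(const float& a, const float& b, const float& c, const float& d)
>   { return _mm_set_ps(d, c, b, a); }
>
>   // and, when SSE is enabled, the Vector4f coefficient ctor would do:
>   ei_pstore(this->data(), ei_pset(x, y, z, w));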
>
>>>> 2) Is there any interest in having a specialized 3x3 covariance matrix
>>>> estimation method for the SSE case?
>>>
>>> At this stage I wouldn't make such heavy changes in 2.0, but we can
>>> discuss this for the development branch. I'm not sure how you would
>>> work around the alignment issues at runtime. By copying the matrix
>>> into a temporary 4x4 matrix?
>>
>> What I meant was something like:
>>
>> Eigen::Matrix3f covariance_matrix = Eigen::Matrix3f::Zero ();
>>
>> for loop goes here....
>> {
>>  m128Wrapper point16 = ...;
>>
>>  // Prepare the shufflers
>>  xxxy = point16.shuffle<0, 0, 0, 1> ();
>>  yyzz = point16.shuffle<1, 1, 2, 2> ();
>>  xyzx = point16.shuffle<0, 1, 2, 0> ();
>>  yzxy = point16.shuffle<1, 2, 0, 1> ();
>>
>>  // Multiply 4 + 4
>>  m128Wrapper mat_ptr1 = xxxy * xyzx;
>>  m128Wrapper mat_ptr2 = yyzz * yzxy;
>>  *(__m128*)&covariance_matrix (0, 0) += mat_ptr1.value;
>>  *(__m128*)&covariance_matrix (1, 1) += mat_ptr2.value;
>>  covariance_matrix (2, 2) += point16[2] * point16[2];
>> }
>>
>> where shuffle is a simple _mm_shuffle_ps, and point16 is an SSE-aligned
>> __m128 wrapper structure. It's hard to make this any faster; the
>> shuffles above do all the work :)
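>>
>> For reference, shuffle<0, 0, 0, 1> () above is just a thin wrapper around
>> _mm_shuffle_ps; it boils down to something like this (note that
>> _MM_SHUFFLE takes the element indices in reverse order):
>>
>>   __m128 xxxy = _mm_shuffle_ps(point16.value, point16.value,
>>                                _MM_SHUFFLE(1, 0, 0, 0)); // elements 0,0,0,1
>>   __m128 yyzz = _mm_shuffle_ps(point16.value, point16.value,
>>                                _MM_SHUFFLE(2, 2, 1, 1)); // elements 1,1,2,2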
>
> How much faster is it? If it's worth it, we could add a
> MatrixBase::rankUpdate(const MatrixBase<OtherDerived>&) function with
> a specialization for Vector3f-like vectors; the API would then be:
>
> for(...)
>  cov_mat.rankUpdate(pt[i]);
>
> This is consistent with the SelfadjointView API:
>
> for(...)
>  cov_mat.selfadjointView<Upper>().rankUpdate(pt[i]);
>  // here, only the upper triangular part is updated
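>
> For a single vector u, such a rankUpdate(u) would simply perform the
> rank-1 update cov_mat += u * u.transpose(), so the plain (unvectorized)
> Eigen equivalent of your loop would be:
>
>  for(...)
>    cov_mat += pt[i] * pt[i].transpose();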
>
>>>> 4) Is this the recommended optimized way to get a dot product between a
>>>> VectorXf and a Vector4f ?
>>>>
>>>> float d = ((Eigen::Vector4f)my_vectorxf).start<4>().dot (my_vector4f);
>>
>> [...]
>>>
>>> my_vectorxf.start<4>().dot(my_vector4f)
>>
>> It seems to work without the <4> as well, if my_vectorxf was assigned
>> from a Vector4f beforehand...
>>
>> Eigen::VectorXf my_vectorxf;
>> my_vectorxf = Eigen::Vector4f (x, y, z, a);
>>
>> float d = my_vectorxf.dot (my_vector4f);
>
> Yes, but only if my_vectorxf.size() == 4, so why not use a Vector4f
> directly?
>
>> This is guaranteed to get optimized, right?
>
> Not yet, because we currently do not take care to pick the smallest
> compile-time size of the two vectors. But if you use two Vector4f
> (or my_vectorxf.start<4>()), then yes, it is optimized.
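>
> In other words, the patterns that are guaranteed to vectorize today are
> (sketch):
>
>  Eigen::Vector4f v(x, y, z, a);
>  float d1 = v.dot(my_vector4f);                       // two fixed-size vectors
>  float d2 = my_vectorxf.start<4>().dot(my_vector4f);  // fixed-size view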
>
>>
>> Which brings me to the next point :)
>>
>>
>> 5) Can we also add a dot-product optimization for SSE4 (_mm_dp_ps)?
>>
>> http://www.intel.com/technology/itj/2008/v12i3/3-paper/6-examples.htm
>>
>>
>> PS. The point of my e-mails is that I am trying to get rid of an SSE
>> structure wrapper that I wrote and go with Eigen::Vector4f all the way.
>> That should make the code more maintainable.
>
> I thought Benoit already did it?

Nope, the only SSE4 stuff I did was integer multiplication.

This is actually on our TODO,
http://eigen.tuxfamily.org/index.php?title=Todo_for_3.0
but with very low priority since it only benefits SSE4 hardware.
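
For the record, the intrinsic itself is straightforward; an SSE4.1
specialization would boil down to something like this (untested sketch):

  #include <smmintrin.h>  // SSE4.1

  // imm8 = 0xF1: multiply all four lanes, write the sum into lane 0 only
  inline float dot4_sse41(__m128 a, __m128 b)
  { return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1)); }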

Benoit

>
> gael.
>
>>
>> Cheers,
>> Radu.
>> --
>> | Radu Bogdan Rusu | http://rbrusu.com/


