Re: [eigen] SSE questions



On Tue, Feb 2, 2010 at 8:10 PM, Radu Bogdan Rusu <rusu@xxxxxxxxxxxxxxxx> wrote:
> Benoit,
>
> Thanks for the fast reply.
>
> Benoit Jacob wrote:
>>
>> 2010/2/1 Radu Bogdan Rusu <rusu@xxxxxxxxxxxxxxxx>:
>>>
>>> Hi all,
>>>
>>> I have a few questions regarding the use of SSE instructions in the Eigen
>>> 2.x branch (2.0.11 to be more exact). I've looked at the generated
>>> assembly
>>> for some of them, but I just want to double check this with the Eigen
>>> developers.
>>>
>>> 1) Why isn't a Vector4f constructor converted into an _mm_set_ps on an
>>> SSE
>>> platform? Looking through Core/arch/SSE, I did not find any reference to
>>> _mm_set_ps.
>>
>> Good question. For now, the Vector4f constructor taking 4 coordinates
>> indeed copies them without SSE. Indeed, _mm_set_ps is what we need
>> here. I understand that it could give a real improvement when the
>> Vector4f thus constructed is used right away in an expression. Patches
>> welcome :)
>
> To generate a clean patch, I would have to dive a bit too deep into the Eigen
> code, but I suppose something along the lines of:
>
> template<> EIGEN_STRONG_INLINE __m128  ei_pset<__m128>(const float& a, const
> float& b, const float& c, const float& d) { return _mm_set_ps(d,c,b,a); }
>
> in SSE/PacketMath.h?
>
> I assume there must be some other stuff that you guys would need to add to make
> it work (like AltiVec support?), or a generic version in GenericPacketMath:
>
> template<typename Packet, typename Scalar> inline typename
> ei_packet_traits<Scalar>::type ei_pset(const Scalar& a, const Scalar& b,
> const Scalar &c, const Scalar &d) {....
>
> I will try to see if that works later today.

Adding an ei_pset() function is the first step. The second is to make
Eigen use it, i.e., you have to adapt the respective Matrix ctors to
use it when SSE is enabled. To this end, you have to mimic what is
done for the Quaternion product in Eigen/src/Geometry/Quaternion.h
(lines 347-373) and Eigen/src/Geometry/arch/.

Also, I did a quick benchmark to see whether this is worth the effort, and I
confirm that it definitely is!
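
For the record, here is a rough standalone sketch of what the primitive could
look like (untested; the ctor hook in the last comment is only indicative, the
real integration means adapting the Matrix ctors as described above):

#include <xmmintrin.h>

// Generic declaration (would live in GenericPacketMath.h).
template<typename Packet>
inline Packet ei_pset(const float& a, const float& b,
                      const float& c, const float& d);

// SSE specialization (would live in SSE/PacketMath.h). Note that _mm_set_ps
// takes its arguments from the highest lane down to the lowest, hence the
// reversed order.
template<>
inline __m128 ei_pset<__m128>(const float& a, const float& b,
                              const float& c, const float& d)
{
  return _mm_set_ps(d, c, b, a);
}

// Conceptually, the 4-argument Vector4f ctor would then perform a single
// aligned packet store instead of four scalar stores, something like:
//   ei_pstore(data(), ei_pset<__m128>(x, y, z, w));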

>>> 2) Is there any interest in having a specialized 3x3 covariance matrix
>>> estimation method for the SSE case?
>>
>> At this stage I wouldn't do such heavy changes in 2.0, but we can
>> discuss this for the development branch. I'm not sure how you would
>> work around the alignment issues at runtime. By copying the matrix
>> into a temporary 4x4 matrix?
>
> What I meant was something like:
>
> Eigen::Matrix3f covariance_matrix = Eigen::Matrix3f::Zero ();
>
> for loop goes here....
> {
>  m128Wrapper point16 = ...;
>
>  // Prepare the shufflers
>  xxxy = point16.shuffle<0, 0, 0, 1> ();
>  yyzz = point16.shuffle<1, 1, 2, 2> ();
>  xyzx = point16.shuffle<0, 1, 2, 0> ();
>  yzxy = point16.shuffle<1, 2, 0, 1> ();
>
>  // Multiply 4 + 4
>  m128Wrapper mat_ptr1 = xxxy * xyzx;
>  m128Wrapper mat_ptr2 = yyzz * yzxy;
>  *(__m128*)&covariance_matrix (0, 0) += mat_ptr1.value;
>  *(__m128*)&covariance_matrix (1, 1) += mat_ptr2.value;
>  covariance_matrix (2, 2) += point16[2] * point16[2];
> }
>
> where the shuffle is a simple _mm_shuffle_ps, and point16 is an SSE aligned
> __m128 wrapper structure. It's hard to get things faster than this without
> the above shufflers :)

How much faster is it? If it's worth it, we could add a
MatrixBase::rankUpdate(const MatrixBase<OtherDerived>&) function with
a specialization for Vector3f-like vectors; the API would then be:

for(...)
  cov_mat.rankUpdate(pt[i]);

which is consistent with the SelfadjointView API:

for(...)
  cov_mat.selfadjointView<Upper>().rankUpdate(pt[i]); // only the upper triangular part is updated
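
For reference, here is a rough standalone sketch of what the Vector3f
specialization could do internally, reusing the shuffle trick above (untested;
the function name, the padded point layout and the raw-pointer interface are
only illustrative, not an existing Eigen API):

#include <xmmintrin.h>

// cov points to a column-major 3x3 float matrix (9 floats); p points to a
// 16-byte aligned float[4] holding (x, y, z, unused). Computes cov += p * p^T,
// relying on the symmetry of the result.
inline void rank1_update_3f(float* cov, const float* p)
{
  __m128 xyzw = _mm_load_ps(p);
  __m128 xxxy = _mm_shuffle_ps(xyzw, xyzw, _MM_SHUFFLE(1,0,0,0)); // (x,x,x,y)
  __m128 xyzx = _mm_shuffle_ps(xyzw, xyzw, _MM_SHUFFLE(0,2,1,0)); // (x,y,z,x)
  __m128 yyzz = _mm_shuffle_ps(xyzw, xyzw, _MM_SHUFFLE(2,2,1,1)); // (y,y,z,z)
  __m128 yzxy = _mm_shuffle_ps(xyzw, xyzw, _MM_SHUFFLE(1,0,2,1)); // (y,z,x,y)

  // cov[0..3] += (xx, xy, xz, yx) and cov[4..7] += (yy, yz, zx, zy);
  // unaligned loads/stores because 9 floats do not keep every column aligned.
  _mm_storeu_ps(cov,     _mm_add_ps(_mm_loadu_ps(cov),     _mm_mul_ps(xxxy, xyzx)));
  _mm_storeu_ps(cov + 4, _mm_add_ps(_mm_loadu_ps(cov + 4), _mm_mul_ps(yyzz, yzxy)));
  cov[8] += p[2] * p[2];  // the (2,2) entry
}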

>>> 4) Is this the recommended optimized way to get a dot product between a
>>> VectorXf and a Vector4f ?
>>>
>>> float d = ((Eigen::Vector4f)my_vectorxf).start<4>().dot (my_vector4f);
>
> [...]
>>
>> my_vectorxf.start<4>().dot(my_vector4f)
>
> It seems like it also works without the <4> if my_vectorxf was set to a
> Vector4f a priori...
>
> Eigen::VectorXf my_vectorxf;
> my_vectorxf = Eigen::Vector4f (x, y, z, a);
>
> float d = my_vectorxf.dot (my_vector4f);

Yes, but only if my_vectorxf.size() == 4, so why not directly use a
Vector4f?

> This is guaranteed to get optimized, right?

Not yet, because we currently do not take care to pick the lowest
compile-time size of the two vectors, but if you use two Vector4f
(or use my_vectorxf.start<4>()), then yes, it is optimized.
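
Concretely (Eigen 2 API; the numeric values are just placeholders):

#include <Eigen/Core>

Eigen::Vector4f a(1.f, 2.f, 3.f, 4.f);
Eigen::Vector4f b(5.f, 6.f, 7.f, 8.f);
float d1 = a.dot(b);             // both sizes known at compile time -> vectorized

Eigen::VectorXf v = a;           // dynamic-size copy of a
float d2 = v.start<4>().dot(b);  // fixed-size segment -> vectorized as well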

>
> Which brings me to the next point :)
>
>
> 5) Can we also add a dot product optimization for SSE4 (_mm_dp_ps)?
>
> http://www.intel.com/technology/itj/2008/v12i3/3-paper/6-examples.htm
>
>
> PS. The point of my e-mails is that I am trying to get rid of an SSE
> structure wrapper that I wrote and go with Eigen::Vector4f all the way. That
> should make the code more maintainable.

I thought Benoit already did it?
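
Regarding 5, for reference _mm_dp_ps would be used roughly along these lines
(untested sketch; requires SSE4.1, i.e. compiling with -msse4.1):

#include <smmintrin.h>  // SSE4.1

// 4-float dot product: the 0xF1 mask multiplies all four lanes and writes
// the horizontal sum into lane 0 of the result.
inline float dot4_sse41(const float* a, const float* b)
{
  __m128 d = _mm_dp_ps(_mm_loadu_ps(a), _mm_loadu_ps(b), 0xF1);
  return _mm_cvtss_f32(d);
}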

gael.

>
> Cheers,
> Radu.
> --
> | Radu Bogdan Rusu | http://rbrusu.com/