Re: [eigen] SSE questions

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]


Benoit,

Thanks for the fast reply.

Benoit Jacob wrote:
2010/2/1 Radu Bogdan Rusu <rusu@xxxxxxxxxxxxxxxx>:
Hi all,

I have a few questions regarding the use of SSE instructions in the Eigen
2.x branch (2.0.11 to be more exact). I've looked at the generated assembly
for some of them, but I just want to double check this with the Eigen
developers.

1) Why isn't a Vector4f constructor converted into an _mm_set_ps on an SSE
platform? Looking through Core/arch/SSE, I did not find any reference to
_mm_set_ps.

Good question. For now, the Vector4f constructor taking 4 coordinates
indeed copies them without SSE. Indeed, _mm_set_ps is what we need
here. I understand that it could give a real improvement when the
Vector4f thus constructed is used right away in an expression. Patches
welcome :)

To generate a clean patch, I would have to dive a bit too much in the Eigen code, but I suppose something along the lines of:

template<> EIGEN_STRONG_INLINE __m128 ei_pset<__m128>(const float& a, const float& b, const float& c, const float& d) { return _mm_set_ps(d,c,b,a); }

in SSE/PacketMath.h?

I assume there must be some other stuff that you guys need to add up to make it work (like AltiVec support?) or GenericPacketMath:

template<typename Packet, typename Scalar> inline typename ei_packet_traits<Scalar>::type ei_pset(const Scalar& a, const Scalar& b, const Scalar &c, const Scalar &d) {....

I will try to see if that works later today.

2) Is there any interest in having a specialized 3x3 covariance matrix
estimation method for the SSE case?

At this stage I wouldn't do such heavy changes in 2.0, but we can
discuss this for the development branch. I'm not sure how you would
work around the alignment issues at runtime. By copying the matrix
into a temporary 4x4 matrix?

What I meant was something like:

Eigen::Matrix3f covariance_matrix = Eigen::Matrix3f::Zero ();

for loop goes here....
{
  m128Wrapper point16 = ...;

  // Prepare the shufflers
  xxxy = point16.shuffle<0, 0, 0, 1> ();
  yyzz = point16.shuffle<1, 1, 2, 2> ();
  xyzx = point16.shuffle<0, 1, 2, 0> ();
  yzxy = point16.shuffle<1, 2, 0, 1> ();

  // Multiply 4 + 4
  m128Wrapper mat_ptr1 = xxxy * xyzx;
  m128Wrapper mat_ptr2 = yyzz * yzxy;
  *(__m128*)&covariance_matrix (0, 0) += mat_ptr1.value;
  *(__m128*)&covariance_matrix (1, 1) += mat_ptr2.value;
  covariance_matrix (2, 2) += point16[2] * point16[2];
}

where the shuffle is a simple _mm_shuffle_ps, and point16 is an SSE aligned __m128 wrapper structure. It's hard to get things faster than this without the above shufflers :)

4) Is this the recommended optimized way to get a dot product between a
VectorXf and a Vector4f ?

float d = ((Eigen::Vector4f)my_vectorxf).start<4>().dot (my_vector4f);
[...]
my_vectorxf.start<4>().dot(my_vector4f)

It seems like it's also working without the <4> if my_vectorxf was set to a Vector4f a priori...

Eigen::VectorXf my_vectorxf;
my_vectorxf = Eigen::Vector4f (x, y, z, a);

float d = my_vectorxf.dot (my_vector4f);

This is guaranteed to get optimized, right?

Which brings me to the next point :)


5) Can we add in dot product optimization too for SSE4 (_mm_dp_ps) ?

http://www.intel.com/technology/itj/2008/v12i3/3-paper/6-examples.htm


PS. The point of my e-mails is that I am trying to get rid of an SSE structure wrapper that I wrote and go with Eigen::Vector4f all the way. That should make sure that the code is better maintainable.

Cheers,
Radu.
--
| Radu Bogdan Rusu | http://rbrusu.com/




Mail converted by MHonArc 2.6.19+ http://listengine.tuxfamily.org/