Re: [eigen] SSE questions



On Wed, Feb 3, 2010 at 3:05 AM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> 2010/2/2 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>> On Tue, Feb 2, 2010 at 8:10 PM, Radu Bogdan Rusu <rusu@xxxxxxxxxxxxxxxx> wrote:
>>> Benoit,
>>>
>>> Thanks for the fast reply.
>>>
>>> Benoit Jacob wrote:
>>>>
>>>> 2010/2/1 Radu Bogdan Rusu <rusu@xxxxxxxxxxxxxxxx>:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a few questions regarding the use of SSE instructions in the Eigen
>>>>> 2.x branch (2.0.11 to be more exact). I've looked at the generated
>>>>> assembly
>>>>> for some of them, but I just want to double check this with the Eigen
>>>>> developers.
>>>>>
>>>>> 1) Why isn't a Vector4f constructor converted into an _mm_set_ps on an
>>>>> SSE
>>>>> platform? Looking through Core/arch/SSE, I did not find any reference to
>>>>> _mm_set_ps.
>>>>
>>>> Good question. For now, the Vector4f constructor taking 4 coordinates
>>>> copies them without SSE; _mm_set_ps is indeed what we need here. I
>>>> understand that it could give a real improvement when the Vector4f thus
>>>> constructed is used right away in an expression. Patches welcome :)
>>>
>>> To generate a clean patch, I would have to dive a bit too deep into the
>>> Eigen code, but I suppose something along the lines of:
>>>
>>> template<> EIGEN_STRONG_INLINE __m128 ei_pset<__m128>(const float& a,
>>>   const float& b, const float& c, const float& d)
>>> { return _mm_set_ps(d,c,b,a); }
>>>
>>> in SSE/PacketMath.h?
>>>
>>> I assume there is some other stuff you guys would need to add to make it
>>> work (like AltiVec support?), or a generic fallback in GenericPacketMath:
>>>
>>> template<typename Packet, typename Scalar>
>>> inline typename ei_packet_traits<Scalar>::type
>>> ei_pset(const Scalar& a, const Scalar& b, const Scalar &c, const Scalar &d) {....
>>>
>>> I will try to see if that works later today.
>>
>> Adding an ei_pset() function is the first step. The second is to make
>> Eigen use it, i.e., you have to adapt the respective Matrix ctors to
>> use it when SSE is enabled. To this end, you have to mimic what is
>> done for the Quaternion product in Eigen/src/Geometry/Quaternion.h
>> (lines 347-373) and Eigen/src/Geometry/arch/.
>>
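>> For illustration, here is a rough, self-contained sketch of the SSE side
>> (the names here are just illustrative, not the actual Eigen internals):
>>
>> #include <xmmintrin.h>
>>
>> // Build a whole 4-float packet in one instruction instead of four scalar
>> // stores. Note the reversed argument order: _mm_set_ps takes the highest
>> // lane first.
>> inline __m128 pset4(float a, float b, float c, float d)
>> { return _mm_set_ps(d, c, b, a); }
>>
>> int main()
>> {
>>   float out[4];
>>   _mm_storeu_ps(out, pset4(1.f, 2.f, 3.f, 4.f)); // out = {1, 2, 3, 4}
>>   return 0;
>> }
>>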
>> Also, I did a quick benchmark to see if this is worth the effort, and I
>> can confirm it definitely is!
>>
>>>>> 2) Is there any interest in having a specialized 3x3 covariance matrix
>>>>> estimation method for the SSE case?
>>>>
>>>> At this stage I wouldn't do such heavy changes in 2.0, but we can
>>>> discuss this for the development branch. I'm not sure how you would
>>>> work around the alignment issues at runtime. By copying the matrix
>>>> into a temporary 4x4 matrix?
>>>
>>> What I meant was something like:
>>>
>>> Eigen::Matrix3f covariance_matrix = Eigen::Matrix3f::Zero ();
>>>
>>> for loop goes here....
>>> {
>>>  m128Wrapper point16 = ...;
>>>
>>>  // Prepare the shufflers
>>>  xxxy = point16.shuffle<0, 0, 0, 1> ();
>>>  yyzz = point16.shuffle<1, 1, 2, 2> ();
>>>  xyzx = point16.shuffle<0, 1, 2, 0> ();
>>>  yzxy = point16.shuffle<1, 2, 0, 1> ();
>>>
>>>  // Multiply 4 + 4
>>>  m128Wrapper mat_ptr1 = xxxy * xyzx;
>>>  m128Wrapper mat_ptr2 = yyzz * yzxy;
>>>  *(__m128*)&covariance_matrix (0, 0) += mat_ptr1.value;
>>>  *(__m128*)&covariance_matrix (1, 1) += mat_ptr2.value;
>>>  covariance_matrix (2, 2) += point16[2] * point16[2];
>>> }
>>>
>>> where the shuffle is a simple _mm_shuffle_ps, and point16 is an SSE aligned
>>> __m128 wrapper structure. It's hard to get things faster than this without
>>> the above shufflers :)
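>>>
>>> For reference, the same inner step with raw intrinsics (no wrapper), just
>>> to make the shuffles explicit; names are illustrative, pt is an aligned
>>> float[4] = {x, y, z, 0}, and acc0/acc1 are 16-byte aligned accumulators:
>>>
>>> #include <xmmintrin.h>
>>>
>>> void cov_accumulate(const float* pt, float* acc0, float* acc1, float& acc22)
>>> {
>>>   __m128 p    = _mm_load_ps(pt);
>>>   __m128 xxxy = _mm_shuffle_ps(p, p, _MM_SHUFFLE(1, 0, 0, 0)); // [x x x y]
>>>   __m128 yyzz = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 2, 1, 1)); // [y y z z]
>>>   __m128 xyzx = _mm_shuffle_ps(p, p, _MM_SHUFFLE(0, 2, 1, 0)); // [x y z x]
>>>   __m128 yzxy = _mm_shuffle_ps(p, p, _MM_SHUFFLE(1, 0, 2, 1)); // [y z x y]
>>>   // Two packet multiply-adds cover 8 of the 9 entries; the remaining
>>>   // z*z term is done in scalar, exactly as in the loop above.
>>>   _mm_store_ps(acc0, _mm_add_ps(_mm_load_ps(acc0), _mm_mul_ps(xxxy, xyzx)));
>>>   _mm_store_ps(acc1, _mm_add_ps(_mm_load_ps(acc1), _mm_mul_ps(yyzz, yzxy)));
>>>   acc22 += pt[2] * pt[2];
>>> }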
>>
>> How much faster is it? If it's worth it, we could add a
>> MatrixBase::rankUpdate(const MatrixBase<OtherDerived>&) function with
>> a specialization for Vector3f-like vectors; the API would then be:
>>
>> for(...)
>>  cov_mat.rankUpdate(pt[i]);
>>
>> that is consistent with the SelfadjointView API:
>>
>> for(...)
>>  cov_mat.selfadjointView<Upper>().rankUpdate(pt[i]); // here only the
>> upper triangular part is updated.
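>>
>> (In plain Eigen terms the rank-1 update above is just
>>   cov_mat += pt[i] * pt[i].transpose();
>> the point of a dedicated rankUpdate() would be to specialize that for
>> Vector3f with the kind of shuffles you showed.)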
>>
>>>>> 4) Is this the recommended optimized way to get a dot product between a
>>>>> VectorXf and a Vector4f ?
>>>>>
>>>>> float d = ((Eigen::Vector4f)my_vectorxf).start<4>().dot (my_vector4f);
>>>
>>> [...]
>>>>
>>>> my_vectorxf.start<4>().dot(my_vector4f)
>>>
>>> It seems like it also works without the <4> if my_vectorxf was assigned
>>> from a Vector4f a priori...
>>>
>>> Eigen::VectorXf my_vectorxf;
>>> my_vectorxf = Eigen::Vector4f (x, y, z, a);
>>>
>>> float d = my_vectorxf.dot (my_vector4f);
>>
>> Yes, but only if my_vectorxf.size() == 4, so why not use a Vector4f
>> directly?
>>
>>> This is guaranteed to get optimized, right?
>>
>> Not yet, because we currently do not take care to pick the smallest
>> compile-time size of the two vectors, but if you use two Vector4f
>> (or use my_vectorxf.start<4>()), then yes, it is optimized.
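>>
>> In other words, assuming my_vector4f is a Vector4f:
>>
>>   Eigen::Vector4f v4(x, y, z, a);
>>   float d1 = v4.dot(my_vector4f);                     // both sizes fixed at compile time: vectorized
>>   float d2 = my_vectorxf.start<4>().dot(my_vector4f); // fixed-size path forced: vectorized
>>   float d3 = my_vectorxf.dot(my_vector4f);            // dynamic size picked for now: not guaranteed to be optimized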
>>
>>>
>>> Which brings me to the next point :)
>>>
>>>
>>> 5) Can we also add a dot product optimization for SSE4 (_mm_dp_ps)?
>>>
>>> http://www.intel.com/technology/itj/2008/v12i3/3-paper/6-examples.htm
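>>>
>>> For reference, the intrinsic usage would be roughly this (SSE4.1, compile
>>> with -msse4.1; the function name is just for illustration):
>>>
>>> #include <smmintrin.h>
>>>
>>> // 0xF1 mask: multiply and sum all four lanes, write the result to lane 0.
>>> float dot4(const float* a, const float* b) // a, b: 16-byte aligned float[4]
>>> {
>>>   return _mm_cvtss_f32(_mm_dp_ps(_mm_load_ps(a), _mm_load_ps(b), 0xF1));
>>> }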
>>>
>>>
>>> PS. The point of my e-mails is that I am trying to get rid of an SSE
>>> structure wrapper that I wrote and go with Eigen::Vector4f all the way. That
>>> should make the code more maintainable.
>>
>> I thought Benoit already did it?
>
> Nope, the only SSE4 stuff I did was integer multiplication.
>
> This is actually on our TODO,
> http://eigen.tuxfamily.org/index.php?title=Todo_for_3.0
> but with a very low priority since it affects only SSE4 hardware.
>

SSE4.1 is actually *very* useful.

It's a shame that no AMD processor supports it. Looking to the future,
I am skeptical that they will before 2011.
> Benoit
>
>>
>> gael.
>>
>>>
>>> Cheers,
>>> Radu.
>>> --
>>> | Radu Bogdan Rusu | http://rbrusu.com/
>>>



-- 
Rohit Garg

http://rpg-314.blogspot.com/

Senior Undergraduate
Department of Physics
Indian Institute of Technology
Bombay


