[eigen] Re: small sums: vectorization not worth it



Act 1, 4th and last scene:

I wanted to see at which size it starts being beneficial to vectorize...

Attached is one more example (sum.cpp), this time with size 64, type
double, and unrolling forced.

Result: x87 is still 50% faster than SSE2.

The SSE2 asm is:

#APP
# 12 "dot.cpp" 1
	#a
# 0 "" 2
#NO_APP
	movapd	(%esi), %xmm1
	movapd	-104(%ebp), %xmm0
	addpd	-88(%ebp), %xmm0
	addpd	-120(%ebp), %xmm0
	addpd	-136(%ebp), %xmm0
	addpd	-152(%ebp), %xmm0
	addpd	-168(%ebp), %xmm0
	addpd	-184(%ebp), %xmm0
	addpd	-200(%ebp), %xmm0
	addpd	-216(%ebp), %xmm0
	addpd	-232(%ebp), %xmm0
	addpd	-248(%ebp), %xmm0
	addpd	-264(%ebp), %xmm0
	addpd	-280(%ebp), %xmm0
	addpd	-296(%ebp), %xmm0
	addpd	-312(%ebp), %xmm0
	addpd	-328(%ebp), %xmm0
	addpd	-344(%ebp), %xmm0
	addpd	-360(%ebp), %xmm0
	addpd	-376(%ebp), %xmm0
	addpd	-392(%ebp), %xmm0
	addpd	-408(%ebp), %xmm0
	addpd	-424(%ebp), %xmm0
	addpd	-440(%ebp), %xmm0
	addpd	-456(%ebp), %xmm0
	addpd	-472(%ebp), %xmm0
	addpd	-488(%ebp), %xmm0
	addpd	-504(%ebp), %xmm0
	addpd	-520(%ebp), %xmm0
	addpd	-536(%ebp), %xmm0
	addpd	-552(%ebp), %xmm0
	addpd	-568(%ebp), %xmm0
	addpd	%xmm0, %xmm1
	movapd	%xmm1, %xmm2
	unpckhpd	%xmm1, %xmm2
	addsd	%xmm2, %xmm1
	movapd	%xmm1, %xmm2
	movsd	%xmm2, -584(%ebp)
#APP
# 14 "dot.cpp" 1
	#b
# 0 "" 2
#NO_APP
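For reference, the movapd/unpckhpd/addsd tail at the end is the horizontal
reduction (ei_predux): the extra work the vectorized path has to do after the
packed adds to collapse the accumulator into a scalar. A rough equivalent in
SSE2 intrinsics (my own sketch, not Eigen's actual ei_predux code) would be:

#include <emmintrin.h>

// Sketch only: collapse the two doubles of an SSE2 accumulator into one
// scalar, mirroring the movapd/unpckhpd/addsd sequence in the listing above.
static inline double horizontal_sum_sketch(__m128d acc)
{
  __m128d hi = _mm_unpackhi_pd(acc, acc);     // move high element to low slot
  return _mm_cvtsd_f64(_mm_add_sd(acc, hi));  // low(acc) + high(acc)
}

This is a fixed cost paid once per sum(), regardless of size, which is why it
weighs so heavily on the very small cases like the Vector2d::sum() mentioned
below.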



2009/1/16 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
> and in case you wonder: the situation is almost the same for dot product
>
> (attached file dot.cpp)
>
> runs 2x slower with SSE...
>
> Cheers,
> Benoit
>
> 2009/1/16 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>> Note: I think the old heuristic was wrong anyway.
>>
>> Maybe take this occasion to introduce an EIGEN_COST_OF_PREDUX (since
>> this cost depends greatly on the SIMD platform)?
>> And then use a natural heuristic rather than the quick hack we used to have?
>>
>> Benoit
>>
>> 2009/1/16 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>> Hi Gael *cough* List,
>>>
>>> ei_predux is costly because it consists of more than one SIMD instruction.
>>>
>>> So until recently, sum() would only vectorize if the size was big enough.
>>> However, this was recently changed.
>>>
>>> Attached is a benchmark that runs 2.5x slower with SSE (2 or 3) than
>>> without. It's just Vector2d::sum().
>>>
>>> So, revert to old behavior?
>>>
>>> Moreover, the matrix product innerVectorization also uses an ei_predux. Same question there?
>>>
>>> Cheers,
>>> Benoit
>>>
>>
>
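(For concreteness, a cost-based guard along the lines of the
EIGEN_COST_OF_PREDUX suggestion quoted above could look like the sketch
below. This is illustrative only, not Eigen code; the names and constants are
placeholders, and the size-64 result above suggests the real per-platform
costs are higher than a naive instruction count.)

// Illustrative sketch only, not Eigen code: decide at compile time whether a
// fixed-size sum is worth vectorizing, charging the horizontal reduction as
// an explicit cost (this would become EIGEN_COST_OF_PREDUX per platform).
const int packet_size    = 2; // doubles per SSE2 packet
const int cost_of_predux = 3; // e.g. movapd + unpckhpd + addsd

template<int Size>
struct should_vectorize_sum
{
  enum {
    scalar_cost = Size - 1,                                // scalar adds
    vector_cost = Size / packet_size - 1 + cost_of_predux, // packed adds + predux
    ret = vector_cost < scalar_cost
  };
};
// e.g. should_vectorize_sum<2>::ret is 0, should_vectorize_sum<64>::ret is 1.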
#define EIGEN_UNROLLING_LIMIT 10000 // force full unrolling of the size-64 sum
#include <Eigen/Core>

typedef Eigen::Matrix<double, 64, 1> T;

int main()
{
  T v; v.setZero(); v[0] = 1;
  for(int i = 0; i < 10000000; i++)
  {
    // the 1e-10 factor keeps the values from growing without bound
    v = T::Ones() + v * 1e-10;
    asm("#a"); // marker: locate the sum() below in the generated asm
    v[0] = v.sum();
    asm("#b"); // end marker
    v[1] = v.sum();
    //std::cout << v << "\n"; // check it's not inf...
  }
  return int(v[0]); // use the result so the loop isn't optimized away
}
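(To reproduce, the listing above is compiler asm output around the #a/#b
markers, so presumably something like "g++ -O2 -S -msse2 -mfpmath=sse sum.cpp"
for the SSE2 case versus the same command without -msse2 for the x87 case, on
32-bit x86; the exact flags used are a guess on my part.)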

