[eigen] sse4 and integer multiplication



Hi,

I just added SSE4.1 integer multiplication support (pmulld). Where
SSE4.1 is available, it is an improvement over the current vectorized
integer multiplication, but I am puzzled. Here is my benchmark:



#include <iostream>
#include <Eigen/Dense>
using namespace Eigen;
using namespace std;

EIGEN_DONT_INLINE void foo()
{
  // I wondered whether the CPU could be clever enough to take a
  // shortcut when the ints are 0 or 1; it's not so easy to make
  // sure that we don't end up with only 0 and 1...

  Vector4i v(5,-7,11,13);
  Vector4i w(9,3,-5,-7);

  for(int i = 0; i<100000000; i++)
  {
    EIGEN_ASM_COMMENT("begin");
    v = v.cwise()*v;
    v = v.cwise()*w;
    EIGEN_ASM_COMMENT("end");
  }
  cout << v << endl;
}

int main()
{
  foo();
}



OK, so I'm puzzled because the fastest run is... the one with no
vectorization at all.

No vectorization:  0.57 sec
With SSE4.1:       0.81 sec
With SSE2:         1.21 sec

So I did what I usually do in such circumstances: dump the assembly
and go whine until daddy Gael takes care of me.

Without vectorization:

	imull	%edx, %edx
	imull	%eax, %eax
	leal	0(,%rdx,8), %edi
	imull	%ebx, %ebx
	leal	(%rax,%rax,4), %eax
	imull	%ecx, %ecx
	subl	%edi, %edx
	negl	%eax
	leal	(%rbx,%rbx,8), %ebx
	leal	(%rcx,%rcx,2), %ecx

With SSE4.1:

	movdqa	%xmm1, %xmm0
	pmulld	%xmm1, %xmm0
	pmulld	(%rdx), %xmm0
	movdqa	%xmm0, %xmm1
	movdqa	%xmm0, (%rbp)

Cheers,
Benoit


