[eigen] Re: sse4 and integer multiplication



Anyway, that wasn't the reason why the non-vectorized code is faster.
New code:



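#include <iostream>
#include <Eigen/Dense>
using namespace Eigen;
using namespace std;

// w is now a function argument, so it is no longer a compile-time constant
EIGEN_DONT_INLINE void foo(const Vector4i& w)
{
  Vector4i v(5,-7,11,13);

  for(int i = 0; i<100000000; i++)
  {
    EIGEN_ASM_COMMENT("begin");
    v = v.cwise()*v;  // coefficient-wise square
    v = v.cwise()*w;  // coefficient-wise product with w
    EIGEN_ASM_COMMENT("end");
  }
  cout << v << endl;  // print the result so the loop is not optimized away
}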

int main()
{
  foo(Vector4i(91,39,-53,-79));
}






Non-vectorized:



	imull	%ebx, %ebx
	imull	%ecx, %ecx
	imull	%edx, %edx
	imull	%eax, %eax
	imull	%r9d, %ebx
	imull	%r8d, %ecx
	imull	%ebp, %edx
	imull	%edi, %eax


With SSE4.1:

	movdqa	%xmm1, %xmm0
	pmulld	%xmm1, %xmm0
	pmulld	%xmm2, %xmm0
	movdqa	%xmm0, %xmm1
	movdqa	%xmm0, (%rbp)


The speed difference is even bigger:

Non-vectorized:   0.47 sec
With SSE4.1:      0.81 sec

I'm especially puzzled because the SSE version even has fewer instructions!

If I add more code to the loop, such as an addition, the speed difference
becomes much smaller, but the non-vectorized version remains faster.
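For anyone who wants to reproduce the non-vectorized side without Eigen, here is a minimal sketch of the same loop in plain C++. It is not the code that produced the timings above. The names Lanes, square_then_scale and time_scalar_loop are made up for illustration, and the lanes are unsigned (an assumption that departs from Vector4i) so that the overflow that sets in after a few iterations wraps modulo 2^32 instead of being undefined.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for Vector4i: four 32-bit lanes.
using Lanes = std::array<std::uint32_t, 4>;

// Scalar version of the loop body: v = v*v, then v = v*w, lane by lane.
// Unsigned lanes make overflow wrap modulo 2^32 instead of being undefined.
Lanes square_then_scale(Lanes v, const Lanes& w, long iterations)
{
    assert(iterations >= 0);
    for (long i = 0; i < iterations; ++i) {
        for (std::size_t k = 0; k < v.size(); ++k) {
            v[k] *= v[k];
            v[k] *= w[k];
        }
    }
    return v;
}

// Wall-clock seconds taken by `iterations` rounds of the loop above.
// The final lanes are written to `result` so the work cannot be discarded.
double time_scalar_loop(const Lanes& v, const Lanes& w, long iterations,
                        Lanes& result)
{
    const auto start = std::chrono::steady_clock::now();
    result = square_then_scale(v, w, iterations);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_scalar_loop with 100000000 iterations and the same starting values mirrors the benchmark, although the absolute numbers will depend on the compiler, flags and CPU. Whether the compiler keeps this scalar or vectorizes it on its own has to be checked in the generated assembly.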

Benoit

2009/11/24 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
> Hi,
>
> i just added SSE4 integer mul support. It is an improvement over the
> current vectorized integer multiplication where SSE4 is available, but
> i am puzzled: here is my benchmark:
>
>
>
> #include <Eigen/Dense>
> using namespace Eigen;
> using namespace std;
>
> EIGEN_DONT_INLINE void foo()
> {
>  // i was wondering if the cpu could be clever enough to
>  // optimize when the ints are 0 or 1; it's not so easy to
>  // ensure that we don't end up with only 0 and 1...
>
>  Vector4i v(5,-7,11,13);
>  Vector4i w(9,3,-5,-7);
>
>  for(int i = 0; i<100000000; i++)
>  {
>    EIGEN_ASM_COMMENT("begin");
>    v = v.cwise()*v;
>    v = v.cwise()*w;
>    EIGEN_ASM_COMMENT("end");
>  }
>  cout << v << endl;
> }
>
> int main()
> {
>  foo();
> }
>
>
>
> OK so i'm puzzled because the fastest is... with no vectorization at all.
>
> No vectorization:   0.57 sec
> With SSE4.1:      0.81 sec
> With SSE2:         1.21 sec
>
> So i did what i usually do in such circumstances: dump the assembly
> and go whine until daddy Gael takes care of me.
>
> Without vec:
>
>        imull   %edx, %edx
>        imull   %eax, %eax
>        leal    0(,%rdx,8), %edi
>        imull   %ebx, %ebx
>        leal    (%rax,%rax,4), %eax
>        imull   %ecx, %ecx
>        subl    %edi, %edx
>        negl    %eax
>        leal    (%rbx,%rbx,8), %ebx
>        leal    (%rcx,%rcx,2), %ecx
>
> With SSE4.1:
>
>        movdqa  %xmm1, %xmm0
>        pmulld  %xmm1, %xmm0
>        pmulld  (%rdx), %xmm0
>        movdqa  %xmm0, %xmm1
>        movdqa  %xmm0, (%rbp)
>
> Cheers,
> Benoit
>


