Re: [eigen] Re: sse4 and integer multiplication





On Tue, Nov 24, 2009 at 9:36 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
Anyway, that wasn't the reason why the non-vectorized code is faster.
New code:



#include <Eigen/Dense>
using namespace Eigen;
using namespace std;

EIGEN_DONT_INLINE void foo(const Vector4i& w)
{
 Vector4i v(5,-7,11,13);

 for(int i = 0; i<100000000; i++)
 {
   EIGEN_ASM_COMMENT("begin");
   v = v.cwise()*v;
   v = v.cwise()*w;
   EIGEN_ASM_COMMENT("end");
 }
 cout << v << endl;
}

int main()
{
 foo(Vector4i(91,39,-53,-79));
}






Non-vectorized:



       imull   %ebx, %ebx
       imull   %ecx, %ecx
       imull   %edx, %edx
       imull   %eax, %eax
       imull   %r9d, %ebx
       imull   %r8d, %ecx
       imull   %ebp, %edx
       imull   %edi, %eax


With sse 4.1:

       movdqa  %xmm1, %xmm0
       pmulld  %xmm1, %xmm0
       pmulld  %xmm2, %xmm0
       movdqa  %xmm0, %xmm1
       movdqa  %xmm0, (%rbp)


The speed difference is even bigger:

non-vectorized:   0.47s
with sse 4.1:       0.81s

I'm especially puzzled as even the number of instructions is smaller with sse!

If I add more code like an addition, then the speed difference becomes
much smaller, but non-vectorized remains faster.

Benoit
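
A minimal sketch of how timings like the ones above could be collected from inside the program, for readers who want to repeat the comparison. This is not how the numbers in this thread were measured. The helper `time_seconds`, the function `scalar_kernel`, and the use of unsigned 32-bit lanes (so that wraparound on overflow is well defined) are assumptions made for illustration and do not come from Eigen.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstdint>

// Four 32-bit lanes, standing in for Vector4i. Unsigned so that overflow wraps.
using Lanes = std::array<std::uint32_t, 4>;

// Scalar version of the benchmark loop body: v = v*v, then v = v*w, per lane.
Lanes scalar_kernel(Lanes v, const Lanes& w, int iterations)
{
    assert(iterations >= 0);
    for (int i = 0; i < iterations; ++i) {
        for (int k = 0; k < 4; ++k) {
            v[k] *= v[k];
            v[k] *= w[k];
        }
    }
    return v;
}

// Hypothetical helper: runs fn once and returns the elapsed wall-clock seconds.
template <typename Fn>
double time_seconds(Fn&& fn)
{
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Wrapping the call in a lambda, for example `time_seconds([&]{ out = scalar_kernel(v, w, 100000000); })`, and then printing `out` keeps the compiler from discarding the loop, which is the same role `cout << v` plays in the benchmark.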

2009/11/24 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
> Hi,
>
> I just added SSE4 integer mul support. It is an improvement over the
> current vectorized integer multiplication where SSE4 is available, but
> I am puzzled: here is my benchmark:
>
>
>
> #include <Eigen/Dense>
> using namespace Eigen;
> using namespace std;
>
> EIGEN_DONT_INLINE void foo()
> {
>  // i was wondering if the cpu could be clever enough to
>  // optimize when the ints are 0 or 1; it's not so easy to
>  // ensure that we don't end up with only 0 and 1...
>
>  Vector4i v(5,-7,11,13);
>  Vector4i w(9,3,-5,-7);
>
>  for(int i = 0; i<100000000; i++)
>  {
>    EIGEN_ASM_COMMENT("begin");
>    v = v.cwise()*v;
>    v = v.cwise()*w;
>    EIGEN_ASM_COMMENT("end");
>  }
>  cout << v << endl;
> }
>
> int main()
> {
>  foo();
> }
>
>
>
> OK so I'm puzzled because the fastest is... with no vectorization at all.
>
> No vectorization:   0.57 sec
> With SSE4.1:      0.81 sec
> With SSE2:         1.21 sec
>
> So I did what I usually do in such circumstances: dump the assembly
> and go whine until daddy Gael takes care of me.
>
> Without vec:
>
>        imull   %edx, %edx
>        imull   %eax, %eax
>        leal    0(,%rdx,8), %edi
>        imull   %ebx, %ebx
>        leal    (%rax,%rax,4), %eax
>        imull   %ecx, %ecx
>        subl    %edi, %edx
>        negl    %eax
>        leal    (%rbx,%rbx,8), %ebx
>        leal    (%rcx,%rcx,2), %ecx
>
> With SSE4.1:
>
>        movdqa  %xmm1, %xmm0
>        pmulld  %xmm1, %xmm0
>        pmulld  (%rdx), %xmm0
>        movdqa  %xmm0, %xmm1
>        movdqa  %xmm0, (%rbp)

In the SSE4 version you have two unnecessary moves, one useless load, and one useless store. That's the main reason. Now why GCC does not optimize them away, well, I have no clue...


gael
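
To make the point about the extra moves, load, and store concrete, here is a hedged sketch of the same loop written directly with SSE4.1 intrinsics, where both operands are loaded once before the loop and the result is stored once after it. It is not Eigen's implementation. The function names and the unsigned lane type are assumptions for illustration, and whether a given compiler then emits fewer instructions in the loop has not been measured here. When SSE4.1 is not enabled at compile time, the sketch falls back to the scalar loop.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#if defined(__SSE4_1__)
#include <smmintrin.h>  // SSE4.1 intrinsics, including _mm_mullo_epi32 (pmulld)
#endif

using Vec4 = std::array<std::uint32_t, 4>;

// Scalar reference: v = v*v, then v = v*w, per lane, with 32-bit wraparound.
Vec4 mul_loop_scalar(Vec4 v, const Vec4& w, int iterations)
{
    assert(iterations >= 0);
    for (int i = 0; i < iterations; ++i) {
        for (int k = 0; k < 4; ++k) {
            v[k] *= v[k];
            v[k] *= w[k];
        }
    }
    return v;
}

// Same computation with explicit intrinsics when SSE4.1 is available.
Vec4 mul_loop_simd(Vec4 v, const Vec4& w, int iterations)
{
    assert(iterations >= 0);
#if defined(__SSE4_1__)
    // One load per operand, before the loop.
    __m128i xv = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v.data()));
    const __m128i xw = _mm_loadu_si128(reinterpret_cast<const __m128i*>(w.data()));
    for (int i = 0; i < iterations; ++i) {
        xv = _mm_mullo_epi32(xv, xv);  // low 32 bits of each lane product
        xv = _mm_mullo_epi32(xv, xw);
    }
    // One store, after the loop.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(v.data()), xv);
    return v;
#else
    return mul_loop_scalar(v, w, iterations);
#endif
}
```

With GCC or Clang, building with `-msse4.1` selects the intrinsic path. Comparing the generated assembly of the two functions is one way to check whether the store inside the loop disappears.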




