[eigen] Re: sse4 and integer multiplication
- To: eigen <eigen@xxxxxxxxxxxxxxxxxxx>
- Subject: [eigen] Re: sse4 and integer multiplication
- From: Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
- Date: Tue, 24 Nov 2009 15:36:59 -0500
Anyway, w being a compile-time constant wasn't the reason why the
non-vectorized code is faster.
New code, with w passed as a function argument:
#include <Eigen/Dense>
using namespace Eigen;
using namespace std;

EIGEN_DONT_INLINE void foo(const Vector4i& w)
{
  Vector4i v(5,-7,11,13);
  for(int i = 0; i<100000000; i++)
  {
    EIGEN_ASM_COMMENT("begin");
    v = v.cwise()*v;
    v = v.cwise()*w;
    EIGEN_ASM_COMMENT("end");
  }
  cout << v << endl;
}

int main()
{
  foo(Vector4i(91,39,-53,-79));
}
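
For anyone who wants to look at the same arithmetic without Eigen, here is a minimal plain-C++ sketch of the loop body. It is not what Eigen generates: the names Lanes, step and run are made up for illustration, and the lanes are unsigned so that overflow wraps instead of being undefined (an assumption of this sketch, the benchmark above uses int).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for Vector4i: four 32-bit lanes.
// Unsigned so that overflow wraps (assumption of this sketch).
using Lanes = std::array<std::uint32_t, 4>;

// One iteration of the benchmark body: v = v .* v, then v = v .* w.
inline void step(Lanes& v, const Lanes& w)
{
  for (std::size_t k = 0; k < 4; ++k) {
    v[k] *= v[k];  // each lane depends only on its own previous value
    v[k] *= w[k];
  }
}

// Runs `iterations` steps from the same starting values as foo().
inline Lanes run(const Lanes& w, int iterations)
{
  Lanes v = {5u, static_cast<std::uint32_t>(-7), 11u, 13u};
  for (int i = 0; i < iterations; ++i) step(v, w);
  return v;
}
```

Calling run(w, 100000000) with w built from (91,39,-53,-79) performs the same sequence of multiplications as foo(w); whether the compiler emits imull or a vector multiply for it depends on the compiler and flags.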
Non-vectorized:
imull %ebx, %ebx
imull %ecx, %ecx
imull %edx, %edx
imull %eax, %eax
imull %r9d, %ebx
imull %r8d, %ecx
imull %ebp, %edx
imull %edi, %eax
With SSE4.1:
movdqa %xmm1, %xmm0
pmulld %xmm1, %xmm0
pmulld %xmm2, %xmm0
movdqa %xmm0, %xmm1
movdqa %xmm0, (%rbp)
The speed difference is even bigger than before:
non-vectorized: 0.47 s
with SSE4.1: 0.81 s
I'm especially puzzled because the SSE version even executes fewer instructions!
If I add more code, such as an addition, then the speed difference becomes
much smaller, but non-vectorized remains faster.
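
One thing that might be worth checking, purely as a sketch and not a confirmed explanation: in this loop each v depends on the previous v, so the run time is bounded below by the latency of the dependent multiplies rather than by the instruction count. The helper below only expresses that bound; chain_bound_cycles is a made-up name and the latency argument is a placeholder, not a measured value for any CPU.

```cpp
#include <cstdint>

// Lower bound, in cycles, for a loop whose iterations form one serial
// dependency chain: every iteration waits for `chain_length` dependent
// multiplies to complete. `latency_cycles` is a hypothetical
// per-instruction latency supplied by the caller, not a measurement.
constexpr std::uint64_t chain_bound_cycles(std::uint64_t iterations,
                                           std::uint64_t chain_length,
                                           std::uint64_t latency_cycles)
{
  return iterations * chain_length * latency_cycles;
}
```

With placeholder latencies of 3 and 5 cycles and two dependent multiplies per iteration, the bound is 6 versus 10 cycles per iteration, no matter how many lanes each instruction processes. Whether that matches the hardware used here would have to be checked against the instruction tables for that CPU.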
Benoit
2009/11/24 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
> Hi,
>
> I just added SSE4 integer multiplication support. Where SSE4 is
> available, it is an improvement over the current vectorized integer
> multiplication, but I am puzzled. Here is my benchmark:
>
>
>
> #include <Eigen/Dense>
> using namespace Eigen;
> using namespace std;
>
> EIGEN_DONT_INLINE void foo()
> {
>   // i was wondering if the cpu could be clever enough to
>   // optimize when the ints are 0 or 1; it's not so easy to
>   // ensure that we don't end up with only 0 and 1...
>
>   Vector4i v(5,-7,11,13);
>   Vector4i w(9,3,-5,-7);
>
>   for(int i = 0; i<100000000; i++)
>   {
>     EIGEN_ASM_COMMENT("begin");
>     v = v.cwise()*v;
>     v = v.cwise()*w;
>     EIGEN_ASM_COMMENT("end");
>   }
>   cout << v << endl;
> }
>
> int main()
> {
>   foo();
> }
>
>
>
> OK, so I'm puzzled because the fastest is... with no vectorization at all.
>
> No vectorization: 0.57 sec
> With SSE4.1: 0.81 sec
> With SSE2: 1.21 sec
>
> So I did what I usually do in such circumstances: dump the assembly
> and go whine until daddy Gael takes care of me.
>
> Without vec:
>
> imull %edx, %edx
> imull %eax, %eax
> leal 0(,%rdx,8), %edi
> imull %ebx, %ebx
> leal (%rax,%rax,4), %eax
> imull %ecx, %ecx
> subl %edi, %edx
> negl %eax
> leal (%rbx,%rbx,8), %ebx
> leal (%rcx,%rcx,2), %ecx
>
> With SSE4.1:
>
> movdqa %xmm1, %xmm0
> pmulld %xmm1, %xmm0
> pmulld (%rdx), %xmm0
> movdqa %xmm0, %xmm1
> movdqa %xmm0, (%rbp)
>
> Cheers,
> Benoit
>