Re: [eigen] Vectorized quaternion multiplication.
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] Vectorized quaternion multiplication.
- From: Gael Guennebaud <gael.guennebaud@xxxxxxxxx>
- Date: Sat, 7 Mar 2009 12:46:05 +0100
Hi,
thanks a lot.
At first glance I was not sure about the perf, because it needs a
lot of shuffle instructions, which are quite costly, so I benchmarked it, and on
my Core 2 your version is 1.5 times faster :) Then I replaced
shuffle_ps with the simpler PSHUFD instruction, and now it is almost 2x
faster, so really worth it :)
FYI, I only changed vec4f_swizzle, like this:
#define vec4f_swizzle(v,p,q,r,s) \
  (_mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(v), \
                                      ((s)<<6|(r)<<4|(q)<<2|(p)))))
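
For readers without the attachment, here is a minimal sketch of the two swizzle variants side by side. It assumes the original macro called _mm_shuffle_ps with the same register as both sources, which is what the quoted disassembly suggests; the attached source is not reproduced in this mail. The names swizzle_shufps, swizzle_pshufd and swizzle_variants_agree are made up for illustration. The likely reason for the speedup is that PSHUFD writes to a separate destination register, so the compiler does not need a movaps copy before each shuffle.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_shuffle_epi32 and the cast intrinsics */
#include <string.h>

/* Assumed shape of the original helper: SHUFPS with v as both sources.
   Result lanes are (v[p], v[q], v[r], v[s]). */
#define swizzle_shufps(v,p,q,r,s) \
    _mm_shuffle_ps((v), (v), ((s)<<6|(r)<<4|(q)<<2|(p)))

/* PSHUFD variant, same lane selection; the casts generate no instructions. */
#define swizzle_pshufd(v,p,q,r,s) \
    _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(v), \
                                       ((s)<<6|(r)<<4|(q)<<2|(p))))

/* Returns 1 when both variants give bit-identical results for one pattern. */
static int swizzle_variants_agree(void)
{
    const __m128 v = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    float a[4], b[4];

    _mm_storeu_ps(a, swizzle_shufps(v, 1, 2, 0, 3));
    _mm_storeu_ps(b, swizzle_pshufd(v, 1, 2, 0, 3));

    /* Selecting lanes 1,2,0,3 of (1,2,3,4) yields (2,3,1,4). */
    assert(a[0] == 2.0f && a[1] == 3.0f && a[2] == 1.0f && a[3] == 4.0f);
    return memcmp(a, b, sizeof a) == 0;
}
```

The selector arguments must be compile-time constants in both variants, since they are encoded in the instruction's immediate byte.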
cheers,
gael
On Sat, Mar 7, 2009 at 10:24 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
> Hi,
>
> The attached file has the code for vectorized quaternion
> multiplication (SSE only). I have not made a patch because I do not
> have access to svn. My school's proxy apparently doesn't play nice
> with svn. The results match against eigen's results. The convention is
> that x,y,z are stored first and then the scalar part.
>
> The assembly is 23 instructions, but they are highly pipelined.
>
> 00000000004009e0 <_Z8quat_mulU8__vectorfS_>:
> 4009e0: 0f 28 f9 movaps %xmm1,%xmm7
> 4009e3: 0f 28 f1 movaps %xmm1,%xmm6
> 4009e6: 0f 28 d8 movaps %xmm0,%xmm3
> 4009e9: 0f 28 e9 movaps %xmm1,%xmm5
> 4009ec: 0f 28 d0 movaps %xmm0,%xmm2
> 4009ef: 0f c6 f9 ff shufps $0xff,%xmm1,%xmm7
> 4009f3: 0f c6 f1 09 shufps $0x9,%xmm1,%xmm6
> 4009f7: 0f c6 e9 64 shufps $0x64,%xmm1,%xmm5
> 4009fb: 0f 28 e0 movaps %xmm0,%xmm4
> 4009fe: 0f c6 d8 7f shufps $0x7f,%xmm0,%xmm3
> 400a02: 0f c6 d0 89 shufps $0x89,%xmm0,%xmm2
> 400a06: 0f c6 c9 92 shufps $0x92,%xmm1,%xmm1
> 400a0a: 0f c6 e0 12 shufps $0x12,%xmm0,%xmm4
> 400a0e: 0f 59 dd mulps %xmm5,%xmm3
> 400a11: 0f 59 d1 mulps %xmm1,%xmm2
> 400a14: 0f 28 0d 45 0f 20 00 movaps 0x200f45(%rip),%xmm1 # 601960 <_ZL9quat_mask>
> 400a1b: 0f 59 e6 mulps %xmm6,%xmm4
> 400a1e: 0f 59 c7 mulps %xmm7,%xmm0
> 400a21: 0f 57 d1 xorps %xmm1,%xmm2
> 400a24: 0f 57 d9 xorps %xmm1,%xmm3
> 400a27: 0f 5c c4 subps %xmm4,%xmm0
> 400a2a: 0f 58 d3 addps %xmm3,%xmm2
> 400a2d: 0f 58 c2 addps %xmm2,%xmm0
> 400a30: c3 retq
> 400a31: 66 66 66 66 66 66 2e nopw %cs:0x0(%rax,%rax,1)
> 400a38: 0f 1f 84 00 00 00 00
> 400a3f: 00
>
> I checked Eigen's code, and it has ~55 instructions, so this is clearly a
> win. With this I hope the major arguments against quaternion class
> being scalar will go away. If someone can build a scaffolding for SIMD
> quaternion code, I'll be happy to chip in with the optimized
> implementations. Add and subtract should be fairly trivial.
> I am interested in implementing a fast quaternion slerp as well. Fast
> == approximate.
>
> On the topic of complex math, I feel Eigen should make specialized
> classes. It is highly unlikely that gcc will be able to vectorize
> complex multiplication anytime soon.
>
> Regards,
> --
> Rohit Garg
>
> http://rpg-314.blogspot.com/
>
> Senior Undergraduate
> Department of Physics
> Indian Institute of Technology
> Bombay
>
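
The disassembly quoted above can be read back into intrinsics roughly as follows. This is a reconstruction from the shuffle immediates, not the attached file, so treat it as a sketch: the names SWZ, quat_mul_sse, quat_mul_ref and quat_mul_selftest are hypothetical, and the mask is assumed to flip only the sign of the w lane (the contents of quat_mask are not shown in the thread). Storage is (x, y, z, w), scalar part last, as stated in the quoted mail.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical swizzle helper: result lanes are (v[p], v[q], v[r], v[s]). */
#define SWZ(v,p,q,r,s) \
    _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(v), \
                                       ((s)<<6|(r)<<4|(q)<<2|(p))))

/* Quaternion product a*b, both stored as (x, y, z, w). */
static __m128 quat_mul_sse(__m128 a, __m128 b)
{
    /* Assumed mask: sign bit set in the w lane only. */
    const __m128 flip_w = _mm_setr_ps(0.0f, 0.0f, 0.0f, -0.0f);

    __m128 t0 = _mm_mul_ps(a, SWZ(b, 3, 3, 3, 3));                   /* a.xyzw * b.wwww */
    __m128 t1 = _mm_mul_ps(SWZ(a, 2, 0, 1, 0), SWZ(b, 1, 2, 0, 0)); /* a.zxyx * b.yzxx */
    __m128 t2 = _mm_mul_ps(SWZ(a, 3, 3, 3, 1), SWZ(b, 0, 1, 2, 1)); /* a.wwwy * b.xyzy */
    __m128 t3 = _mm_mul_ps(SWZ(a, 1, 2, 0, 2), SWZ(b, 2, 0, 1, 2)); /* a.yzxz * b.zxyz */

    t2 = _mm_xor_ps(t2, flip_w);  /* negate the ay*by term in lane w */
    t3 = _mm_xor_ps(t3, flip_w);  /* negate the az*bz term in lane w */
    return _mm_add_ps(_mm_sub_ps(t0, t1), _mm_add_ps(t2, t3));
}

/* Scalar reference with the same (x, y, z, w) layout. */
static void quat_mul_ref(const float a[4], const float b[4], float r[4])
{
    r[0] = a[3]*b[0] + a[0]*b[3] + a[1]*b[2] - a[2]*b[1];
    r[1] = a[3]*b[1] - a[0]*b[2] + a[1]*b[3] + a[2]*b[0];
    r[2] = a[3]*b[2] + a[0]*b[1] - a[1]*b[0] + a[2]*b[3];
    r[3] = a[3]*b[3] - a[0]*b[0] - a[1]*b[1] - a[2]*b[2];
}

/* Compares both versions on small integers, where float math is exact. */
static void quat_mul_selftest(void)
{
    const float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float b[4] = {5.0f, 6.0f, 7.0f, 8.0f};
    float r[4], s[4];
    int i;

    quat_mul_ref(a, b, r);
    _mm_storeu_ps(s, quat_mul_sse(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    for (i = 0; i < 4; ++i)
        assert(r[i] == s[i]);
}
```

Four multiplies, two sign flips and three add/subtract steps match the arithmetic instruction mix in the listing; the remaining instructions there are the shuffles, register copies and the mask load.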