Re: [eigen] Vectorized quaternion multiplication.



hi,

thanks a lot,

At first glance I was not sure about the performance, because it needs
a lot of shuffle instructions, which are quite costly, so I benchmarked
it, and on my Core 2 your version is 1.5 times faster :) Then I swapped
shuffle_ps for the simpler PSHUFD instruction, and now it is almost 2x
faster, so really worth it :)

FYI I only changed vec4f_swizzle like this:

#define vec4f_swizzle(v,p,q,r,s) \
  (_mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(v), \
  ((s)<<6|(r)<<4|(q)<<2|(p)))))
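
A small sketch that puts the two swizzle variants side by side and checks that they select the same lanes. The shufps-based macro is only an assumption about what the original looked like, since the attached file is not reproduced in this message; the names `vec4f_swizzle_shufps`, `vec4f_swizzle_pshufd` and `swizzles_agree` are hypothetical.

```cpp
#include <emmintrin.h>  // SSE2: _mm_shuffle_epi32 and the cast intrinsics

// Assumed shufps-based form of the original macro (not shown in this thread).
#define vec4f_swizzle_shufps(v,p,q,r,s) \
  (_mm_shuffle_ps((v), (v), ((s)<<6|(r)<<4|(q)<<2|(p))))

// PSHUFD-based form, as given in the message above.
#define vec4f_swizzle_pshufd(v,p,q,r,s) \
  (_mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(v), \
                                      ((s)<<6|(r)<<4|(q)<<2|(p)))))

// Returns true when both macros yield (in[2], in[0], in[1], in[3])
// for the swizzle pattern (2,0,1,3).
bool swizzles_agree(const float in[4]) {
  const __m128 v = _mm_loadu_ps(in);
  float a[4], b[4];
  _mm_storeu_ps(a, vec4f_swizzle_shufps(v, 2, 0, 1, 3));
  _mm_storeu_ps(b, vec4f_swizzle_pshufd(v, 2, 0, 1, 3));
  for (int i = 0; i < 4; ++i) {
    if (a[i] != b[i]) return false;
  }
  return a[0] == in[2] && a[1] == in[0] && a[2] == in[1] && a[3] == in[3];
}
```

One plausible reason for the gain: shufps overwrites its first operand, so each swizzle of a value that is still needed costs an extra movaps copy (visible in the disassembly quoted below), while pshufd writes to a separate destination register.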

cheers,
gael

On Sat, Mar 7, 2009 at 10:24 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
> Hi,
>
> The attached file has the code for vectorized quaternion
> multiplication (SSE only). I have not made a patch because I do not
> have access to svn; my school's proxy apparently doesn't play nice
> with svn. The results match Eigen's results. The convention is
> that x, y, z are stored first, followed by the scalar part.
>
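
To make the storage convention concrete, here is a minimal scalar sketch of the Hamilton product with x, y, z stored first and the scalar part last. `QuatXYZW` and `quat_mul_scalar` are hypothetical names for illustration only; this is not Eigen's implementation.

```cpp
// Hypothetical plain-float quaternion: vector part x, y, z first, scalar w last.
struct QuatXYZW { float x, y, z, w; };

// Hamilton product a*b in that storage order.
QuatXYZW quat_mul_scalar(const QuatXYZW& a, const QuatXYZW& b) {
  QuatXYZW r;
  r.x = a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y;
  r.y = a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x;
  r.z = a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w;
  r.w = a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z;
  return r;
}
```

With this ordering the unit quaternion i is {1, 0, 0, 0}, j is {0, 1, 0, 0}, and their product i*j comes out as k = {0, 0, 1, 0}.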
> The assembly is 23 instructions, but they are highly pipelined.
>
> 00000000004009e0 <_Z8quat_mulU8__vectorfS_>:
>  4009e0:       0f 28 f9                movaps %xmm1,%xmm7
>  4009e3:       0f 28 f1                movaps %xmm1,%xmm6
>  4009e6:       0f 28 d8                movaps %xmm0,%xmm3
>  4009e9:       0f 28 e9                movaps %xmm1,%xmm5
>  4009ec:       0f 28 d0                movaps %xmm0,%xmm2
>  4009ef:       0f c6 f9 ff             shufps $0xff,%xmm1,%xmm7
>  4009f3:       0f c6 f1 09             shufps $0x9,%xmm1,%xmm6
>  4009f7:       0f c6 e9 64             shufps $0x64,%xmm1,%xmm5
>  4009fb:       0f 28 e0                movaps %xmm0,%xmm4
>  4009fe:       0f c6 d8 7f             shufps $0x7f,%xmm0,%xmm3
>  400a02:       0f c6 d0 89             shufps $0x89,%xmm0,%xmm2
>  400a06:       0f c6 c9 92             shufps $0x92,%xmm1,%xmm1
>  400a0a:       0f c6 e0 12             shufps $0x12,%xmm0,%xmm4
>  400a0e:       0f 59 dd                mulps  %xmm5,%xmm3
>  400a11:       0f 59 d1                mulps  %xmm1,%xmm2
>  400a14:       0f 28 0d 45 0f 20 00    movaps 0x200f45(%rip),%xmm1        # 601960 <_ZL9quat_mask>
>  400a1b:       0f 59 e6                mulps  %xmm6,%xmm4
>  400a1e:       0f 59 c7                mulps  %xmm7,%xmm0
>  400a21:       0f 57 d1                xorps  %xmm1,%xmm2
>  400a24:       0f 57 d9                xorps  %xmm1,%xmm3
>  400a27:       0f 5c c4                subps  %xmm4,%xmm0
>  400a2a:       0f 58 d3                addps  %xmm3,%xmm2
>  400a2d:       0f 58 c2                addps  %xmm2,%xmm0
>  400a30:       c3                              retq
>  400a31:       66 66 66 66 66 66 2e    nopw   %cs:0x0(%rax,%rax,1)
>  400a38:       0f 1f 84 00 00 00 00
>  400a3f:       00
>
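
The listing above can be read back into intrinsics. What follows is a reconstruction from the shufps immediates, not the attached file itself: the function names `quat_mul_sketch` and `quat_mul_xyzw` are hypothetical, and the content of `quat_mask` is assumed to be a sign-bit flip on the w lane only, which is what makes the arithmetic come out right.

```cpp
#include <emmintrin.h>  // SSE2: _mm_shuffle_epi32 and the cast intrinsics

// PSHUFD-based swizzle: result lanes are (v[p], v[q], v[r], v[s]).
#define vec4f_swizzle(v,p,q,r,s) \
  (_mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(v), \
                                      ((s)<<6|(r)<<4|(q)<<2|(p)))))

// Quaternion product a*b with lanes ordered x, y, z, w.
// Swizzle patterns are decoded from the shufps immediates in the listing.
static inline __m128 quat_mul_sketch(__m128 a, __m128 b) {
  // Assumed content of quat_mask: sign bit set in the w lane only.
  const __m128 flip_w = _mm_set_ps(-0.0f, 0.0f, 0.0f, 0.0f);

  // a.xyzw * b.wwww  (immediate 0xff)
  __m128 t0 = _mm_mul_ps(a, vec4f_swizzle(b, 3, 3, 3, 3));
  // a.zxyx * b.yzxx  (immediates 0x12 and 0x09)
  __m128 t1 = _mm_mul_ps(vec4f_swizzle(a, 2, 0, 1, 0),
                         vec4f_swizzle(b, 1, 2, 0, 0));
  // a.wwwy * b.xyzy  (immediates 0x7f and 0x64)
  __m128 t2 = _mm_mul_ps(vec4f_swizzle(a, 3, 3, 3, 1),
                         vec4f_swizzle(b, 0, 1, 2, 1));
  // a.yzxz * b.zxyz  (immediates 0x89 and 0x92)
  __m128 t3 = _mm_mul_ps(vec4f_swizzle(a, 1, 2, 0, 2),
                         vec4f_swizzle(b, 2, 0, 1, 2));

  // Negate the w lane of the two terms that are subtracted in the scalar part.
  t2 = _mm_xor_ps(t2, flip_w);
  t3 = _mm_xor_ps(t3, flip_w);

  // Same combination order as the subps/addps/addps tail of the listing.
  return _mm_add_ps(_mm_sub_ps(t0, t1), _mm_add_ps(t3, t2));
}

// Convenience wrapper on plain arrays {x, y, z, w}; unaligned loads and stores.
void quat_mul_xyzw(const float a[4], const float b[4], float out[4]) {
  _mm_storeu_ps(out, quat_mul_sketch(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

Four multiplies, two sign flips, and three adds or subtracts, with everything else being data movement, which matches the shape of the 23-instruction listing.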
> I checked Eigen's code, and it has ~55 instructions, so this is
> clearly a win. With this I hope the major arguments against the
> quaternion class being scalar will go away. If someone can build a
> scaffolding for SIMD quaternion code, I'll be happy to chip in with
> optimized implementations. Add and subtract should be fairly trivial.
> I am interested in implementing a fast quaternion slerp as well. Fast
> == approximate.
>
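
The message does not say which approximation is intended for the fast slerp. One common stand-in is normalized linear interpolation (nlerp), sketched here in scalar form as an assumption rather than as the planned method; `QuatXYZW` and `nlerp` are hypothetical names.

```cpp
#include <cmath>

// Hypothetical plain-float quaternion: x, y, z first, scalar w last.
struct QuatXYZW { float x, y, z, w; };

// Normalized linear interpolation between unit quaternions a and b.
// Follows the same arc as slerp but not at constant angular speed.
// Inputs are assumed to be non-zero; there is no guard against a zero norm.
QuatXYZW nlerp(const QuatXYZW& a, const QuatXYZW& b, float t) {
  const float dot = a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
  const float wa = 1.0f - t;
  // Negate b's weight when the quaternions lie in opposite hemispheres,
  // so the interpolation takes the shorter arc.
  const float wb = (dot < 0.0f) ? -t : t;

  QuatXYZW r = { wa * a.x + wb * b.x, wa * a.y + wb * b.y,
                 wa * a.z + wb * b.z, wa * a.w + wb * b.w };

  const float inv_len =
      1.0f / std::sqrt(r.x * r.x + r.y * r.y + r.z * r.z + r.w * r.w);
  r.x *= inv_len;
  r.y *= inv_len;
  r.z *= inv_len;
  r.w *= inv_len;
  return r;
}
```

The blend and the normalization are both plain four-lane operations, so this form maps onto SSE without any trigonometric calls.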
> On the topic of complex math, I feel Eigen should make specialized
> classes. It is highly unlikely that gcc will be able to vectorize
> complex multiplication anytime soon.
>
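
To illustrate why complex multiplication resists auto-vectorization, here is a minimal hand-written SSE sketch: the real and imaginary parts need different shuffles and a lane-selective sign flip. The packing (two complex floats per register, interleaved as re, im) and the names `cmul2_ps` and `cmul2` are assumptions for illustration, not anything Eigen provides.

```cpp
#include <xmmintrin.h>  // SSE: _mm_shuffle_ps, _MM_SHUFFLE

// Multiplies two pairs of complex floats packed as (re0, im0, re1, im1).
static inline __m128 cmul2_ps(__m128 a, __m128 b) {
  // Sign bit set in lanes 0 and 2 (the real-part lanes).
  const __m128 flip_even = _mm_set_ps(0.0f, -0.0f, 0.0f, -0.0f);

  __m128 b_re   = _mm_shuffle_ps(b, b, _MM_SHUFFLE(2, 2, 0, 0));  // (br0, br0, br1, br1)
  __m128 b_im   = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 3, 1, 1));  // (bi0, bi0, bi1, bi1)
  __m128 a_swap = _mm_shuffle_ps(a, a, _MM_SHUFFLE(2, 3, 0, 1));  // (ai0, ar0, ai1, ar1)

  __m128 t1 = _mm_mul_ps(a, b_re);                                // (ar*br, ai*br, ...)
  __m128 t2 = _mm_xor_ps(_mm_mul_ps(a_swap, b_im), flip_even);    // (-ai*bi, ar*bi, ...)
  return _mm_add_ps(t1, t2);                                      // (ar*br - ai*bi, ai*br + ar*bi, ...)
}

// Convenience wrapper on plain arrays; unaligned loads and stores.
void cmul2(const float a[4], const float b[4], float out[4]) {
  _mm_storeu_ps(out, cmul2_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

For example, packing (1+2i, i) and (3+4i, i) gives (-5+10i, -1).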
> Regards,
> --
> Rohit Garg
>
> http://rpg-314.blogspot.com/
>
> Senior Undergraduate
> Department of Physics
> Indian Institute of Technology
> Bombay
>


