Re: [eigen] New(?) way to make using SIMD easier



I'm also surprised by your results, so to complete the stats, here is what I get (Core 2, 64-bit):

g++-4.4 testmap.cc -I.. -O2 -DNDEBUG && ./a.out
 With simple function, iterations=6000000, elements=512 took 1.12206s. rate=2737.82 MS/s
 With VectorXf::Map, iterations=6000000, elements=512 took 0.857663s. rate=3581.83 MS/s
 With simple function, iterations=6000000, elements=512 took 1.07977s. rate=2845.04 MS/s
 With VectorXf::Map, iterations=6000000, elements=512 took 0.851847s. rate=3606.28 MS/s

g++-4.3 testmap.cc -I.. -O2 -DNDEBUG && ./a.out
 With simple function, iterations=6000000, elements=512 took 1.114s. rate=2757.63 MS/s
 With VectorXf::Map, iterations=6000000, elements=512 took 0.965249s. rate=3182.6 MS/s
 With simple function, iterations=6000000, elements=512 took 1.07293s. rate=2863.18 MS/s
 With VectorXf::Map, iterations=6000000, elements=512 took 0.924741s. rate=3322.01 MS/s

and in 32 bit mode:

g++-4.3 testmap.cpp -I.. -O2 -DNDEBUG -m32 -msse2 && ./a.out
 With simple function, iterations=6000000, elements=512 took 1.51738s. rate=2024.54 MS/s
 With VectorXf::Map, iterations=6000000, elements=512 took 1.62693s. rate=1888.22 MS/s
 With simple function, iterations=6000000, elements=512 took 1.46279s. rate=2100.1 MS/s
 With VectorXf::Map, iterations=6000000, elements=512 took 1.61396s. rate=1903.39 MS/s

g++-4.4 testmap.cpp -I.. -O2 -DNDEBUG -m32 -msse2 && ./a.out
 With simple function, iterations=6000000, elements=512 took 1.15444s. rate=2661.03 MS/s
 With VectorXf::Map, iterations=6000000, elements=512 took 0.869218s. rate=3534.21 MS/s
 With simple function, iterations=6000000, elements=512 took 1.0694s. rate=2872.64 MS/s
 With VectorXf::Map, iterations=6000000, elements=512 took 0.867609s. rate=3540.77 MS/s

Here, Eigen's version is clearly faster in every configuration except g++-4.3 in 32-bit mode, where it is slightly slower.
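
For reference, the MS/s figures above are consistent with iterations * elements / elapsed seconds, expressed in millions of samples. A minimal sketch of that arithmetic, assuming this is how testmap.cc derives its rate (the attachment is not reproduced here, and `rate_ms_per_s` is a hypothetical name):

```cpp
#include <cassert>

// Hypothetical helper, not taken from testmap.cc: throughput in millions of
// samples per second, given the loop counts and the measured wall time.
inline double rate_ms_per_s(double iterations, double elements, double seconds) {
    assert(seconds > 0.0);  // a zero or negative time would make the rate meaningless
    return iterations * elements / seconds / 1e6;
}
```

With iterations=6000000, elements=512 and 1.12206s, this gives roughly 2738 MS/s, in line with the first line of output.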

gael.

On Wed, Nov 25, 2009 at 10:08 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
2009/11/25 Mark Borgerding <mark@xxxxxxxxxxxxxx>:
> On 11/25/2009 12:43 PM, Benoit Jacob wrote:
>>
>> 2009/11/25 Mark Borgerding<mark@xxxxxxxxxxxxxx>:
>>
>>>
>>> On 11/24/2009 11:51 AM, Benoit Jacob wrote:
>>>
>>>>
>>>> VectorXf::Map(dstPtr,num)
>>>>   = VectorXf::Map(srcPtr1,num)
>>>>   + VectorXf::Map(srcPtr2,num);
>>>>
>>>
>>> I love this syntax and was excited to start to use it more in some of our
>>> legacy code.
>>> ...
>>> Then,  I did a benchmark  comparing the speed of the above to that of a
>>> very
>>>  simple C-style function using SSE(see "vector_add" in attached
>>> testmap.cc).
>>> The simple function was *much* faster with both the intel compiler (11.0
>>> 20081105)  and with g++ (4.4.1 20090725). See the output below.
>>>
>>
>> That's because I forgot to tell you that when the pointers are known
>> to be aligned, you need to tell that to Eigen, otherwise it can't
>> guess it (at least not without incurring a constant overhead).
>>
>> So just use MapAligned() instead of Map()  (note: that requires the
>> development branch). Actually I tried and now it has exactly the same
>> speed as your simple version:
>>
>> $ g++ testmap.cc -I ../eigen -O2 -DNDEBUG -o t&&  ./t
>>
>
> You did not use any -msse* flags.  So neither version is using SIMD.

I am on a 64-bit machine, so SSE2 is implicit. Both versions are using
SIMD. Actually here is the assembly generated by the Eigen version:

       xorl    %eax, %eax
       .p2align 4,,10
       .p2align 3
.L21:
       movaps  (%rbp,%rax), %xmm0
       addps   (%r12,%rax), %xmm0
       movaps  %xmm0, (%rbx,%rax)
       addq    $16, %rax
       cmpq    $2048, %rax
       jne     .L21
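
To read the loop above: each trip loads 16 bytes (four floats) with movaps, adds four more with addps, stores the result, and stops once the byte offset reaches 2048, i.e. 512 floats in 128 trips. A scalar C++ sketch of the same traversal follows. The names `is_aligned_16` and `add_packets` are hypothetical and come from neither testmap.cc nor Eigen, and the sketch assumes 4-byte floats.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static_assert(sizeof(float) == 4, "sketch assumes 4-byte floats");

// Hypothetical helper: true when p sits on a 16-byte boundary, which the
// aligned SSE load/store instruction movaps requires.
inline bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Scalar stand-in for the assembly loop: walk the buffers by byte offset in
// steps of 16 bytes, adding four floats per trip. Returns the trip count.
inline std::size_t add_packets(float* dst, const float* a, const float* b,
                               std::size_t n) {
    assert(n % 4 == 0);  // whole packets only, as in the 512-element benchmark
    assert(is_aligned_16(dst) && is_aligned_16(a) && is_aligned_16(b));
    const std::size_t total_bytes = n * sizeof(float);  // 2048 when n == 512
    std::size_t trips = 0;
    for (std::size_t offset = 0; offset != total_bytes; offset += 16) {
        const std::size_t i = offset / sizeof(float);
        for (std::size_t k = 0; k < 4; ++k) {
            dst[i + k] = a[i + k] + b[i + k];  // one lane of the addps
        }
        ++trips;
    }
    return trips;
}
```

Called on three 16-byte-aligned arrays of 512 floats, `add_packets` makes 128 trips, matching the `addq $16` / `cmpq $2048` pair in the listing.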

> After switching to MapAligned (from hg tip), it helped a little, but I
> still see almost a 2x difference.
>
> g++ -I.. -O3 -msse -msse2 -msse3    -c -o testmap.o testmap.cc
> g++ -o testmap testmap.o
> ./testmap
>  With simple function, iterations=6000000, elements=512 took 0.690981s.
> rate=4445.85 MS/s
>  With VectorXf::Map, iterations=6000000, elements=512 took 1.29193s.
> rate=2377.84 MS/s
>  With simple function, iterations=6000000, elements=512 took 0.671556s.
> rate=4574.45 MS/s
>  With VectorXf::Map, iterations=6000000, elements=512 took 1.27064s.
> rate=2417.67 MS/s

Strange. I can't reproduce this here, although I too have gcc 4.4,
even using the exact same command lines as you do.

Can you try -DNDEBUG ? Here it makes a small but noticeable difference.
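
The reason -DNDEBUG can matter is that the standard assert() macro expands to nothing once NDEBUG is defined, so any assert-based runtime checks drop out of the inner loop. A minimal sketch of that mechanism, using a hypothetical accessor (`checked_at`) rather than anything from Eigen or testmap.cc:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical checked accessor: the range check costs a compare and branch
// per call unless the file is compiled with -DNDEBUG, which removes it.
inline float checked_at(const float* data, std::size_t size, std::size_t i) {
    (void)size;  // avoids an unused-parameter warning when assert is compiled out
    assert(i < size && "index out of range");
    return data[i];
}

// Reports whether assert() is active in this translation unit.
constexpr bool asserts_enabled() {
#ifdef NDEBUG
    return false;
#else
    return true;
#endif
}
```

Building the same file with and without -DNDEBUG and comparing timings is a quick way to see how much such checks cost in a given loop.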

Otherwise the most likely explanation is the difference between x86
and x86-64. Can you generate the asm and send it? Find attached a
modified source file that emits asm comments at the right place (as I
did above).

Cheers
Benoit





