Re: [eigen] New(?) way to make using SIMD easier

On Wed, Nov 25, 2009 at 10:08 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:

2009/11/25 Mark Borgerding <mark@xxxxxxxxxxxxxx>:
> On 11/25/2009 12:43 PM, Benoit Jacob wrote:
>>
>> 2009/11/25 Mark Borgerding<mark@xxxxxxxxxxxxxx>:
>>
>>>
>>> On 11/24/2009 11:51 AM, Benoit Jacob wrote:
>>>
>>>>
>>>> VectorXf::Map(dstPtr,num)
>>>> = VectorXf::Map(srcPtr1,num)
>>>> + VectorXf::Map(srcPtr2,num);
>>>>
>>>
>>> I love this syntax and was excited to start to use it more in some of our
>>> legacy code.
>>> ...
>>> Then, I did a benchmark comparing the speed of the above to that of a very
>>> simple C-style function using SSE (see "vector_add" in attached testmap.cc).
>>> The simple function was *much* faster with both the intel compiler (11.0
>>> 20081105) and with g++ (4.4.1 20090725). See the output below.
>>>
>>
>> That's because I forgot to tell you that when the pointers are known
>> to be aligned, you need to tell that to Eigen, otherwise it can't
>> guess it (at least not without incurring a constant overhead).
>>
>> So just use MapAligned() instead of Map() (note: that requires the
>> development branch). Actually I tried and now it has exactly the same
>> speed as your simple version:
>>
>> $ g++ testmap.cc -I ../eigen -O2 -DNDEBUG -o t && ./t
>>
>
> You did not use any -msse* flags. So neither version is using SIMD.

I am on a 64-bit (x86-64) machine, where SSE2 is enabled by default, so
both versions are using SIMD. In fact, here is the assembly generated
for the Eigen version:

        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L21:
        movaps  (%rbp,%rax), %xmm0
        addps   (%r12,%rax), %xmm0
        movaps  %xmm0, (%rbx,%rax)
        addq    $16, %rax
        cmpq    $2048, %rax
        jne     .L21
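
For comparison, here is a minimal sketch of a hand-written loop of the
kind I would expect to compile to much the same three instructions per
iteration (aligned load, add, aligned store). The name
vector_add_aligned is hypothetical, and this is not the vector_add from
testmap.cc, which is not reproduced in this message. It assumes all
three pointers are 16-byte aligned and that n is a multiple of 4, and
it falls back to a scalar loop when SSE is not available:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics: _mm_load_ps, _mm_add_ps, _mm_store_ps
#endif

// Hypothetical illustration, not the function from testmap.cc.
// Preconditions (assumed): dst, a and b are 16-byte aligned, n % 4 == 0.
void vector_add_aligned(float* dst, const float* a, const float* b,
                        std::size_t n) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);
    assert(n % 4 == 0);
#if defined(__SSE__)
    // Four floats (16 bytes) per iteration, like the addq $16 step above.
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 x = _mm_load_ps(a + i);          // aligned load
        x = _mm_add_ps(x, _mm_load_ps(b + i));  // packed single-precision add
        _mm_store_ps(dst + i, x);               // aligned store
    }
#else
    // Scalar fallback for targets without SSE.
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
#endif
}
```

With num = 512 floats this loop covers 2048 bytes in 128 iterations,
which matches the cmpq $2048 bound in the listing above.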

> After switching to MapAligned (from hg tip), it helped a little, but I
> still see almost a 2x difference.
>
> g++ -I.. -O3 -msse -msse2 -msse3 -c -o testmap.o testmap.cc
> g++ -o testmap testmap.o
> ./testmap
> With simple function, iterations=6000000, elements=512 took 0.690981s.
> rate=4445.85 MS/s
> With VectorXf::Map, iterations=6000000, elements=512 took 1.29193s.
> rate=2377.84 MS/s
> With simple function, iterations=6000000, elements=512 took 0.671556s.
> rate=4574.45 MS/s
> With VectorXf::Map, iterations=6000000, elements=512 took 1.27064s.
> rate=2417.67 MS/s

Strange. I can't reproduce this here, although I too have gcc 4.4,
even when using exactly the same command lines as you do.

Can you try adding -DNDEBUG? Here it makes a small but noticeable difference.

Otherwise the most likely explanation is the difference between x86
and x86-64. Can you generate the asm and send it? Find attached a
modified source file that emits asm comments at the right place (as I
did above).
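
To check the x86 versus x86-64 hypothesis on your side, something along
these lines may help. It is only a sketch: the function names are
made up for illustration, and it relies on the predefined macros
__x86_64__, __SSE__, __SSE2__ and __SSE3__ that GCC sets according to
the target and the -msse* flags:

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: highest SSE level enabled at compile time,
// judged only from the compiler's predefined macros.
constexpr int default_sse_level() {
#if defined(__SSE3__)
    return 3;
#elif defined(__SSE2__)
    return 2;
#elif defined(__SSE__)
    return 1;
#else
    return 0;
#endif
}

// Hypothetical helper: true when compiling for x86-64 (GCC/Clang macro).
constexpr bool targets_x86_64() {
#if defined(__x86_64__)
    return true;
#else
    return false;
#endif
}

void report_simd_baseline() {
    // On x86-64, SSE2 is part of the baseline, so this should hold even
    // without any -msse* flag (assuming SSE was not explicitly disabled).
    assert(!targets_x86_64() || default_sse_level() >= 2);
    std::printf("x86-64: %s, default SSE level: %d\n",
                targets_x86_64() ? "yes" : "no", default_sse_level());
}
```

Calling report_simd_baseline() from a build with your exact flags, and
again without the -msse* flags, should show whether the two builds
really target the same instruction set.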

Cheers
Benoit