Re: [eigen] two things

On Thu, Jun 26, 2008 at 7:26 PM, Benoît Jacob <jacob@xxxxxxxxxxxxxxx> wrote:
> On Thursday 26 June 2008 18:55:22 Gael Guennebaud wrote:
>> yes, exactly. but I'm still puzzled by these results since on a 2GHz
>> core2  we could expect a peak performance of 8 GFlops and we are far
>> far away. I've also tried c = a + b; => even slower. On the other hand
>> with a += a;  I could reach ~ 4.5 GFlops. For comparison, our
>> optimized matrix product on 1024x1024 matrices achieves ~9 GFlops! yes
>> 9! this is because the CPU can do an "add" and a "mul" at the same
>> time... I guess the trick would be to do some prefetching but I did
>> not manage to get any improvements so far...
> I was thinking the same;
> Here is what the critical loop looks like in assembly:
> .L68:
>        movaps  (%edx,%eax,4), %xmm0
>        addps   (%esi,%eax,4), %xmm0
>        movaps  %xmm0, (%edx,%eax,4)
>        addl    $4, %eax
>        cmpl    %eax, %ecx
>        jg      .L68
> So, for one productive instruction (the addps) there are 2 mov instructions
> (and I don't count the last 3 instructions, which go away once we peel the
> loop). Could that somehow be improved?

the only way I know to improve that is to do loop peeling and reduce
the dependency between the instructions... basically this is what I
tried to do in benchAddVec... but I only get improvement from the loop
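
To make the idea concrete, here is a minimal sketch of such a peeled loop for a += b, assuming an x86 compiler with SSE. This is not the benchAddVec code; the function name add_assign_unrolled is hypothetical, and it uses unaligned loads/stores where the loop above uses movaps on aligned data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

// Hypothetical helper (not Eigen code): a[i] += b[i] for all i.
// Each iteration of the main loop processes 8 floats through two
// independent load/add/store chains, so the counter update, compare
// and branch are paid once per 8 floats instead of once per 4.
void add_assign_unrolled(std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float* pa = a.data();
    const float* pb = b.data();
    const std::size_t n = a.size();

    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m128 a0 = _mm_loadu_ps(pa + i);
        __m128 a1 = _mm_loadu_ps(pa + i + 4);
        a0 = _mm_add_ps(a0, _mm_loadu_ps(pb + i));
        a1 = _mm_add_ps(a1, _mm_loadu_ps(pb + i + 4));
        _mm_storeu_ps(pa + i, a0);
        _mm_storeu_ps(pa + i + 4, a1);
    }
    // Scalar tail for the remaining 0..7 elements.
    for (; i < n; ++i) {
        pa[i] += pb[i];
    }
}
```

Note that this only reduces the loop overhead; the number of loads and stores per addps stays the same.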

> By the way, I tried this benchmark without vectorization, and got 0.4 GFlops
> at 400x400 size (where the cost of not linearizing is negligible) so the
> benefit of vectorization here is somewhere between +25% and +50%.

yes the improvement from vectorization is very low here... for me it
is ~ 25% :( This is because we are limited by memory accesses.
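
A rough way to see the memory limit is to count the bytes moved per add. The sketch below is only a back-of-the-envelope model: the helper names are hypothetical, and it ignores caches, write-allocate traffic and prefetching.

```cpp
#include <cassert>

// Hypothetical helper: bytes of memory traffic per floating-point add,
// given how many floats are loaded and stored for each add performed.
constexpr double bytes_per_flop(int loads_per_add, int stores_per_add) {
    return sizeof(float) * static_cast<double>(loads_per_add + stores_per_add);
}

// Bandwidth in GB/s (1G = 10^9) needed to sustain a given GFlops rate
// under this simple model.
inline double required_gb_per_s(double gflops, double bytes_per_flop_value) {
    assert(gflops >= 0.0 && bytes_per_flop_value >= 0.0);
    return gflops * bytes_per_flop_value;
}

// Kernels discussed in this thread:
constexpr double kAddAssignAB = bytes_per_flop(2, 1);  // a += b: load a, load b, store a
constexpr double kAddAssignAA = bytes_per_flop(1, 1);  // a += a: load a, store a
constexpr double kSum         = bytes_per_flop(1, 0);  // sum(): load only
```

Under this model a += b moves 12 bytes per add, so, as an illustrative figure, sustaining 0.5 GFlops would need required_gb_per_s(0.5, kAddAssignAB) = 6 GB/s, three times the traffic per add of a load-only reduction.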

> By comparison, I made a simple benchmark for sum() of a big float vector
> (really just modifying vdw_new). There, vectorization speeds up by 4x; and
> when it is enabled I get 1.7 GFlop (counting 1G = 10^9) on my 1.66 GHz CPU.
> So, much better. Not the theoretical maximum, but since this benchmark is
> memory intensive, doing only one add per loaded number, I can believe that
> 1.7 GFlop is all that my laptop's memory allows. Perhaps the better flops in
> the matrix product are because (especially with your cache-friendly code) it
> is more computation intensive relative to the amount of memory accesses.

yes this example is even more favorable than a += a; because there is
no store, only a single load. Actually the core of the matrix product
is quite similar to .sum(), which explains why it works much better.
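
For reference, a load-only reduction of this kind can be sketched as follows, assuming an x86 compiler with SSE. The name sum_sse is hypothetical and this is not Eigen's implementation of sum(). It uses two accumulators so that consecutive addps instructions do not all depend on the same register; because the additions happen in a different order than a plain sequential loop, the result can differ slightly in the last bits for general float data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper (not Eigen code): sum of all elements of v.
// One load per addps and no store inside the main loop.
float sum_sse(const std::vector<float>& v) {
    const float* p = v.data();
    const std::size_t n = v.size();

    // Two independent accumulators break the addps dependency chain.
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();

    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        acc0 = _mm_add_ps(acc0, _mm_loadu_ps(p + i));
        acc1 = _mm_add_ps(acc1, _mm_loadu_ps(p + i + 4));
    }

    // Horizontal reduction of the 4 lanes, done once after the loop.
    float lanes[4];
    _mm_storeu_ps(lanes, _mm_add_ps(acc0, acc1));
    float total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    // Scalar tail for the remaining 0..7 elements.
    for (; i < n; ++i) {
        total += p[i];
    }
    return total;
}
```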

> Here is the performance-critical part of that sum() benchmark:
> .L18:
>        addps   (%ebx,%eax,4), %xmm1
>        addl    $4, %eax
>        cmpl    %eax, %edx
>        jg      .L18
> Cheers,
> Benoit
