Re: [eigen] two things


To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: Re: [eigen] two things
From: Benoît Jacob <jacob@xxxxxxxxxxxxxxx>
Date: Thu, 26 Jun 2008 19:26:55 +0200

On Thursday 26 June 2008 18:55:22 Gael Guennebaud wrote:
> yes, exactly. but I'm still puzzled by these results since on a 2GHz
> core2 we could expect a peak performance of 8 GFlops and we are far
> far away. I've also tried c = a + b; => even slower. On the other hand
> with a += a; I could reach ~ 4.5 GFlops. For comparison, our
> optimized matrix product on 1024x1024 matrices achieves ~9 GFlops! yes
> 9! this is because the CPU can do an "add" and a "mul" at the same
> time... I guess the trick would be to do some prefetching but I did
> not manage to get any improvements so far...

I was thinking the same.

Here is what the critical loop looks like in assembly:

.L68:
        movaps  (%edx,%eax,4), %xmm0
        addps   (%esi,%eax,4), %xmm0
        movaps  %xmm0, (%edx,%eax,4)
        addl    $4, %eax
        cmpl    %eax, %ecx
        jg      .L68

So, for one productive instruction (the addps) there are 2 mov instructions
(and I don't count the last 3 instructions, which go away once we peel the
loop). Could that somehow be improved?
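
For reference, here is a minimal C++ sketch of what the loop above computes,
written with SSE intrinsics so that each line maps onto one of the
instructions. It is only an illustration, not the code Eigen generates: the
function name add_assign is hypothetical, and the sketch assumes that n is a
multiple of 4 and that both arrays are 16-byte aligned (which movaps
requires). A scalar fallback is included for targets without SSE.

```cpp
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Hypothetical helper: a[i] += b[i] for i in [0, n).
// Assumes n is a multiple of 4 and a, b are 16-byte aligned.
void add_assign(float* a, const float* b, std::size_t n) {
#if defined(__SSE__)
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);   // movaps (%edx,%eax,4), %xmm0
        __m128 vb = _mm_load_ps(b + i);   // folded into addps as a memory operand
        va = _mm_add_ps(va, vb);          // addps: the only productive instruction
        _mm_store_ps(a + i, va);          // movaps %xmm0, (%edx,%eax,4)
    }
#else
    // Scalar fallback for targets without SSE.
    for (std::size_t i = 0; i < n; ++i) {
        a[i] += b[i];
    }
#endif
}
```

Each iteration reads two packets and writes one, so there are three memory
accesses for every four additions, whatever the instruction count.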

By the way, I tried this benchmark without vectorization and got 0.4 GFlops
at size 400x400 (where the cost of not linearizing is negligible), so the
benefit of vectorization here is somewhere between +25% and +50%.

By comparison, I made a simple benchmark for sum() of a big float vector
(really just modifying vdw_new). There, vectorization gives a 4x speedup, and
with it enabled I get 1.7 GFlops (counting 1G = 10^9) on my 1.66 GHz CPU.
So, much better. It is not the theoretical maximum, but since this benchmark
is memory intensive, doing only one add per loaded number, I can believe that
1.7 GFlops is all that my laptop's memory allows. Perhaps the matrix product
gets better flops because (especially with your cache-friendly code) it does
more computation relative to the number of memory accesses.
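
To put rough numbers on the memory-bound argument, here is a small sketch of
the arithmetic. It assumes every float comes from main memory (no cache
reuse) and counts only the loads and stores visible in the loops: sum() reads
4 bytes per add, while a += b reads 8 bytes and writes 4 bytes per add. The
function and constant names are made up for this illustration.

```cpp
// Bytes moved per floating-point add, assuming 4-byte floats and no
// cache reuse (an assumption of this sketch, not a measurement).
const double kSumBytesPerFlop = 4.0;         // sum(): one load per add
const double kAddAssignBytesPerFlop = 12.0;  // a += b: two loads, one store

// Memory traffic in GB/s (1 GB = 10^9 bytes) needed to sustain `gflops`
// when every flop moves `bytes_per_flop` bytes to or from memory.
double required_bandwidth_gb_s(double gflops, double bytes_per_flop) {
    return gflops * bytes_per_flop;
}
```

Under these assumptions, required_bandwidth_gb_s(1.7, kSumBytesPerFlop) gives
6.8 GB/s for the sum() result, and running a += b at the 8 GFlops peak would
need required_bandwidth_gb_s(8.0, kAddAssignBytesPerFlop) = 96 GB/s.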

Here is the performance-critical part of that sum() benchmark:

.L18:
        addps   (%ebx,%eax,4), %xmm1
        addl    $4, %eax
        cmpl    %eax, %edx
        jg      .L18
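
As a hedged illustration, the loop above corresponds to something like the
following C++ sketch, which keeps four partial sums in one SSE register and
combines them at the end. This is not the actual benchmark source: the name
sum_floats is hypothetical, and the sketch assumes n is a multiple of 4 and
that the data is 16-byte aligned. A scalar fallback covers targets without
SSE.

```cpp
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Hypothetical helper: returns the sum of v[0..n).
// Assumes n is a multiple of 4 and v is 16-byte aligned.
float sum_floats(const float* v, std::size_t n) {
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();        // four partial sums, kept in %xmm1
    for (std::size_t i = 0; i < n; i += 4) {
        // addps (%ebx,%eax,4), %xmm1: the load is folded into the add
        acc = _mm_add_ps(acc, _mm_load_ps(v + i));
    }
    // Combine the four lanes after the loop (not part of the hot loop).
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, acc);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#else
    // Scalar fallback for targets without SSE.
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        total += v[i];
    }
    return total;
#endif
}
```

Because the vectorized version adds the numbers in a different order than a
scalar loop, the two results can differ in the last bits for general
floating-point input.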

Cheers,

Benoit

