Re: [eigen] two things


*To*: eigen@xxxxxxxxxxxxxxxxxxx
*Subject*: Re: [eigen] two things
*From*: Benoît Jacob <jacob@xxxxxxxxxxxxxxx>
*Date*: Thu, 26 Jun 2008 19:26:55 +0200

On Thursday 26 June 2008 18:55:22 Gael Guennebaud wrote:

> yes, exactly. but I'm still puzzled by these results since on a 2GHz
> core2 we could expect a peak performance of 8 GFlops and we are far
> far away. I've also tried c = a + b; => even slower. On the other hand
> with a += a; I could reach ~ 4.5 GFlops. For comparison purposes, our
> optimized matrix product on 1024x1024 matrices achieves ~9 GFlops! yes
> 9! this is because the CPU can do an "add" and a "mul" at the same
> time... I guess the trick would be to do some prefetching but I did
> not manage to get any improvements so far...

I was thinking the same. Here is what the critical loop looks like in assembly:

    .L68:
        movaps  (%edx,%eax,4), %xmm0
        addps   (%esi,%eax,4), %xmm0
        movaps  %xmm0, (%edx,%eax,4)
        addl    $4, %eax
        cmpl    %eax, %ecx
        jg      .L68

So, for one productive instruction (the addps) there are 2 mov instructions (and I don't count the last 3 instructions, which go away once we peel the loop). Could that somehow be improved?

By the way, I tried this benchmark without vectorization, and got 0.4 GFlops at 400x400 size (where the cost of not linearizing is negligible), so the benefit of vectorization here is somewhere between +25% and +50%.

By comparison, I made a simple benchmark for sum() of a big float vector (really just modifying vdw_new). There, vectorization gives a 4x speedup, and with it enabled I get 1.7 GFlops (counting 1G = 10^9) on my 1.66 GHz CPU. So, much better. It's not the theoretical maximum, but since this benchmark is memory intensive, doing only one add per loaded number, I can believe that 1.7 GFlops is all that my laptop's memory allows.

Perhaps the matrix product gets better flops because (especially with your cache-friendly code) it does more computation relative to the number of memory accesses.

Here is the performance-critical part of that sum() benchmark:

    .L18:
        addps   (%ebx,%eax,4), %xmm1
        addl    $4, %eax
        cmpl    %eax, %edx
        jg      .L18

Cheers,
Benoit
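For anyone wanting to reproduce the comparison between the two loops discussed above, here is a minimal sketch in plain C++. It is not the Eigen code nor the vdw_new benchmark: the helper names (add_inplace, sum, gflops, time_seconds) are hypothetical, plain loops stand in for the Eigen expressions, and whether the compiler emits addps/movaps for them depends on your compiler and flags. The point is only the memory traffic per add: add_inplace does two loads and one store per addition, while sum does a single load.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for "a += b": per element, two loads, one add, one store.
void add_inplace(std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] += b[i];
}

// Hypothetical stand-in for sum(): per element, one load and one add, no store.
float sum(const std::vector<float>& v) {
    float s = 0.0f;
    for (float x : v)
        s += x;
    return s;
}

// Converts a flop count and an elapsed time to GFlops, counting 1G = 10^9.
double gflops(double flops, double seconds) {
    return flops / seconds / 1e9;
}

// Calls f() reps times and returns the elapsed wall-clock time in seconds.
template <typename F>
double time_seconds(F&& f, int reps) {
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

To get a figure comparable to the ones quoted in this thread, time many repetitions over vectors of n floats and pass n * reps as the flop count to gflops. Make sure the result of sum is actually used (for example, accumulated and printed), otherwise an optimizing compiler may drop the loop entirely.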

