- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] two things
- From: "Gael Guennebaud" <gael.guennebaud@xxxxxxxxx>
- Date: Thu, 26 Jun 2008 19:41:13 +0200
On Thu, Jun 26, 2008 at 7:26 PM, Benoît Jacob <jacob@xxxxxxxxxxxxxxx> wrote:
> On Thursday 26 June 2008 18:55:22 Gael Guennebaud wrote:
>> yes, exactly. but I'm still puzzled by these results since on a 2GHz
>> core2 we could expect a peak performance of 8 GFlops and we are far,
>> far away. I've also tried c = a + b; => even slower. On the other hand
>> with a += a; I could reach ~4.5 GFlops. For comparison, our
>> optimized matrix product on 1024x1024 matrices achieves ~9 GFlops! yes,
>> 9! this is because the CPU can do an "add" and a "mul" at the same
>> time... I guess the trick would be to do some prefetching but I have
>> not managed to get any improvement so far...
> I was thinking the same;
> Here is what the critical loop looks like in assembly:
> movaps (%edx,%eax,4), %xmm0
> addps (%esi,%eax,4), %xmm0
> movaps %xmm0, (%edx,%eax,4)
> addl $4, %eax
> cmpl %eax, %ecx
> jg .L68
> So, for one productive instruction (the addps) there are 2 mov instructions
> (and I don't count the last 3 instructions, which go away once we peel the
> loop). Could that somehow be improved?
the only way I know to improve that is to do loop peeling and to reduce
the dependency between the instructions... basically this is what I
tried to do in benchAddVec... but I only got an improvement from the loop
peeling.
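For reference, here is a minimal sketch of what I mean by peeling (the names
and the unroll factor are illustrative, not the actual benchAddVec code):

```cpp
#include <cstddef>

// Illustrative sketch only -- not the actual benchAddVec code.
// Unrolling by 8 amortizes the counter/compare/branch overhead that the
// asm above pays for every single addps, and the 8 adds are independent
// of each other, so the CPU is free to overlap them.
void add_in_place_unrolled(float* a, const float* b, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
        a[i + 4] += b[i + 4];
        a[i + 5] += b[i + 5];
        a[i + 6] += b[i + 6];
        a[i + 7] += b[i + 7];
    }
    for (; i < n; ++i)   // remaining tail elements
        a[i] += b[i];
}
```

Of course the compiler may vectorize this with packed SSE ops, but the
point is only to cut the per-add loop overhead, and even so the loop stays
bound by the two movs per add.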
> By the way, I tried this benchmark without vectorization, and got 0.4 GFlops
> at 400x400 size (where the cost of not linearizing is negligible) so the
> benefit of vectorization here is somewhere between +25% and +50%.
yes, the gain from vectorization is very low here... for me it
is ~25% :( This is because we are limited by memory accesses.
> By comparison, I made a simple benchmark for sum() of a big float vector
> (really just modifying vdw_new). There, vectorization speeds up by 4x; and
> when it is enabled I get 1.7 GFlop (counting 1G = 10^9) on my 1.66 GHz CPU.
> So, much better. Not the theoretical maximum, but since this benchmark is
> memory intensive, doing only one add per loaded number, I can believe that
> 1.7 GFlop is all my laptop's memory allows. Perhaps the better flops in
> the matrix product are because (especially with your cache-friendly code) it
> is more computation-intensive relative to the amount of memory accesses.
yes, this example is even more favorable than a += a; because there is
no store, only a single load per element. Actually the core of the matrix
product is quite similar to .sum(), which explains why it works much better.
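To illustrate what I mean (a hand-written sketch, not Eigen's actual .sum()
implementation): each element costs one load and one add, and splitting the
accumulator breaks the serial dependency of "s += x[i]", which is the same
kind of trick that keeps the product kernel busy:

```cpp
// Hand-written sketch, not Eigen's actual sum() code. One load feeds
// one add per element (no store in the loop), and the four partial
// sums are independent, so consecutive adds can overlap instead of
// waiting on a single accumulator.
float sum_unrolled(const float* x, int n)
{
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)   // tail
        s += x[i];
    return s;
}
```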
> Here is the performance-critical part of that sum() benchmark:
> addps (%ebx,%eax,4), %xmm1
> addl $4, %eax
> cmpl %eax, %edx
> jg .L18