Re: [eigen] optimization question

To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: Re: [eigen] optimization question
From: Gael Guennebaud <gael.guennebaud@xxxxxxxxx>
Date: Tue, 4 Oct 2011 11:36:04 +0200

As Benoit said, this example is memory bound: 3 memory accesses for 1
arithmetic operation. Explicit prefetching won't help. You will
probably gain a few % once we have loop peeling, which requires "meta
packets" containing multiple real packets. So not soon ;) Indeed,
naively unrolling the evaluation loop won't help here because the
compiler will still use a single register.
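
A rough back-of-the-envelope sketch of the memory-bound argument above, assuming 4-byte floats and counting only the two loads and one store per element. The helper names are hypothetical and not part of Eigen:

```cpp
#include <cstddef>

// Bytes moved per element of u[i] = v[i] * w[i]:
// two loads (v[i], w[i]) and one store (u[i]).
constexpr std::size_t bytes_per_element(std::size_t scalar_size) {
    return 3 * scalar_size;
}

// Arithmetic intensity in flops per byte: one multiply per element.
constexpr double flops_per_byte(std::size_t scalar_size) {
    return 1.0 / static_cast<double>(bytes_per_element(scalar_size));
}

// Total memory traffic for one pass over vectors of length n.
constexpr std::size_t bytes_per_pass(std::size_t n, std::size_t scalar_size) {
    return n * bytes_per_element(scalar_size);
}
```

For N = 32768 and 4-byte floats this gives 12 bytes of traffic per multiply and 393216 bytes (384 KiB) per pass over the three vectors, so the loop spends most of its time waiting on loads and stores rather than on the multiply.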

Gael.

On Tue, Oct 4, 2011 at 3:59 AM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> 2011/10/3 Michel <michel.pacilli@xxxxxxx>:
>> Hi,
>>
>> I'm not sure this is the right place to ask a user question; tell me if it isn't.
>>
>> I'm trying to get the best out of Eigen on a simple example, and I'm not sure
>> that I'm getting the most out of it:
>>
>> #define N 32768
>>
>> Matrix<float,N,1> u;
>
> Are you really sure that you want this? For such a large size, it is
> almost always a better idea to use a MatrixXf u(N).
>
>>
>> Matrix<float,N,1> v;
>>
>> Matrix<float,N,1> w;
>>
>> for(int k=0; k <NLOOP; ++k)
>>
>>    u = v.array() * w.array();
>>
>> compiled with gcc and the SSE2 flag
>>
>> Well, compared to a simple for loop over aligned arrays, I get around a 17%
>> speedup with Eigen ;)
>> But is it possible to give some hints at compile time to go further, with
>> unrolling, SSE3/SSE4, or other things?
>
> I don't think that newer sse versions bring anything useful here.
> Actually, fwiw, sse1 would already be enough for this particular use
> case!
>
>>
>> the asm of the product is:
>>
>>  # 86 "..\eigen\main.cpp" 1
>>       #it begins here!
>>  # 0 "" 2
>> /NO_APP
>>       xorl    %eax, %eax
>>       .p2align 4,,10
>> L3:
>>       movaps  (%esi,%eax,4), %xmm0
>>       mulps   (%ebx,%eax,4), %xmm0
>>       movaps  %xmm0, (%edx,%eax,4)
>>       addl    $4, %eax
>>       cmpl    $32768, %eax
>>       jne     L3
>> /APP
>>  # 88 "..\eigen\main.cpp" 1
>>       #it ends here!
>>
>> I wonder if it could be more efficient with more than just one xmm reg, or
>> with prefetching?
>
> I can only see 1 xmm register here, and given the very simple and
> predictable access pattern, there shouldn't be any reason to use
> explicit prefetch instructions.
>
> The only further optimization that I would consider here would be
> partial unrolling of this loop: try doing 2 or 4 iterations at a time.
>
> Benoit
>
>
>
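
A minimal sketch of the partial unrolling Benoit suggests, written as plain scalar C++ rather than SSE intrinsics so that it stays portable. The function name `multiply_unrolled4` is hypothetical and not part of Eigen. Whether this ends up using more than one xmm register, or runs any faster, depends on the compiler; since the loop is memory bound, any gain is likely to be small.

```cpp
#include <cstddef>

// Hypothetical helper (not Eigen API): computes u[i] = v[i] * w[i]
// for i in [0, n), handling four elements per loop iteration.
void multiply_unrolled4(const float* v, const float* w, float* u, std::size_t n) {
    std::size_t i = 0;

    // Main loop: four independent products per iteration, each held in
    // its own temporary so the compiler is free to use separate registers.
    for (; i + 4 <= n; i += 4) {
        const float p0 = v[i]     * w[i];
        const float p1 = v[i + 1] * w[i + 1];
        const float p2 = v[i + 2] * w[i + 2];
        const float p3 = v[i + 3] * w[i + 3];
        u[i]     = p0;
        u[i + 1] = p1;
        u[i + 2] = p2;
        u[i + 3] = p3;
    }

    // Tail loop: handle the remaining 0 to 3 elements one at a time.
    for (; i < n; ++i) {
        u[i] = v[i] * w[i];
    }
}
```

With N = 32768 the tail loop never runs, since N is a multiple of 4; it is kept so the function stays correct for arbitrary lengths.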
