Re: [eigen] optimization question


*To*: eigen@xxxxxxxxxxxxxxxxxxx
*Subject*: Re: [eigen] optimization question
*From*: Gael Guennebaud <gael.guennebaud@xxxxxxxxx>
*Date*: Tue, 4 Oct 2011 11:36:04 +0200

As Benoit said, this example is memory bound: 3 memory accesses for 1 arithmetic operation. Explicit prefetching won't help. You will probably gain a few % once we get loop peeling, which requires "meta packets" containing multiple real packets. So not soon ;) Indeed, naively unrolling the evaluation loop won't help here because the compiler will still use a single register.

Gael.

On Tue, Oct 4, 2011 at 3:59 AM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> 2011/10/3 Michel <michel.pacilli@xxxxxxx>:
>> Hi,
>>
>> not sure if it's the good place to ask user question... tell me if so.
>>
>> Well I try to get the best of eigen simple example, and I'm not sure that I
>> get the most :
>>
>> #define N 32768
>>
>> Matrix<float,N,1> u;
>
> Are you really sure that you want this? For such a large size, it is
> almost always a better idea to use a MatrixXf u(N).
>
>>
>> Matrix<float,N,1> v;
>>
>> Matrix<float,N,1> w;
>>
>> for(int k=0; k <NLOOP; ++k)
>>
>> u = v.array() * w.array();
>>
>> compile with gcc and sse2 flag
>>
>> Well, compare to a simple for loop and aligned array, I've got around 17%
>> speed up with eigen ;)
>> but, is it possible to give at compile time some hints to go further, with
>> unrolling, sse3,4? or other things?
>
> I don't think that newer sse versions bring anything useful here.
> Actually, fwiw, sse1 would already be enough for this particular use
> case!
>
>>
>> the asm of product is:
>>
>> # 86 "..\eigen\main.cpp" 1
>> #it begins here!
>> # 0 "" 2
>> /NO_APP
>> xorl %eax, %eax
>> .p2align 4,,10
>> L3:
>> movaps (%esi,%eax,4), %xmm0
>> mulps (%ebx,%eax,4), %xmm0
>> movaps %xmm0, (%edx,%eax,4)
>> addl $4, %eax
>> cmpl $32768, %eax
>> jne L3
>> /APP
>> # 88 "..\eigen\main.cpp" 1
>> #it ends here!
>>
>> I wonder if it could be more efficient with more than just one xmm reg, or
>> prefetch ?
>
> I can only see 1 xmm register here, and given the very simple and
> predictable access pattern, there shouldn't be any reason to use
> explicit prefetch instructions.
>
> The only further optimization that I would consider here, would be
> partial unrolling of this loop, try doing 2 or 4 iterations at a time.
>
> Benoit


