On Wed, Nov 18, 2009 at 6:43 PM, Gael Guennebaud
<gael.guennebaud@xxxxxxxxx> wrote:
Hi,
I've just played a bit with Hauke's nesting refactoring fork (https://bitbucket.org/hauke/nesting-refactoring/).
Let me recall that currently expressions are nested by reference, which enforces the use of NestByValue when a function has to return a nested expression. See for instance adjoint(), which returns Transpose<NestByValue<CwiseUnaryOp<ei_scalar_conjugate<Scalar>, Derived> > >. As you can see this is pretty annoying. In Hauke's fork, lightweight expressions (i.e., all but Matrix) are automatically nested by value, so there is no need for the NestByValue workaround.
So now the question is: what about performance? Well, I tried a very simple example:
Vector4f a, b, c, d;
c = a+b+c+d;
and here are the respective assembly codes generated by g++ 4.3.3 (-O2 -DNDEBUG):
** Nest by reference: **
movaps 112(%rsp), %xmm0
leaq 112(%rsp), %rax
leaq 80(%rsp), %rsi
movb $0, 48(%rsp)
movb $0, 16(%rsp)
addps 96(%rsp), %xmm0
movq %rax, 32(%rsp)
leaq 96(%rsp), %rax
movq %rsi, 8(%rsp)
movq %rax, 40(%rsp)
leaq 32(%rsp), %rax
movq %rax, (%rsp)
addps 80(%rsp), %xmm0
addps 64(%rsp), %xmm0
movaps %xmm0, 80(%rsp)
** Nest by value: **
movaps 208(%rsp), %xmm0
leaq 208(%rsp), %rcx
movq 80(%rsp), %rax
movb $0, 104(%rsp)
leaq 192(%rsp), %rdx
addps 192(%rsp), %xmm0
leaq 176(%rsp), %rsi
movq %rcx, 128(%rsp)
movq %rdx, 136(%rsp)
movq %rax, 8(%rsp)
movq 104(%rsp), %rax
movb $0, 144(%rsp)
movq %rcx, 88(%rsp)
movq %rdx, 96(%rsp)
movq %rsi, 112(%rsp)
movb $0, 120(%rsp)
movq %rcx, 16(%rsp)
movq %rdx, 24(%rsp)
movq %rax, 32(%rsp)
addps 176(%rsp), %xmm0
movq %rsi, 40(%rsp)
movb $0, 48(%rsp)
addps 160(%rsp), %xmm0
movaps %xmm0, 176(%rsp)
So clearly, gcc has a lot of difficulty optimizing this simple code. In both cases we can see a lot of useless stack-to-stack copies, but the situation with nesting by value is much, much worse, unfortunately.
If I have time I'll do more experiments.
Gael.