Re: [eigen] Nesting by reference or by value?

I have now run further tests with:

a) Matrix * Matrix product
b) Scalar * Matrix product

For those tests, the generated assembly is identical. For the remaining tests, where I worked with Matrix2Xd, Matrix2d and Vector2d, I only investigated the run times: the assembly is not completely inlined and is rather long, which makes a comparison difficult. The run times for those tests seem to be slightly better for the version using nesting by value.

Three runs led to the following results (the time difference is not random but repeatable):

nest by value: 21065.2 ms, 20945.9 ms, 21248.7 ms
nest by ref.:  21740.7 ms, 21920.5 ms, 21573.6 ms
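
To put a number on "slightly better": averaging the three runs gives roughly 21086.6 ms for nesting by value and 21744.9 ms for nesting by reference, i.e. by-reference is about 3.1% slower, and the slowest by-value run is still faster than the fastest by-reference run. The arithmetic can be sketched as below; the helper names mean and slowdown_percent are made up for this illustration and are not part of the test base.

```cpp
#include <numeric>
#include <vector>

// Arithmetic mean of a set of run times (in ms).
// Hypothetical helper, not part of the attached test base.
double mean(const std::vector<double>& xs) {
    return std::accumulate(xs.begin(), xs.end(), 0.0) /
           static_cast<double>(xs.size());
}

// How much slower `slow` is than `fast`, in percent of `fast`.
double slowdown_percent(double fast, double slow) {
    return 100.0 * (slow - fast) / fast;
}
```

Called with the two rows above, slowdown_percent(mean(by_value), mean(by_ref)) comes out at a little over 3%.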

I currently assume that the same will hold for VC10; those tests are still pending.

- Hauke

On Fri, Nov 20, 2009 at 11:21 AM, Hauke Heibel <hauke.heibel@xxxxxxxxxxxxxx> wrote:
I did the Vector addition test over here and the results are as follows:

I) The old version

    EIGEN_DONT_INLINE static void run(int num_runs)
0000000140001330  sub         rsp,28h
        VectorType a,b,c,d;

        c = a+b+c+d;
0000000140001334  lea         rax,[rsp]
0000000140001338  mov         qword ptr [rsp+8],rax
000000014000133D  lea         rax,[rsp]
0000000140001341  mov         qword ptr [rsp+10h],rax
0000000140001346  lea         rax,[rsp]
000000014000134A  mov         qword ptr [rsp+8],rax
000000014000134F  lea         rax,[rsp]
0000000140001353  mov         qword ptr [rsp+10h],rax
0000000140001358  movaps      xmm0,xmmword ptr [rsp]
000000014000135C  addps       xmm0,xmmword ptr [rsp]
0000000140001360  addps       xmm0,xmmword ptr [rsp]
0000000140001364  addps       xmm0,xmmword ptr [rsp]
0000000140001368  movaps      xmmword ptr [rsp],xmm0
000000014000136C  add         rsp,28h
0000000140001370  ret 

II) The new version

    EIGEN_DONT_INLINE static void run(int num_runs)
0000000140001330  sub         rsp,18h
        VectorType a,b,c,d;

        c = a+b+c+d;
0000000140001334  lea         rax,[rsp]
0000000140001338  lea         rcx,[rsp]
000000014000133C  movaps      xmm0,xmmword ptr [rax]
000000014000133F  addps       xmm0,xmmword ptr [rcx]
0000000140001342  addps       xmm0,xmmword ptr [rsp]
0000000140001346  addps       xmm0,xmmword ptr [rsp]
000000014000134A  movaps      xmmword ptr [rsp],xmm0
000000014000134E  add         rsp,18h
0000000140001352  ret 

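My test base is attached. As you can see, nesting by value seems to improve the code a lot: the redundant lea/mov pairs that store operand addresses on the stack disappear, and the stack frame shrinks from 28h to 18h bytes. So far I have only tested with VC9; I might try VC10 too.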

Actually, I think I might update the wiki to show how to properly isolate assembly code with Visual Studio.
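
For reference, the general idea behind isolating the code is to put the expression into a function the compiler is not allowed to inline, so that its assembly shows up as one self-contained block. Below is a minimal sketch without Eigen; the macro SKETCH_DONT_INLINE, the struct Vec4 and the function run_sum are made up for this illustration (in the real test, EIGEN_DONT_INLINE plays the role of the macro), and the operands are passed in as arguments, which differs from the listings above.

```cpp
// Portable "do not inline" marker (hypothetical name for this sketch).
#if defined(_MSC_VER)
#  define SKETCH_DONT_INLINE __declspec(noinline)
#else
#  define SKETCH_DONT_INLINE __attribute__((noinline))
#endif

// Plain 4-float vector standing in for Vector4f.
struct Vec4 {
    float v[4];
};

// Computes c = a + b + c + d. Because the function is never inlined,
// its body appears as a single isolated block in the assembly output.
SKETCH_DONT_INLINE void run_sum(const Vec4& a, const Vec4& b,
                                Vec4& c, const Vec4& d) {
    for (int i = 0; i < 4; ++i) {
        c.v[i] = a.v[i] + b.v[i] + c.v[i] + d.v[i];
    }
}
```

With gcc, compiling with -S writes the assembly to a file; with Visual C++, /FA produces an assembly listing. Searching that output for run_sum then leads straight to the code of interest.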

- Hauke

On Wed, Nov 18, 2009 at 6:43 PM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:


I've just played a bit with Hauke's nesting refactoring fork (

Let me recall that expressions are currently nested by reference, which forces the use of NestByValue whenever a function has to return a nested expression. See for instance adjoint(), which returns Transpose<NestByValue<CwiseUnaryOp<ei_scalar_conjugate<Scalar>, Derived> > >. As you can see, this is pretty annoying. In Hauke's fork, lightweight expressions (i.e., all but Matrix) are automatically nested by value, so there is no need for the NestByValue workaround.
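
To illustrate why returning a nested expression is a problem with by-reference nesting, here is a toy expression-template sketch. It is not Eigen code: the namespace toy and the names Vec4, nested, Sum, sum3 and eval are all made up for this example, and the real fork may implement the policy differently. The plain vector is stored by reference, while expression nodes are stored by value.

```cpp
#include <cstddef>

namespace toy {

// Stand-in for a plain, "heavy" object such as Matrix:
// it is never copied into an expression.
struct Vec4 {
    float data[4];
    float operator[](std::size_t i) const { return data[i]; }
};

// Hypothetical nesting policy: expression nodes are stored by value...
template <typename T> struct nested { typedef T type; };
// ...while plain vectors are stored by const reference.
template <> struct nested<Vec4> { typedef const Vec4& type; };

// Lazy sum node; nothing is computed until operator[] is called.
template <typename Lhs, typename Rhs>
struct Sum {
    typename nested<Lhs>::type lhs;
    typename nested<Rhs>::type rhs;
    float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

template <typename Lhs, typename Rhs>
Sum<Lhs, Rhs> operator+(const Lhs& l, const Rhs& r) {
    return Sum<Lhs, Rhs>{l, r};
}

// Returns an expression that contains the temporary (a + b).
// This is safe only because the inner Sum is copied into the outer one;
// had it been stored by reference, the result would refer to a
// temporary that is destroyed when sum3 returns.
inline Sum<Sum<Vec4, Vec4>, Vec4> sum3(const Vec4& a, const Vec4& b,
                                       const Vec4& c) {
    return (a + b) + c;
}

// Evaluates any expression into a plain vector.
template <typename Expr>
Vec4 eval(const Expr& e) {
    Vec4 r;
    for (std::size_t i = 0; i < 4; ++i) r.data[i] = e[i];
    return r;
}

}  // namespace toy
```

In this sketch sum3 plays the role of a function like adjoint(): with the policy above it needs no wrapper, whereas with every operand held by reference the inner node would have to be wrapped explicitly, which is the job NestByValue does today.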

So now the question is: what about performance? Well, I tried a very simple example:

Vector4f a, b, c, d;
c = a+b+c+d;

and here are the respective assembly codes generated by g++ 4.3.3 (-O2 -DNDEBUG):

** Nest by reference: **

    movaps    112(%rsp), %xmm0
    leaq    112(%rsp), %rax
    leaq    80(%rsp), %rsi
    movb    $0, 48(%rsp)
    movb    $0, 16(%rsp)
    addps    96(%rsp), %xmm0
    movq    %rax, 32(%rsp)
    leaq    96(%rsp), %rax
    movq    %rsi, 8(%rsp)
    movq    %rax, 40(%rsp)
    leaq    32(%rsp), %rax
    movq    %rax, (%rsp)
    addps    80(%rsp), %xmm0
    addps    64(%rsp), %xmm0
    movaps    %xmm0, 80(%rsp)

** Nest by value: **

    movaps    208(%rsp), %xmm0
    leaq    208(%rsp), %rcx
    movq    80(%rsp), %rax
    movb    $0, 104(%rsp)
    leaq    192(%rsp), %rdx
    addps    192(%rsp), %xmm0
    leaq    176(%rsp), %rsi
    movq    %rcx, 128(%rsp)
    movq    %rdx, 136(%rsp)
    movq    %rax, 8(%rsp)
    movq    104(%rsp), %rax
    movb    $0, 144(%rsp)
    movq    %rcx, 88(%rsp)
    movq    %rdx, 96(%rsp)
    movq    %rsi, 112(%rsp)
    movb    $0, 120(%rsp)
    movq    %rcx, 16(%rsp)
    movq    %rdx, 24(%rsp)
    movq    %rax, 32(%rsp)
    addps    176(%rsp), %xmm0
    movq    %rsi, 40(%rsp)
    movb    $0, 48(%rsp)
    addps    160(%rsp), %xmm0
    movaps    %xmm0, 176(%rsp)

So clearly, gcc has a lot of difficulty optimizing this simple code. In both cases we can see a lot of useless stack-to-stack copies, but unfortunately the situation with nesting by value is much, much worse.

If I have time I'll do more experiments.


