Re: [eigen] Nesting by reference or by value?
I did the Vector addition test over here and the results are as follows:
I) The old version
EIGEN_DONT_INLINE static void run(int num_runs)
{
0000000140001330 sub rsp,28h
VectorType a,b,c,d;
0000000140001334 lea rax,[rsp]
c = a+b+c+d;
0000000140001338 mov qword ptr [rsp+8],rax
000000014000133D lea rax,[rsp]
0000000140001341 mov qword ptr [rsp+10h],rax
0000000140001346 lea rax,[rsp]
000000014000134A mov qword ptr [rsp+8],rax
000000014000134F lea rax,[rsp]
0000000140001353 mov qword ptr [rsp+10h],rax
0000000140001358 movaps xmm0,xmmword ptr [rsp]
000000014000135C addps xmm0,xmmword ptr [rsp]
0000000140001360 addps xmm0,xmmword ptr [rsp]
0000000140001364 addps xmm0,xmmword ptr [rsp]
0000000140001368 movaps xmmword ptr [rsp],xmm0
}
000000014000136C add rsp,28h
0000000140001370 ret
II) The new version
EIGEN_DONT_INLINE static void run(int num_runs)
{
0000000140001330 sub rsp,18h
VectorType a,b,c,d;
0000000140001334 lea rax,[rsp]
c = a+b+c+d;
0000000140001338 lea rcx,[rsp]
000000014000133C movaps xmm0,xmmword ptr [rax]
000000014000133F addps xmm0,xmmword ptr [rcx]
0000000140001342 addps xmm0,xmmword ptr [rsp]
0000000140001346 addps xmm0,xmmword ptr [rsp]
000000014000134A movaps xmmword ptr [rsp],xmm0
}
000000014000134E add rsp,18h
0000000140001352 ret
My test base is attached. You can see that nesting by value seems to improve the code a lot - so far I have tested with VC9. I might try VC10 too.
Actually, I think I might update the wiki to show how to properly isolate assembly code with Visual Studio.
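A minimal sketch of such a harness, assuming the same pattern as the listings above (the Func struct and the main() driver here are illustrative, not the contents of the attached test base):

#include <Eigen/Core>
using namespace Eigen;

typedef Vector4f VectorType;

struct Func
{
    // EIGEN_DONT_INLINE keeps the code under test in its own function,
    // so its instructions are easy to locate in the disassembly window.
    EIGEN_DONT_INLINE static void run(int num_runs)
    {
        VectorType a, b, c, d;
        c = a + b + c + d;
        (void)num_runs; // mirrors the signature used in the listings above
    }
};

int main(int argc, char**)
{
    Func::run(argc); // argument unknown at compile time, so the call is kept
    return 0;
}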
- Hauke
On Wed, Nov 18, 2009 at 6:43 PM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:
Hi,
I've just played a bit with Hauke's nesting refactoring fork (https://bitbucket.org/hauke/nesting-refactoring/).
Let me recall that currently expressions are nested by reference, which enforces the use of NestByValue when a function has to return a nested expression. See for instance adjoint(), which returns Transpose<NestByValue<CwiseUnaryOp<ei_scalar_conjugate<Scalar>, Derived> > >. As you can see this is pretty annoying. In Hauke's fork lightweight expressions (i.e., all but Matrix) are automatically nested by value. So there is no need for the NestByValue workaround.
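To make the dangling-reference issue concrete, here is a toy model of the two schemes (the names Vec, Conj, Trans and adjoint_like are made up for illustration; they are not the actual Eigen classes):

struct Vec { float data[4]; float coeff(int i) const { return data[i]; } };

// Heavy objects (Matrix/Vector) stay nested by reference in both schemes;
// the user's object outlives the whole statement, so this is always safe.
struct Conj
{
    const Vec& arg;
    explicit Conj(const Vec& a) : arg(a) {}
    float coeff(int i) const { return arg.coeff(i); } // conjugation is a no-op for float
};

// The question is how an expression stores its *expression* operand.
template<typename Nested> struct Trans
{
    Nested arg;            // nest by value: the returned expression is self-contained
    // const Nested& arg;  // nest by reference: would dangle in adjoint_like() below
    explicit Trans(const Nested& a) : arg(a) {}
    float coeff(int i) const { return arg.coeff(i); }
};

// With nesting by reference, the Conj temporary built here dies when the
// function returns, which is why adjoint() has to return
// Transpose<NestByValue<CwiseUnaryOp<...> > >. With nesting by value, the
// plain Trans<Conj> is already safe to return.
Trans<Conj> adjoint_like(const Vec& v)
{
    return Trans<Conj>(Conj(v));
}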
So now the question is: what about performance? Well, I tried a very simple example:
Vector4f a, b, c, d;
c = a+b+c+d;
and here is the respective assembly generated by g++ 4.3.3 (-O2 -DNDEBUG):
** Nest by reference: **
movaps 112(%rsp), %xmm0
leaq 112(%rsp), %rax
leaq 80(%rsp), %rsi
movb $0, 48(%rsp)
movb $0, 16(%rsp)
addps 96(%rsp), %xmm0
movq %rax, 32(%rsp)
leaq 96(%rsp), %rax
movq %rsi, 8(%rsp)
movq %rax, 40(%rsp)
leaq 32(%rsp), %rax
movq %rax, (%rsp)
addps 80(%rsp), %xmm0
addps 64(%rsp), %xmm0
movaps %xmm0, 80(%rsp)
** Nest by value: **
movaps 208(%rsp), %xmm0
leaq 208(%rsp), %rcx
movq 80(%rsp), %rax
movb $0, 104(%rsp)
leaq 192(%rsp), %rdx
addps 192(%rsp), %xmm0
leaq 176(%rsp), %rsi
movq %rcx, 128(%rsp)
movq %rdx, 136(%rsp)
movq %rax, 8(%rsp)
movq 104(%rsp), %rax
movb $0, 144(%rsp)
movq %rcx, 88(%rsp)
movq %rdx, 96(%rsp)
movq %rsi, 112(%rsp)
movb $0, 120(%rsp)
movq %rcx, 16(%rsp)
movq %rdx, 24(%rsp)
movq %rax, 32(%rsp)
addps 176(%rsp), %xmm0
movq %rsi, 40(%rsp)
movb $0, 48(%rsp)
addps 160(%rsp), %xmm0
movaps %xmm0, 176(%rsp)
So clearly, gcc has a lot of difficulty optimizing this simple code. In both cases we can see a lot of useless stack-to-stack copies, but the situation with nesting by value is much, much worse, unfortunately.
If I have time I'll do more experiments.
Gael.
Attachment: main.cpp