Re: [eigen] optimization question
• To: eigen@xxxxxxxxxxxxxxxxxxx
• From: Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
• Date: Mon, 3 Oct 2011 21:59:06 -0400

```
2011/10/3 Michel <michel.pacilli@xxxxxxx>:
> Hi,
>
> not sure if this is the right place to ask a user question... tell me if not.
>
> Well, I am trying to get the best out of Eigen on a simple example, and I'm
> not sure that I am getting the most out of it:
>
> #define N 32768
>
> Matrix<float,N,1> u;

Are you really sure that you want this? For such a large size, it is
almost always a better idea to use a dynamic-size vector, i.e. a
VectorXf u(N).

>
> Matrix<float,N,1> v;
>
> Matrix<float,N,1> w;
>
> for(int k=0; k <NLOOP; ++k)
>
>    u = v.array() * w.array();
>
> compiled with gcc and the sse2 flag
>
> Well, compared to a simple for loop over aligned arrays, I get around a 17%
> speed-up with Eigen ;)
> But is it possible to give some hints at compile time to go further, with
> unrolling, sse3/sse4, or other things?

I don't think that newer sse versions bring anything useful here.
Actually, fwiw, sse1 would already be enough for this particular use
case!

>
> the asm of product is:
>
>  # 86 "..\eigen\main.cpp" 1
> 	#it begins here!
>  # 0 "" 2
> /NO_APP
> 	xorl	%eax, %eax
> 	.p2align 4,,10
> L3:
> 	movaps	(%esi,%eax,4), %xmm0
> 	mulps	(%ebx,%eax,4), %xmm0
> 	movaps	%xmm0, (%edx,%eax,4)
> 	addl	\$4, %eax
> 	cmpl	\$32768, %eax
> 	jne	L3
> /APP
>  # 88 "..\eigen\main.cpp" 1
> 	#it ends here!
>
> I wonder if it could be more efficient with more than just one xmm reg, or
> prefetch ?

I can only see 1 xmm register here, and given the very simple and
predictable access pattern, there shouldn't be any reason to use
explicit prefetch instructions.

The only further optimization that I would consider here is partial
unrolling of this loop: try doing 2 or 4 iterations at a time.

Benoit

```
