2011/10/3 Michel <michel.pacilli@xxxxxxx>:
> Hi,
>
> not sure if it's the good place to ask user question... tell me if so.
>
> Well I try to get the best of eigen simple example, and I'm not sure that I
> get the most :
>
> #define N 32768
>
> Matrix<float,N,1> u;

Are you really sure that you want this? For such a large size, it is
almost always a better idea to use a MatrixXf u(N).

> Matrix<float,N,1> v;
> Matrix<float,N,1> w;
>
> for(int k=0; k <NLOOP; ++k)
>   u = v.array() * w.array();
>
> compile with gcc and sse2 flag
>
> Well, compare to a simple for loop and aligned array, I've got around 17%
> speed up with eigen ;)
> but, is it possible to give at compile time some hints to go further, with
> unrolling, sse3,4? or other things?

I don't think that newer sse versions bring anything useful here.
Actually, fwiw, sse1 would already be enough for this particular use
case!

> the asm of product is:
>
> # 86 "..\eigen\main.cpp" 1
> #it begins here!
> # 0 "" 2
> /NO_APP
>     xorl %eax, %eax
>     .p2align 4,,10
> L3:
>     movaps (%esi,%eax,4), %xmm0
>     mulps (%ebx,%eax,4), %xmm0
>     movaps %xmm0, (%edx,%eax,4)
>     addl $4, %eax
>     cmpl $32768, %eax
>     jne L3
> /APP
> # 88 "..\eigen\main.cpp" 1
> #it ends here!
>
> I wonder if it could be more efficient with more than just one xmm reg, or
> prefetch ?

I can only see 1 xmm register here, and given the very simple and
predictable access pattern, there shouldn't be any reason to use
explicit prefetch instructions.

The only further optimization that I would consider here, would be
partial unrolling of this loop, try doing 2 or 4 iterations at a time.

Benoit
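
For reference, the partial unrolling mentioned in this thread could look roughly like the sketch below. It is only an illustration in plain C++ on std::vector<float>, not Eigen's implementation and not what Eigen generates. The function names cwise_product_simple and cwise_product_unrolled4 are made up for this example, and the unroll factor of 4 is just one of the "2 or 4" candidates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference version: one element-wise product per loop iteration.
// (Hypothetical helper name, used only for this illustration.)
void cwise_product_simple(const std::vector<float>& v,
                          const std::vector<float>& w,
                          std::vector<float>& u) {
    assert(v.size() == w.size() && u.size() == v.size());
    for (std::size_t i = 0; i < v.size(); ++i) {
        u[i] = v[i] * w[i];
    }
}

// Same computation with the loop body repeated 4 times per iteration,
// so the counter increment and the loop branch are paid once per
// 4 elements instead of once per element.
// (Hypothetical helper name, used only for this illustration.)
void cwise_product_unrolled4(const std::vector<float>& v,
                             const std::vector<float>& w,
                             std::vector<float>& u) {
    assert(v.size() == w.size() && u.size() == v.size());
    const std::size_t n = v.size();
    const std::size_t n4 = n - (n % 4);  // largest multiple of 4 <= n

    std::size_t i = 0;
    for (; i < n4; i += 4) {
        u[i]     = v[i]     * w[i];
        u[i + 1] = v[i + 1] * w[i + 1];
        u[i + 2] = v[i + 2] * w[i + 2];
        u[i + 3] = v[i + 3] * w[i + 3];
    }
    // Remainder loop for sizes that are not a multiple of 4.
    for (; i < n; ++i) {
        u[i] = v[i] * w[i];
    }
}
```

Both functions compute the same result; only the loop structure differs. Whether the unrolled form is actually faster depends on the compiler and the hardware, and an optimizing compiler may already unroll or vectorize the simple loop by itself, so it is worth measuring both before keeping the longer version.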

