I forgot to attach my modified benchmark.

On Sat, Jan 17, 2009 at 10:43 AM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:
> hm for a vector of size 64 your result does not make sense. Actually I
> checked the generated assembly and your benchmark has several issues
> such that you are not really benchmarking sum. First, the first
> expression (v = one + v * 1e-10) was not properly inlined when the
> vectorization was enabled. Second, this first expression is more
> costly than sum. Third, you call sum twice on (almost) the same data,
> and so perhaps the compiler manages to remove most of the computation
> of the second call to sum only when the vectorization is disabled.
>
> So I rewrote the benchmark, see attached file, and now I get a slight
> speed up for double/64 and more than x2 for float/64. The generated
> assembly is pretty good in both cases, so why does the vectorization
> not lead to a higher speed up?
>
> Actually, the non vectorized meta-unroller of sum is much more clever than
> the vectorized one because it reduces the dependency between the
> instructions using a recursive divide and conquer strategy while the
> vectorized one simply accumulates the coeffs in a single register. See:
>
>   sum = a;
>   t1 = c;
>   sum += b;
>   t1 += d;
>   sum += t1;
>
> versus:
>
>   sum = a;
>   sum += b;
>   sum += c;
>   sum += d;
>
> Another possible issue was that the vectorized unroller looped over the
> coeffs in the wrong order (d,c,b,a instead of a,b,c,d).
>
> Anyway, I rewrote the vectorized meta unroller to use the same good
> strategy, and now I get a significant speed up for double/64: x1.7
> faster! and for float/64 I get x2.75, not bad.
>
> For very small sizes, it is clear that at least for Vector2d it does
> not make sense to vectorize it. For float, let's check.
>
> However, the reason why I did that change recently is that if we don't
> vectorize Vector4f::sum() then we might lose some other vectorization,
> ex:
>
>   (0.5*a+0.25*b+0.25*c).sum();
>
> If we disable vectorization of sum for small sizes, then what we have
> to do in Sum.h is to automatically insert an evaluation:
>
>   (0.5*a+0.25*b+0.25*c).eval().sum();
>
> Should be easy to do.
>
>
> Gael.
>
> 2009/1/17 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>> Act 1, 4th and last scene:
>>
>> i wanted to see for which size it starts getting beneficial to vectorize...
>>
>> attached is one more example (sum.cpp) this time with size 64, type
>> double. unrolling forced.
>>
>> Result: x87 is still 50% faster than SSE2.
>>
>> The SSE2 asm is:
>>
>> #APP
>> # 12 "dot.cpp" 1
>>         #a
>> # 0 "" 2
>> #NO_APP
>>         movapd  (%esi), %xmm1
>>         movapd  -104(%ebp), %xmm0
>>         addpd   -88(%ebp), %xmm0
>>         addpd   -120(%ebp), %xmm0
>>         addpd   -136(%ebp), %xmm0
>>         addpd   -152(%ebp), %xmm0
>>         addpd   -168(%ebp), %xmm0
>>         addpd   -184(%ebp), %xmm0
>>         addpd   -200(%ebp), %xmm0
>>         addpd   -216(%ebp), %xmm0
>>         addpd   -232(%ebp), %xmm0
>>         addpd   -248(%ebp), %xmm0
>>         addpd   -264(%ebp), %xmm0
>>         addpd   -280(%ebp), %xmm0
>>         addpd   -296(%ebp), %xmm0
>>         addpd   -312(%ebp), %xmm0
>>         addpd   -328(%ebp), %xmm0
>>         addpd   -344(%ebp), %xmm0
>>         addpd   -360(%ebp), %xmm0
>>         addpd   -376(%ebp), %xmm0
>>         addpd   -392(%ebp), %xmm0
>>         addpd   -408(%ebp), %xmm0
>>         addpd   -424(%ebp), %xmm0
>>         addpd   -440(%ebp), %xmm0
>>         addpd   -456(%ebp), %xmm0
>>         addpd   -472(%ebp), %xmm0
>>         addpd   -488(%ebp), %xmm0
>>         addpd   -504(%ebp), %xmm0
>>         addpd   -520(%ebp), %xmm0
>>         addpd   -536(%ebp), %xmm0
>>         addpd   -552(%ebp), %xmm0
>>         addpd   -568(%ebp), %xmm0
>>         addpd   %xmm0, %xmm1
>>         movapd  %xmm1, %xmm2
>>         unpckhpd        %xmm1, %xmm2
>>         addsd   %xmm2, %xmm1
>>         movapd  %xmm1, %xmm2
>>         movsd   %xmm2, -584(%ebp)
>> #APP
>> # 14 "dot.cpp" 1
>>         #b
>> # 0 "" 2
>> #NO_APP
>>
>>
>>
>> 2009/1/16 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>> and in case you wonder: the situation is almost the same for dot
>>> product
>>>
>>> (attached file dot.cpp)
>>>
>>> runs 2x slower with SSE...
>>>
>>> Cheers,
>>> Benoit
>>>
>>> 2009/1/16 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>>> Note: i think the old heuristic was wrong anyway.
>>>>
>>>> Maybe take this occasion to introduce a EIGEN_COST_OF_PREDUX (since
>>>> this cost depends greatly on the simd platform) ?
>>>> And then do a natural heuristic rather than a quick hack like we used to have?
>>>>
>>>> Benoit
>>>>
>>>> 2009/1/16 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>>>> Hi Gael *cough* List,
>>>>>
>>>>> ei_predux is costly because it consists of >1 SIMD instruction.
>>>>>
>>>>> So until recently we had sum() only vectorize if the size was big enough.
>>>>> However this was recently changed.
>>>>>
>>>>> Attached is a benchmark that runs 2.5x slower with SSE (2 or 3) than
>>>>> without. It's just Vector2d::sum().
>>>>>
>>>>> So, revert to old behavior?
>>>>>
>>>>> Moreover: matrix product innerVectorization also uses a ei_predux. Same here?
>>>>>
>>>>> Cheers,
>>>>> Benoit
>>>>>
>>>>
>>>
>>
>

#define EIGEN_UNROLLING_LIMIT 10000
#include <Eigen/Core>
#include <vector>

typedef Eigen::Matrix<double, 64, 1> T;

EIGEN_DONT_INLINE void foo(std::vector<T>& vs);

int main()
{
  T v;
  int n = 500/v.size();
  std::vector<T> vs(n);
  for (int i=0; i<n; ++i)
    vs[i].setZero();
  for(int i = 0; i < 10000000; i++)
  {
    foo(vs);
  }
  return 0;
}

EIGEN_DONT_INLINE void foo(std::vector<T>& vs)
{
  for (int i=0; i<vs.size(); ++i)
    vs[i][0] = vs[i].sum();
}
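
For reference, the two accumulation orders discussed in the thread can be sketched in plain C++ without Eigen. This is only an illustration under my own assumptions: the names sum_linear, PairwiseSum and sum_pairwise are hypothetical and this is not the actual unroller code in Sum.h. The first function accumulates into a single variable, so every add waits on the previous one. The second splits the range in halves at compile time, so the two partial sums are independent and can overlap in the pipeline.

```cpp
#include <cstddef>

// Linear accumulation: each addition depends on the previous result,
// so the adds form one dependency chain of length N-1.
template <std::size_t N>
double sum_linear(const double (&x)[N]) {
    double sum = x[0];
    for (std::size_t i = 1; i < N; ++i) sum += x[i];
    return sum;
}

// Divide and conquer, expanded at compile time (hypothetical helper):
// sums x[Start .. Start+Length) by adding the sums of its two halves.
template <std::size_t Start, std::size_t Length>
struct PairwiseSum {
    template <std::size_t N>
    static double run(const double (&x)[N]) {
        static_assert(Start + Length <= N, "range out of bounds");
        return PairwiseSum<Start, Length / 2>::run(x)
             + PairwiseSum<Start + Length / 2, Length - Length / 2>::run(x);
    }
};

// Base case: a range of length 1 is just the coefficient itself.
template <std::size_t Start>
struct PairwiseSum<Start, 1> {
    template <std::size_t N>
    static double run(const double (&x)[N]) { return x[Start]; }
};

// Entry point: pairwise sum over the whole array.
template <std::size_t N>
double sum_pairwise(const double (&x)[N]) {
    return PairwiseSum<0, N>::run(x);
}
```

Calling sum_pairwise on a 4-element array expands to (a + b) + (c + d), which matches the sum/t1 pattern quoted above. Note that for general floating point input the two orders may differ in the last bits, since the additions are associated differently.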

