2009/11/24 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
> in the SSE4 version you have 2 unnecessary moves, one useless load, and one
> useless store. That's the main reason. Now why GCC  does not optimize them
> away, well I've no clue...

OK, I tried a different benchmark. This time there's an addition, and
it can't keep everything in registers:

#include <Eigen/Dense>
using namespace Eigen;
using namespace std;

EIGEN_DONT_INLINE int foo(VectorXi& w)
{
  VectorXi v = VectorXi::Random(1000);
  v += (v.cwise()*v).cwise()*w;
  return v(ei_random<int>(0,999));
}

int main()
{
  VectorXi w = VectorXi::Random(1000);
  for(int i = 0; i<100000; i++) foo(w);
}
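
For reference, here is a minimal sketch of what that expression computes per coefficient, written as a plain scalar loop without Eigen. The name scalar_update and the use of uint32_t (so that overflow wraps rather than being undefined) are assumptions made for illustration; this is not the code Eigen or GCC actually generates:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of Eigen: a plain scalar loop computing the
// same coefficient-wise update as  v += (v.cwise()*v).cwise()*w.
// Arithmetic is done in uint32_t so that overflow wraps around instead of
// being undefined behaviour as it would be for signed int.
void scalar_update(std::vector<std::uint32_t>& v,
                   const std::vector<std::uint32_t>& w)
{
  for (std::size_t i = 0; i < v.size(); ++i)
    v[i] += v[i] * v[i] * w[i];  // two multiplies and one add per coefficient
}
```

Each iteration reads one coefficient of v and one of w, and writes one coefficient of v back, so with 1000 coefficients the data has to go through memory rather than stay in registers.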

Non-vectorized:    1.91 s
SSE4.1:            2.41 s

so this time the SSE4.1 version takes 26% longer (2.41 / 1.91 = 1.26)...

Cheers to Intel's marketing dept.

