Re: [eigen] Re: small sums: vectorization not worth it


2009/1/17 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
> Hm, for a vector of size 64 your result does not make sense. Actually I
> checked the generated assembly and your benchmark has several issues
> such that you are not really benchmarking sum. First, the first
> expression (v = one + v * 1e-10)  was not properly inlined when the
> vectorization was enabled. Second, this first expression is more
> costly than sum. Third, you call sum twice on (almost) the same data,
> and so perhaps the compiler manages to remove most of the computation
> of the second call to sum only when the vectorization is disabled.
> So I rewrote the benchmark, see attached file, and now I get a slight
> speed up for double/64 and more than x2 for float/64. The generated
> assembly is pretty good in both cases, so why does the vectorization
> not lead to higher speed up ?
> Actually, the non-vectorized meta-unroller of sum is much cleverer than
> the vectorized one because it reduces the dependency between the
> instructions using a recursive divide and conquer strategy while the
> vectorized one simply accumulates the coeff in a single register.

I didn't realize it was so important.
So should we have a similar strategy in the product innervec ?
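
To make the difference concrete, here is a minimal sketch of the two
unrolling strategies, written from the description above and not copied
from the Eigen sources. The names linear_sum and tree_sum are
hypothetical, and plain floats stand in for packets. The linear version
forms a chain of N-1 dependent additions, while the divide-and-conquer
version splits the range in half so that the two partial sums are
independent of each other.

```cpp
#include <cstddef>

// Linear unroller (hypothetical name): each addition needs the result of
// the previous one, so the dependency chain has length N-1.
template <std::size_t Start, std::size_t Length>
struct linear_sum {
    static float run(const float* v) {
        return linear_sum<Start, Length - 1>::run(v) + v[Start + Length - 1];
    }
};

template <std::size_t Start>
struct linear_sum<Start, 1> {
    static float run(const float* v) { return v[Start]; }
};

// Divide-and-conquer unroller (hypothetical name): the two halves do not
// depend on each other, so the dependency depth is about log2(N).
template <std::size_t Start, std::size_t Length>
struct tree_sum {
    static const std::size_t Half = Length / 2;
    static float run(const float* v) {
        return tree_sum<Start, Half>::run(v)
             + tree_sum<Start + Half, Length - Half>::run(v);
    }
};

template <std::size_t Start>
struct tree_sum<Start, 1> {
    static float run(const float* v) { return v[Start]; }
};
```

Both are called the same way, e.g. tree_sum<0, 8>::run(data) for an
array of 8 floats. Note that the two versions add in a different order,
so for general floating-point input the results can differ in the last
bits.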

> Another possible issue was that the vectorized unroller looped over the
> coeffs in the wrong order (d,c,b,a instead of a,b,c,d).
> Anyway, I rewrote the vectorized meta-unroller to use the same good
> strategy, and now I get a significant speed up for double/64: x1.7
> faster ! and for float/64 I get x2.75, not bad.

Wow, thanks a lot!

> For very small sizes, it is clear that at least for Vector2d it does
> not make sense to vectorize. For float, let's check.
> However, the reason why I made that change recently is that if we don't
> vectorize Vector4f::sum() then we might lose some other vectorization,
> e.g.:
> (0.5*a+0.25*b+0.25*c).sum();
> If we disable vectorization of sum for small sizes, then what we have
> to do in Sum.h, is to automatically insert an evaluation:
> (0.5*a+0.25*b+0.25*c).eval().sum();
> Should be easy to do.

I don't understand.
Wouldn't it be easy, and much better, to add a more intelligent
heuristic based on the expression's cost and size, the packet size,
the cost of adding scalars, and perhaps on
ei_cost_of_predux<Scalar>::ret that we may need to introduce?

So with your example, this sum would still be vectorized because the
xpr is costly enough.
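
Roughly, the kind of heuristic I have in mind could look like the
sketch below. Everything in it is an assumption for illustration: the
struct SumCostModel, the function names, and the cost numbers are made
up, and predux_cost stands for the value that a (not yet existing)
ei_cost_of_predux<Scalar>::ret would provide. It also assumes that a
packet operation costs the same as the scalar one and that the size is
a multiple of the packet size.

```cpp
// Hypothetical cost model for deciding whether to vectorize sum().
struct SumCostModel {
    int xpr_cost;     // estimated cost of evaluating one coefficient (or one packet)
    int size;         // number of coefficients in the expression
    int packet_size;  // number of scalars per packet
    int add_cost;     // cost of one addition (scalar or packet)
    int predux_cost;  // cost of the final horizontal reduction of a packet
};

// Scalar path: evaluate every coefficient, then size-1 additions.
inline int scalar_sum_cost(const SumCostModel& m) {
    return m.size * m.xpr_cost + (m.size - 1) * m.add_cost;
}

// Vectorized path: evaluate every packet, add the packets together,
// then pay once for the horizontal reduction.
inline int vectorized_sum_cost(const SumCostModel& m) {
    const int packets = m.size / m.packet_size;
    return packets * m.xpr_cost + (packets - 1) * m.add_cost + m.predux_cost;
}

// Vectorize only when the estimate says it is strictly cheaper.
inline bool should_vectorize_sum(const SumCostModel& m) {
    return vectorized_sum_cost(m) < scalar_sum_cost(m);
}
```

With illustrative numbers, a plain Vector2d (xpr_cost 1, size 2, packet
size 2, add_cost 1, predux_cost 3) comes out as not worth vectorizing,
while a Vector4f expression with xpr_cost 8 and predux_cost 4 is still
vectorized, which is the behaviour wanted for the example above.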

