Re: [eigen] Re: small sums: vectorization not worth it
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] Re: small sums: vectorization not worth it
- From: "Benoit Jacob" <jacob.benoit.1@xxxxxxxxx>
- Date: Sat, 17 Jan 2009 14:38:06 +0100
2009/1/17 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
> hm, for a vector of size 64 your result does not make sense. Actually I
> checked the generated assembly, and your benchmark has several issues,
> so you are not really benchmarking sum. First, the first expression
> (v = one + v * 1e-10) was not properly inlined when vectorization was
> enabled. Second, this first expression is more costly than sum. Third,
> you call sum twice on (almost) the same data, so perhaps the compiler
> manages to remove most of the computation of the second call to sum,
> but only when vectorization is disabled.
>
> So I rewrote the benchmark (see attached file), and now I get a slight
> speedup for double/64 and more than 2x for float/64. The generated
> assembly is pretty good in both cases, so why doesn't the vectorization
> lead to a higher speedup?
>
> Actually, the non-vectorized meta-unroller of sum is much cleverer than
> the vectorized one, because it reduces the dependencies between
> instructions using a recursive divide-and-conquer strategy, while the
> vectorized one simply accumulates the coefficients in a single register.
I didn't realize it was so important.
So should we have a similar strategy in the product innervec?
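To make the difference concrete, here is a standalone toy sketch of the
two strategies (this is not Eigen's actual unroller code, and the class
names are made up for illustration):

#include <cstdio>

// Minimal stand-in for a fixed-size vector, just enough for the sketch.
struct Vec8
{
  typedef double Scalar;
  double data[8];
  double coeff(int i) const { return data[i]; }
};

// Linear accumulation over [Start, Start+Length): a single dependency
// chain, each addition has to wait for the previous one.
template<typename Derived, int Start, int Length>
struct LinearSumUnroller
{
  static typename Derived::Scalar run(const Derived& v)
  {
    return LinearSumUnroller<Derived, Start, Length-1>::run(v)
         + v.coeff(Start + Length - 1);
  }
};
template<typename Derived, int Start>
struct LinearSumUnroller<Derived, Start, 1>
{
  static typename Derived::Scalar run(const Derived& v)
  { return v.coeff(Start); }
};

// Divide and conquer: the two halves are independent, so the CPU can
// overlap their partial sums (more instruction-level parallelism).
template<typename Derived, int Start, int Length>
struct TreeSumUnroller
{
  enum { Half = Length / 2 };
  static typename Derived::Scalar run(const Derived& v)
  {
    return TreeSumUnroller<Derived, Start, Half>::run(v)
         + TreeSumUnroller<Derived, Start + Half, Length - Half>::run(v);
  }
};
template<typename Derived, int Start>
struct TreeSumUnroller<Derived, Start, 1>
{
  static typename Derived::Scalar run(const Derived& v)
  { return v.coeff(Start); }
};

int main()
{
  Vec8 v = {{1, 2, 3, 4, 5, 6, 7, 8}};
  std::printf("linear: %g  tree: %g\n",
              LinearSumUnroller<Vec8, 0, 8>::run(v),
              TreeSumUnroller<Vec8, 0, 8>::run(v));
  return 0;
}

Both compute the same sum; the tree version just exposes independent
partial sums that an out-of-order CPU can execute in parallel.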
> Another possible issue was that the vectorized unroller looped over the
> coefficients in the wrong order (d,c,b,a instead of a,b,c,d).
>
> Anyway, I rewrote the vectorized meta-unroller to use the same good
> strategy, and now I get a significant speedup for double/64: 1.7x
> faster! And for float/64 I get 2.75x, not bad.
Wow, thanks a lot!
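In other words, at the packet level the trick is to keep independent
accumulators instead of a single one. Roughly, in raw SSE (illustrative
only, this is not the actual Eigen unroller), summing 16 floats:

#include <xmmintrin.h>

// Two independent accumulator chains instead of one, merged at the end,
// followed by a horizontal reduction of the final packet.
// p must point to 16 floats and be 16-byte aligned.
float sum16(const float* p)
{
  __m128 acc0 = _mm_add_ps(_mm_load_ps(p),     _mm_load_ps(p + 4));
  __m128 acc1 = _mm_add_ps(_mm_load_ps(p + 8), _mm_load_ps(p + 12));
  __m128 acc  = _mm_add_ps(acc0, acc1);
  // horizontal reduction: fold the high half onto the low half,
  // then add the remaining two lanes
  __m128 tmp = _mm_add_ps(acc, _mm_movehl_ps(acc, acc));
  tmp = _mm_add_ss(tmp, _mm_shuffle_ps(tmp, tmp, 1));
  return _mm_cvtss_f32(tmp);
}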
>
> For very small sizes, it is clear that, at least for Vector2d, it does
> not make sense to vectorize. For float, let's check.
>
> However, the reason I made that change recently is that if we don't
> vectorize Vector4f::sum(), then we might lose some other vectorization,
> e.g.:
>
> (0.5*a+0.25*b+0.25*c).sum();
>
> If we disable vectorization of sum for small sizes, then what we have
> to do in Sum.h is to automatically insert an evaluation:
>
> (0.5*a+0.25*b+0.25*c).eval().sum();
>
> Should be easy to do.
I don't understand.
Wouldn't it be easy, and much better, to add a more intelligent
heuristic based on the expression's cost and size, the packet size, the
cost of adding scalars, and perhaps on an ei_cost_of_predux<Scalar>::ret
trait that we may need to introduce?
So with your example, this sum would still be vectorized because the
xpr is costly enough.
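Something like this, as a rough sketch (ei_cost_of_predux is the
hypothetical new trait; the other names, SizeAtCompileTime,
CoeffReadCost, ei_packet_traits, NumTraits, are existing Eigen traits
written from memory, so take them with a grain of salt):

// hypothetical trait: cost of a horizontal reduction of one packet
template<typename Scalar>
struct ei_cost_of_predux
{
  enum { ret = 8 * NumTraits<Scalar>::AddCost }; // assumed placeholder,
                                                 // would need measuring
};

template<typename Derived>
struct ei_sum_vectorization_heuristic
{
  typedef typename Derived::Scalar Scalar;
  enum {
    Size       = Derived::SizeAtCompileTime,
    PacketSize = ei_packet_traits<Scalar>::size,
    // scalar path: read every coefficient, then Size-1 additions
    ScalarCost = Size * Derived::CoeffReadCost
               + (Size - 1) * NumTraits<Scalar>::AddCost,
    // vectorized path: evaluate Size/PacketSize packets, add them,
    // then one final horizontal reduction
    VectorCost = (Size / PacketSize) * Derived::CoeffReadCost
               + (Size / PacketSize - 1) * NumTraits<Scalar>::AddCost
               + ei_cost_of_predux<Scalar>::ret,
    UseVectorization = VectorCost < ScalarCost
  };
};

The idea being that for a plain Vector4f the fixed cost of the final
horizontal reduction dominates, so the scalar path wins, while for an
expression like 0.5*a+0.25*b+0.25*c the per-coefficient evaluation cost
dominates, so the vectorized path still wins.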
Benoit