Re: [eigen] Re: small sums: vectorization not worth it
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] Re: small sums: vectorization not worth it
- From: "Gael Guennebaud" <gael.guennebaud@xxxxxxxxx>
- Date: Sat, 17 Jan 2009 10:43:45 +0100

Hm, for a vector of size 64 your result does not make sense. Actually I
checked the generated assembly, and your benchmark has several issues
such that you are not really benchmarking sum. First, the first
expression (v = one + v * 1e-10) was not properly inlined when
vectorization was enabled. Second, this first expression is more
costly than sum. Third, you call sum twice on (almost) the same data,
so perhaps the compiler manages to remove most of the computation
of the second call to sum only when vectorization is disabled.

So I rewrote the benchmark, see attached file, and now I get a slight
speedup for double/64 and more than x2 for float/64. The generated
assembly is pretty good in both cases, so why does vectorization
not lead to a higher speedup?

Actually, the non-vectorized meta-unroller of sum is much cleverer than
the vectorized one because it reduces the dependencies between
instructions using a recursive divide-and-conquer strategy, while the
vectorized one simply accumulates the coeffs in a single register. See:
sum = a;
t1 = c;
sum += b;
t1 += d;
sum += t1;
versus:
sum = a;
sum += b;
sum += c;
sum += d;
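
To make the two orderings above concrete, here is a minimal standalone sketch in plain C++. It is not the code from Sum.h; the names LinearSum, TreeSum and check_small_case are hypothetical and chosen only for illustration. LinearSum unrolls into one long chain where every addition waits for the previous one, while TreeSum splits the range in two so that the two halves can be added independently.

```cpp
#include <cassert>
#include <cstddef>

// Linear unroller: ((v0 + v1) + v2) + v3 ...
// Every addition depends on the result of the previous one.
template <std::size_t N>
struct LinearSum {
    static double run(const double* v) {
        return LinearSum<N - 1>::run(v) + v[N - 1];
    }
};
template <>
struct LinearSum<1> {
    static double run(const double* v) { return v[0]; }
};

// Divide-and-conquer unroller: sum(first half) + sum(second half).
// The two halves share no intermediate result, so the dependency
// chain has depth about log2(N) instead of N - 1.
template <std::size_t N>
struct TreeSum {
    static double run(const double* v) {
        return TreeSum<N / 2>::run(v) + TreeSum<N - N / 2>::run(v + N / 2);
    }
};
template <>
struct TreeSum<1> {
    static double run(const double* v) { return v[0]; }
};

// Small self-check on inputs that are exactly representable,
// so both summation orders must give the same value.
inline void check_small_case() {
    const double v[4] = {1.0, 2.0, 3.0, 4.0};
    assert(LinearSum<4>::run(v) == 10.0);
    assert(TreeSum<4>::run(v) == 10.0);
}
```

Note that for general floating-point inputs the two orders may round differently, so equality is only guaranteed for cases like the one checked here.
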

Another possible issue was that the vectorized unroller looped over the
coeffs in the wrong order (d,c,b,a instead of a,b,c,d).

Anyway, I rewrote the vectorized meta-unroller to use the same good
strategy, and now I get a significant speedup for double/64: x1.7
faster! And for float/64 I get x2.75, not bad.

For very small sizes, it is clear that, at least for Vector2d, it does
not make sense to vectorize sum. For float, let's check.

However, the reason why I made that change recently is that if we don't
vectorize Vector4f::sum() then we might lose some other vectorization,
e.g.:
(0.5*a+0.25*b+0.25*c).sum();
If we disable vectorization of sum for small sizes, then what we have
to do in Sum.h is to automatically insert an evaluation:
(0.5*a+0.25*b+0.25*c).eval().sum();
Should be easy to do.
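
As a rough illustration of what inserting an evaluation means, here is a standalone sketch that does not use Eigen at all. Vec4, Blend, scalar_sum and check_blend are hypothetical stand-ins: Blend plays the role of the lazy expression 0.5*a+0.25*b+0.25*c, and eval() materializes it into a plain vector before the scalar sum runs. In a real SIMD library the loop inside eval() is the place where the packet path could still be used, even when the final reduction stays scalar.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 4-float vector, standing in for Vector4f.
struct Vec4 {
    std::array<float, 4> d;
    float coeff(std::size_t i) const { return d[i]; }
};

// Lazy expression 0.5*a + 0.25*b + 0.25*c: nothing is computed
// until coeff() is called.
struct Blend {
    const Vec4& a;
    const Vec4& b;
    const Vec4& c;
    float coeff(std::size_t i) const {
        return 0.5f * a.coeff(i) + 0.25f * b.coeff(i) + 0.25f * c.coeff(i);
    }
    // Materialize the expression into a plain vector.
    Vec4 eval() const {
        Vec4 r;
        for (std::size_t i = 0; i < 4; ++i) r.d[i] = coeff(i);
        return r;
    }
};

// Scalar (non-vectorized) sum over anything exposing coeff(i).
template <typename Expr>
float scalar_sum(const Expr& e) {
    float s = e.coeff(0);
    for (std::size_t i = 1; i < 4; ++i) s += e.coeff(i);
    return s;
}

// Self-check with exactly representable values:
// the blend is {3, 5, 7, 9}, which sums to 24.
inline void check_blend() {
    const Vec4 a{{4.f, 8.f, 12.f, 16.f}};
    const Vec4 b{{4.f, 4.f, 4.f, 4.f}};
    const Vec4 c{{0.f, 0.f, 0.f, 0.f}};
    const Blend e{a, b, c};
    assert(scalar_sum(e) == 24.0f);         // sum directly on the lazy expression
    assert(scalar_sum(e.eval()) == 24.0f);  // evaluate first, then sum
}
```
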
Gael.
2009/1/17 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
> Act 1, 4th and last scene:
>
> i wanted to see for which size it starts getting beneficial to vectorize...
>
> attached is one more example (sum.cpp) this time with size 64, type
> double. unrolling forced.
>
> Result: x87 is still 50% faster than SSE2.
>
> The SSE2 asm is:
>
> #APP
> # 12 "dot.cpp" 1
> #a
> # 0 "" 2
> #NO_APP
> movapd (%esi), %xmm1
> movapd -104(%ebp), %xmm0
> addpd -88(%ebp), %xmm0
> addpd -120(%ebp), %xmm0
> addpd -136(%ebp), %xmm0
> addpd -152(%ebp), %xmm0
> addpd -168(%ebp), %xmm0
> addpd -184(%ebp), %xmm0
> addpd -200(%ebp), %xmm0
> addpd -216(%ebp), %xmm0
> addpd -232(%ebp), %xmm0
> addpd -248(%ebp), %xmm0
> addpd -264(%ebp), %xmm0
> addpd -280(%ebp), %xmm0
> addpd -296(%ebp), %xmm0
> addpd -312(%ebp), %xmm0
> addpd -328(%ebp), %xmm0
> addpd -344(%ebp), %xmm0
> addpd -360(%ebp), %xmm0
> addpd -376(%ebp), %xmm0
> addpd -392(%ebp), %xmm0
> addpd -408(%ebp), %xmm0
> addpd -424(%ebp), %xmm0
> addpd -440(%ebp), %xmm0
> addpd -456(%ebp), %xmm0
> addpd -472(%ebp), %xmm0
> addpd -488(%ebp), %xmm0
> addpd -504(%ebp), %xmm0
> addpd -520(%ebp), %xmm0
> addpd -536(%ebp), %xmm0
> addpd -552(%ebp), %xmm0
> addpd -568(%ebp), %xmm0
> addpd %xmm0, %xmm1
> movapd %xmm1, %xmm2
> unpckhpd %xmm1, %xmm2
> addsd %xmm2, %xmm1
> movapd %xmm1, %xmm2
> movsd %xmm2, -584(%ebp)
> #APP
> # 14 "dot.cpp" 1
> #b
> # 0 "" 2
> #NO_APP
>
>
>
> 2009/1/16 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>> and in case you wonder: the situation is almost the same for dot product
>>
>> (attached file dot.cpp)
>>
>> runs 2x slower with SSE...
>>
>> Cheers,
>> Benoit
>>
>> 2009/1/16 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>> Note: i think the old heuristic was wrong anyway.
>>>
>>> Maybe take this occasion to introduce a EIGEN_COST_OF_PREDUX (since
>>> this cost depends greatly on the simd platform) ?
>>> And then do a natural heuristic rather than a quick hack like we used to have?
>>>
>>> Benoit
>>>
>>> 2009/1/16 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>>> Hi Gael *cough* List,
>>>>
>>>> ei_predux is costly because it consists of >1 SIMD instruction.
>>>>
>>>> So until recently we had sum() only vectorize if the size was big enough.
>>>> However this was recently changed.
>>>>
>>>> Attached is a benchmark that runs 2.5x slower with SSE (2 or 3) than
>>>> without. It's just Vector2d::sum().
>>>>
>>>> So, revert to old behavior?
>>>>
>>>> Moreover: matrix product innerVectorization also uses a ei_predux. Same here?
>>>>
>>>> Cheers,
>>>> Benoit
>>>>
>>>
>>
>