Re: [eigen] Re: small sums: vectorization not worth it
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] Re: small sums: vectorization not worth it
- From: "Gael Guennebaud" <gael.guennebaud@xxxxxxxxx>
- Date: Sat, 17 Jan 2009 10:54:04 +0100
I forgot to attach my modified benchmark.
On Sat, Jan 17, 2009 at 10:43 AM, Gael Guennebaud
<gael.guennebaud@xxxxxxxxx> wrote:
> Hm, for a vector of size 64 your result does not make sense. I
> checked the generated assembly, and your benchmark has several issues
> such that you are not really benchmarking sum(). First, the initial
> expression (v = one + v * 1e-10) was not properly inlined when
> vectorization was enabled. Second, that expression is more costly than
> sum() itself. Third, you call sum() twice on (almost) the same data,
> so perhaps the compiler manages to remove most of the computation of
> the second call, but only when vectorization is disabled.
>
> So I rewrote the benchmark (see attached file), and now I get a
> slight speed-up for double/64 and more than a 2x speed-up for
> float/64. The generated assembly is pretty good in both cases, so why
> doesn't the vectorization lead to a higher speed-up?
>
> Actually, the non-vectorized meta-unroller of sum() is much cleverer
> than the vectorized one: it reduces the dependencies between
> instructions using a recursive divide-and-conquer strategy, while the
> vectorized one simply accumulates the coefficients in a single
> register. Compare:
>
> sum = a;
> t1 = c;
> sum += b;
> t1 += d;
> sum += t1;
>
> versus:
>
> sum = a;
> sum += b;
> sum += c;
> sum += d;
>
> Another possible issue was that the vectorized unroller looped over
> the coefficients in the wrong order (d, c, b, a instead of a, b, c, d).
>
> Anyway, I rewrote the vectorized meta-unroller to use the same good
> strategy, and now I get a significant speed-up for double/64: x1.7
> faster! And for float/64 I get x2.75, not bad.
>
> For very small sizes it is clear that, at least for Vector2d, it does
> not make sense to vectorize. For float, let's check.
>
> However, the reason why I made that change recently is that if we
> don't vectorize Vector4f::sum(), then we might lose some other
> vectorization opportunities, e.g.:
>
> (0.5*a+0.25*b+0.25*c).sum();
>
> If we disable vectorization of sum() for small sizes, then what we
> have to do in Sum.h is to automatically insert an evaluation:
>
> (0.5*a+0.25*b+0.25*c).eval().sum();
>
> Should be easy to do.
>
>
> Gael.
>
> 2009/1/17 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>> Act 1, 4th and last scene:
>>
>> I wanted to see at which size it starts getting beneficial to vectorize...
>>
>> Attached is one more example (sum.cpp), this time with size 64 and
>> type double. Unrolling forced.
>>
>> Result: x87 is still 50% faster than SSE2.
>>
>> The SSE2 asm is:
>>
>> #APP
>> # 12 "dot.cpp" 1
>> #a
>> # 0 "" 2
>> #NO_APP
>> movapd (%esi), %xmm1
>> movapd -104(%ebp), %xmm0
>> addpd -88(%ebp), %xmm0
>> addpd -120(%ebp), %xmm0
>> addpd -136(%ebp), %xmm0
>> addpd -152(%ebp), %xmm0
>> addpd -168(%ebp), %xmm0
>> addpd -184(%ebp), %xmm0
>> addpd -200(%ebp), %xmm0
>> addpd -216(%ebp), %xmm0
>> addpd -232(%ebp), %xmm0
>> addpd -248(%ebp), %xmm0
>> addpd -264(%ebp), %xmm0
>> addpd -280(%ebp), %xmm0
>> addpd -296(%ebp), %xmm0
>> addpd -312(%ebp), %xmm0
>> addpd -328(%ebp), %xmm0
>> addpd -344(%ebp), %xmm0
>> addpd -360(%ebp), %xmm0
>> addpd -376(%ebp), %xmm0
>> addpd -392(%ebp), %xmm0
>> addpd -408(%ebp), %xmm0
>> addpd -424(%ebp), %xmm0
>> addpd -440(%ebp), %xmm0
>> addpd -456(%ebp), %xmm0
>> addpd -472(%ebp), %xmm0
>> addpd -488(%ebp), %xmm0
>> addpd -504(%ebp), %xmm0
>> addpd -520(%ebp), %xmm0
>> addpd -536(%ebp), %xmm0
>> addpd -552(%ebp), %xmm0
>> addpd -568(%ebp), %xmm0
>> addpd %xmm0, %xmm1
>> movapd %xmm1, %xmm2
>> unpckhpd %xmm1, %xmm2
>> addsd %xmm2, %xmm1
>> movapd %xmm1, %xmm2
>> movsd %xmm2, -584(%ebp)
>> #APP
>> # 14 "dot.cpp" 1
>> #b
>> # 0 "" 2
>> #NO_APP
>>
>>
>>
>> 2009/1/16 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>> and in case you wonder: the situation is almost the same for dot product
>>>
>>> (attached file dot.cpp)
>>>
>>> runs 2x slower with SSE...
>>>
>>> Cheers,
>>> Benoit
>>>
>>> 2009/1/16 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>>> Note: I think the old heuristic was wrong anyway.
>>>>
>>>> Maybe take this occasion to introduce an EIGEN_COST_OF_PREDUX
>>>> (since this cost depends greatly on the SIMD platform)?
>>>> And then use a natural heuristic rather than the quick hack we used to have?
>>>>
>>>> Benoit
>>>>
>>>> 2009/1/16 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>>>> Hi Gael *cough* List,
>>>>>
>>>>> ei_predux is costly because it consists of more than one SIMD instruction.
>>>>>
>>>>> So until recently sum() was only vectorized if the size was big
>>>>> enough. However, this was recently changed.
>>>>>
>>>>> Attached is a benchmark that runs 2.5x slower with SSE (2 or 3) than
>>>>> without. It's just Vector2d::sum().
>>>>>
>>>>> So, revert to old behavior?
>>>>>
>>>>> Moreover: matrix product innerVectorization also uses an ei_predux. Same here?
>>>>>
>>>>> Cheers,
>>>>> Benoit
>>>>>
>>>>
>>>
>>
>
#define EIGEN_UNROLLING_LIMIT 10000
#include <Eigen/Core>
#include <vector>

typedef Eigen::Matrix<double, 64, 1> T;

EIGEN_DONT_INLINE void foo(std::vector<T>& vs);

int main()
{
  T v;
  int n = 500 / v.size();      // a few vectors, so the working set stays in cache
  std::vector<T> vs(n);
  for (int i = 0; i < n; ++i)
    vs[i].setZero();
  for (int i = 0; i < 10000000; i++)
  {
    foo(vs);
  }
  return 0;
}

// Kept out of line so the compiler cannot hoist the sums out of the timing loop.
EIGEN_DONT_INLINE void foo(std::vector<T>& vs)
{
  for (std::size_t i = 0; i < vs.size(); ++i)
    vs[i][0] = vs[i].sum();    // store the result so sum() is not optimized away
}