Re: [eigen] Re: small sums: vectorization not worth it



I forgot to attach my modified benchmark.

On Sat, Jan 17, 2009 at 10:43 AM, Gael Guennebaud
<gael.guennebaud@xxxxxxxxx> wrote:
> Hm, for a vector of size 64 your result does not make sense. I
> checked the generated assembly, and your benchmark has several issues
> such that you are not really benchmarking sum. First, the first
> expression (v = one + v * 1e-10) was not properly inlined when
> vectorization was enabled. Second, this first expression is more
> costly than sum itself. Third, you call sum twice on (almost) the same
> data, and so perhaps the compiler manages to remove most of the
> computation of the second call to sum, but only when vectorization is
> disabled.
>
> So I rewrote the benchmark (see attached file), and now I get a slight
> speed-up for double/64 and more than x2 for float/64. The generated
> assembly is pretty good in both cases, so why does the vectorization
> not lead to a higher speed-up?
>
> Actually, the non-vectorized meta-unroller of sum is much cleverer than
> the vectorized one, because it reduces the dependencies between the
> instructions using a recursive divide-and-conquer strategy, while the
> vectorized one simply accumulates the coefficients in a single register. See:
>
> sum = a;
> t1 = c;
> sum +=  b;
> t1 += d;
> sum += t1;
>
> versus:
>
> sum = a;
> sum += b;
> sum += c;
> sum += d;
>
> Another possible issue was that the vectorized unroller looped over the
> coefficients in the wrong order (d,c,b,a instead of a,b,c,d).
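>
> In case it helps to picture the first strategy, here is a minimal
> sketch of such a divide-and-conquer meta-unroller (illustrative names
> and a simplified interface, not the actual Sum.h code):
>
> template<typename Derived, int Start, int Length>
> struct meta_sum
> {
>   enum { HalfLength = Length / 2 };
>   static typename Derived::Scalar run(const Derived& v)
>   {
>     // the two halves are summed independently, so the CPU can execute
>     // them in parallel before the final combining add
>     return meta_sum<Derived, Start, HalfLength>::run(v)
>          + meta_sum<Derived, Start + HalfLength, Length - HalfLength>::run(v);
>   }
> };
>
> template<typename Derived, int Start>
> struct meta_sum<Derived, Start, 1>
> {
>   static typename Derived::Scalar run(const Derived& v)
>   { return v.coeff(Start); }
> };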
>
> Anyway, I rewrote the vectorized meta-unroller to use the same good
> strategy, and now I get a significant speed-up for double/64: x1.7
> faster! And for float/64 I get x2.75, not bad.
>
> For very small sizes, it is clear that, at least for Vector2d, it does
> not make sense to vectorize. For float, let's check.
>
> However, the reason why I made that change recently is that if we don't
> vectorize Vector4f::sum(), then we might lose some other vectorization,
> e.g.:
>
> (0.5*a+0.25*b+0.25*c).sum();
>
> If we disable vectorization of sum for small sizes, then what we have
> to do in Sum.h is to automatically insert an evaluation:
>
> (0.5*a+0.25*b+0.25*c).eval().sum();
>
> Should be easy to do.
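>
> To make the intent concrete, here is a sketch of the user-visible
> effect (the function and its inputs are made up for illustration):
>
> #include <Eigen/Core>
> using Eigen::Vector4f;
>
> float blend(const Vector4f& a, const Vector4f& b, const Vector4f& c)
> {
>   // evaluating the cwise expression into a temporary first keeps the
>   // 0.5*a + 0.25*b + 0.25*c part vectorized; only the final horizontal
>   // reduction then runs without SIMD
>   return (0.5f*a + 0.25f*b + 0.25f*c).eval().sum();
> }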
>
>
> Gael.
>
> 2009/1/17 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>> Act 1, 4th and last scene:
>>
>> I wanted to see for which size it starts getting beneficial to vectorize...
>>
>> Attached is one more example (sum.cpp), this time with size 64, type
>> double, unrolling forced.
>>
>> Result: x87 is still 50% faster than SSE2.
>>
>> The SSE2 asm is:
>>
>> #APP
>> # 12 "dot.cpp" 1
>>        #a
>> # 0 "" 2
>> #NO_APP
>>        movapd  (%esi), %xmm1
>>        movapd  -104(%ebp), %xmm0
>>        addpd   -88(%ebp), %xmm0
>>        addpd   -120(%ebp), %xmm0
>>        addpd   -136(%ebp), %xmm0
>>        addpd   -152(%ebp), %xmm0
>>        addpd   -168(%ebp), %xmm0
>>        addpd   -184(%ebp), %xmm0
>>        addpd   -200(%ebp), %xmm0
>>        addpd   -216(%ebp), %xmm0
>>        addpd   -232(%ebp), %xmm0
>>        addpd   -248(%ebp), %xmm0
>>        addpd   -264(%ebp), %xmm0
>>        addpd   -280(%ebp), %xmm0
>>        addpd   -296(%ebp), %xmm0
>>        addpd   -312(%ebp), %xmm0
>>        addpd   -328(%ebp), %xmm0
>>        addpd   -344(%ebp), %xmm0
>>        addpd   -360(%ebp), %xmm0
>>        addpd   -376(%ebp), %xmm0
>>        addpd   -392(%ebp), %xmm0
>>        addpd   -408(%ebp), %xmm0
>>        addpd   -424(%ebp), %xmm0
>>        addpd   -440(%ebp), %xmm0
>>        addpd   -456(%ebp), %xmm0
>>        addpd   -472(%ebp), %xmm0
>>        addpd   -488(%ebp), %xmm0
>>        addpd   -504(%ebp), %xmm0
>>        addpd   -520(%ebp), %xmm0
>>        addpd   -536(%ebp), %xmm0
>>        addpd   -552(%ebp), %xmm0
>>        addpd   -568(%ebp), %xmm0
>>        addpd   %xmm0, %xmm1
>>        movapd  %xmm1, %xmm2
>>        unpckhpd        %xmm1, %xmm2
>>        addsd   %xmm2, %xmm1
>>        movapd  %xmm1, %xmm2
>>        movsd   %xmm2, -584(%ebp)
>> #APP
>> # 14 "dot.cpp" 1
>>        #b
>> # 0 "" 2
>> #NO_APP
>>
>>
>>
>> 2009/1/16 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>> And in case you wonder: the situation is almost the same for the dot product
>>>
>>> (attached file dot.cpp)
>>>
>>> runs 2x slower with SSE...
>>>
>>> Cheers,
>>> Benoit
>>>
>>> 2009/1/16 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>>> Note: I think the old heuristic was wrong anyway.
>>>>
>>>> Maybe take this occasion to introduce an EIGEN_COST_OF_PREDUX (since
>>>> this cost depends greatly on the SIMD platform)?
>>>> And then use a natural heuristic rather than the quick hack we used to have?
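>>>>
>>>> A sketch of what such a cost-based heuristic could look like (the
>>>> macro name is from this mail, but the formula, the trait name and the
>>>> default value are assumptions, not Eigen code):
>>>>
>>>> #ifndef EIGEN_COST_OF_PREDUX
>>>> #define EIGEN_COST_OF_PREDUX 4 // per-platform cost of one horizontal add
>>>> #endif
>>>>
>>>> template<int Size, int PacketSize>
>>>> struct ei_should_vectorize_sum
>>>> {
>>>>   enum {
>>>>     ScalarCost = Size - 1,              // Size-1 scalar adds
>>>>     VectorCost = Size / PacketSize - 1  // packet adds...
>>>>                + EIGEN_COST_OF_PREDUX,  // ...plus the final predux
>>>>     ret = VectorCost < ScalarCost       // e.g. false for Vector2d
>>>>   };
>>>> };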
>>>>
>>>> Benoit
>>>>
>>>> 2009/1/16 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>>>> Hi Gael *cough* List,
>>>>>
>>>>> ei_predux is costly because it consists of more than one SIMD instruction.
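>>>>>
>>>>> For a packet of two doubles under SSE2, it amounts to something like
>>>>> this intrinsics sketch (the helper name is made up; the actual
>>>>> implementation may differ in details):
>>>>>
>>>>> #include <emmintrin.h>
>>>>>
>>>>> static inline double predux_pd(__m128d x)
>>>>> {
>>>>>   __m128d hi = _mm_unpackhi_pd(x, x);      // move the high double down
>>>>>   return _mm_cvtsd_f64(_mm_add_sd(x, hi)); // add and extract the scalar
>>>>> }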
>>>>>
>>>>> So until recently, we had sum() vectorize only if the size was big enough.
>>>>> However, this was recently changed.
>>>>>
>>>>> Attached is a benchmark that runs 2.5x slower with SSE (2 or 3) than
>>>>> without. It's just Vector2d::sum().
>>>>>
>>>>> So, revert to old behavior?
>>>>>
>>>>> Moreover: matrix product innerVectorization also uses an ei_predux. Same here?
>>>>>
>>>>> Cheers,
>>>>> Benoit
>>>>>
>>>>
>>>
>>
>
#define EIGEN_UNROLLING_LIMIT 10000 // force full unrolling of the size-64 sum
#include <Eigen/Core>
#include <cstddef>
#include <vector>

typedef Eigen::Matrix<double, 64, 1> T;
// fixed-size vectorizable Eigen types stored in a std::vector need the
// aligned allocator, otherwise aligned SSE loads may fault
typedef std::vector<T, Eigen::aligned_allocator<T> > VecT;

EIGEN_DONT_INLINE void foo(VecT& vs);

int main()
{
  T v;
  int n = 500 / v.size(); // keep the working set small enough to stay in cache
  VecT vs(n);
  for (int i = 0; i < n; ++i)
    vs[i].setZero();
  // repeat the kernel many times so the run is long enough to time
  for (int i = 0; i < 10000000; i++)
  {
    foo(vs);
  }
  return 0;
}

// kept out of line so the compiler cannot fold the sums across iterations
EIGEN_DONT_INLINE void foo(VecT& vs)
{
  for (std::size_t i = 0; i < vs.size(); ++i)
    vs[i][0] = vs[i].sum(); // writing back into element 0 keeps a dependency
}
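
(For reference, the x87-vs-SSE2 comparison in this thread was presumably
done with builds along these lines; the exact command lines are not given
in the mails, and the include path is a placeholder:

  g++ -O2 -msse2 sum.cpp -I/path/to/eigen -o sum_sse
  g++ -O2 -DEIGEN_DONT_VECTORIZE sum.cpp -I/path/to/eigen -o sum_x87

EIGEN_DONT_VECTORIZE is Eigen's standard macro for disabling its explicit
vectorization.)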

