Re: [eigen] generic unrollers

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]


>> , for the a+b and 2*a cases I'll
>> write an exhaustive benchmark... If there is no obvious reason to eval a+b
>> for a 2x2 product then it might be better to not eval since this allows the
>> user to perform fine tuning for his specific case that is not possible if
>> we do (abusive?) evaluation.
>
> It's great if you do a benchmark, I don't see any other way of moving forward!
>

here you go (see attached files). So M,N,K denotes the size of the
matrix product:

MxN  = MxK  *   KxN.

I benchmarked  both (a+b)*c and (2*a)*c,  with 4 different conditions:
the current one with "<", the same with "<=", never evaluate, and
evaluate if N>1 (e.i. if a coeff is read at least twice). I compiled
with gcc-4.2, -O3 -DNDEBUG using float and vectorization enabled.

So for this benchmark it is quite clear that, as expected, "<=" works
much better than the current "<".  But surprisingly, N>1, which
implies the evaluation of (2*a) with N==2 works even slightly better !
 This is probably because the compiler can cache the temporaries into
the registers (I have a 64bits CPU, so 16 SSE registers). In that case
counting for the extra loads and stores is wrong. So we could try this
one:

    r*SC <= (r-1) * RC

which basically means let's forget the extra store and evaluate even
if it does not look really better (equality).  In practice this should
give better results (at least for gcc-4.2 with a lot of floating point
registers).

Gael.

Attachment: EigenCostModel.ods
Description: application/vnd.oasis.opendocument.spreadsheet

// g++ -O3 -DNDEBUG -DMATSIZE=<x> benchmark.cpp -o benchmark && time ./benchmark
#include <Eigen/Core>
#include <Eigen/Array>
#include "BenchTimer.h"

using namespace std;
using namespace Eigen;
// USING_PART_OF_NAMESPACE_EIGEN

void consume(void*);

template<typename Scalar, int M, int N, int K>
void bench(void)
{
  Matrix<Scalar,M,K> a, b;
  Matrix<Scalar,K,N> c;
  Matrix<Scalar,M,N> d;
  a.setRandom();
  b.setRandom();
  c.setRandom();
  d.setRandom();
  Scalar s = ei_random<Scalar>();

  int nrepeats = 1000000 / (M*N*K);

  std::cout << typeid(Scalar).name() << "  " << M << " x " << N << " x " << K << ":\n";

  BenchTimer timer;
  timer.reset();
  for (uint k=0; k<10; ++k)
  {
    a.setRandom();
    b.setRandom();
    c.setRandom();
    d.setRandom();
    s = ei_random<Scalar>();
    timer.start();
    for (uint i=0; i<nrepeats; ++i)
    {
      d += (a + b) * c;
      a *= 0.99;
      d += (a + b) * c;
      a *= 0.99;
      d += (a + b) * c;
      a *= 0.99;
      d += (a + b) * c;
      a *= 0.99;
      d += (a + b) * c;
      a *= 0.99;
    }
    timer.stop();
    consume(d.data());
  }
  std::cout << "  d += (a + b) * c  :" << timer.value() << "\n";

  nrepeats /= 2;
  timer.reset();
  for (uint k=0; k<10; ++k)
  {
    a.setRandom();
    b.setRandom();
    c.setRandom();
    d.setRandom();
    s = ei_random<Scalar>();
    timer.start();
    for (uint i=0; i<nrepeats; ++i)
    {
      d += (s * a) * c;
      a *= 0.99;
      d += (s * a) * c;
      a *= 0.99;
      d += (s * a) * c;
      a *= 0.99;
      d += (s * a) * c;
      a *= 0.99;
      d += (s * a) * c;
      a *= 0.99;
    }
    timer.stop();
    consume(d.data());
  }
  std::cout << "  d += (s * a) * c  :" << timer.value() << "\n";
}

int main(int argc, char *argv[])
{
    bench<float,2,2,2>();
    bench<float,2,2,4>();
    bench<float,4,2,4>();
    bench<float,6,2,6>();
    bench<float,8,2,8>();

    bench<float,2,3,2>();
    bench<float,2,3,4>();
    bench<float,4,3,4>();
    bench<float,6,3,6>();
    bench<float,8,3,8>();

    bench<float,2,4,2>();
    bench<float,2,4,4>();
    bench<float,4,4,4>();
    bench<float,6,4,6>();
    bench<float,8,4,8>();

    return 0;
}

void consume(void* ptr)
{
}



Mail converted by MHonArc 2.6.19+ http://listengine.tuxfamily.org/