| Re: [eigen] generic unrollers |
>> , for the a+b and 2*a cases I'll
>> write an exhaustive benchmark... If there is no obvious reason to eval a+b
>> for a 2x2 product then it might be better to not eval since this allows the
>> user to perform fine tuning for his specific case that is not possible if
>> we do (abusive?) evaluation.
>
> It's great if you do a benchmark, I don't see any other way of moving forward!
>
Here you go (see attached files). M, N and K denote the sizes in the
matrix product:
MxN = MxK * KxN.
I benchmarked both (a+b)*c and (2*a)*c under 4 different conditions:
the current one with "<", the same with "<=", never evaluate, and
evaluate if N>1 (i.e. if a coeff is read at least twice). I compiled
with gcc-4.2, -O3 -DNDEBUG, using float and with vectorization enabled.
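To make the "read at least twice" condition concrete, here is a minimal sketch of a naive product loop that only counts accesses to the left factor. It does not use Eigen: `lhs_reads_per_coeff` is a hypothetical helper, and the real unrolled or vectorized kernels may organize their loads differently.

```cpp
#include <vector>

// Hypothetical helper, for illustration only.
// Counts how often each coefficient of the MxK left factor is read by a
// naive triple-loop product MxN = MxK * KxN.
// Returns the common read count, or -1 if the counts differ.
int lhs_reads_per_coeff(int M, int N, int K)
{
    std::vector<int> reads(M * K, 0);
    if (reads.empty()) return 0;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k)
                ++reads[i * K + k];   // lhs(i,k) is loaded once for d(i,j)
    for (int n : reads)
        if (n != reads[0]) return -1;
    return reads[0];
}
```

In this naive scheme each coefficient of the left factor is read N times, so for the sizes in the attached benchmark (N = 2, 3, 4) the N>1 rule always evaluates the nested expression.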
So for this benchmark it is quite clear that, as expected, "<=" works
much better than the current "<". But surprisingly, N>1, which
implies evaluating (2*a) when N==2, works even slightly better!
This is probably because the compiler can keep the temporaries in
registers (I have a 64-bit CPU, so 16 SSE registers). In that case
accounting for the extra loads and stores is wrong. So we could try this
one:
r*SC <= (r-1) * RC
which basically means: let's forget the extra store and evaluate even
if it does not look strictly better (equality). In practice this should
give better results (at least for gcc-4.2 with a lot of floating point
registers).
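As a rough sketch, three of the rules discussed above can be written as compile-time predicates. This is not the actual Eigen code: the function names are made up, and the reading of the symbols is an assumption (r = number of times each coefficient of the nested expression is read, SC = cost of reading one scalar back from an evaluated temporary, RC = cost of computing one coefficient of the unevaluated expression). The cost values in the checks are hypothetical, too.

```cpp
// Sketch only: names and the meaning of r, SC, RC are assumptions.
// r  : reads per coefficient of the nested expression
// SC : assumed cost of reading a scalar from an evaluated temporary
// RC : assumed cost of computing one coefficient without evaluating

// "never evaluate"
constexpr bool eval_never(int, int, int) { return false; }

// "evaluate if N>1", i.e. as soon as a coefficient is read twice
constexpr bool eval_if_read_twice(int r, int, int) { return r > 1; }

// proposed rule: r*SC <= (r-1)*RC, ignoring the extra store
constexpr bool eval_proposed(int r, int SC, int RC)
{
    return r * SC <= (r - 1) * RC;
}

// Hypothetical costs: SC = 1, and RC = 2 for 2*a (one read, one multiply).
static_assert(eval_proposed(2, 1, 2), "equality case: evaluate");
static_assert(!eval_proposed(1, 1, 2), "single read: do not evaluate");
```

With these hypothetical costs the proposed rule gives the same decision as N>1 for (2*a)*c, which is the case where the benchmark favoured evaluation.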
Gael.
Attachment:
EigenCostModel.ods
Description: application/vnd.oasis.opendocument.spreadsheet
// g++ -O3 -DNDEBUG -DMATSIZE=<x> benchmark.cpp -o benchmark && time ./benchmark
#include <iostream>  // std::cout
#include <typeinfo>  // typeid
#include <Eigen/Core>
#include <Eigen/Array>
#include "BenchTimer.h"
using namespace std;
using namespace Eigen;
// USING_PART_OF_NAMESPACE_EIGEN
void consume(void*);
template<typename Scalar, int M, int N, int K>
void bench(void)
{
Matrix<Scalar,M,K> a, b;
Matrix<Scalar,K,N> c;
Matrix<Scalar,M,N> d;
a.setRandom();
b.setRandom();
c.setRandom();
d.setRandom();
Scalar s = ei_random<Scalar>();
int nrepeats = 1000000 / (M*N*K);
std::cout << typeid(Scalar).name() << " " << M << " x " << N << " x " << K << ":\n";
BenchTimer timer;
timer.reset();
for (int k=0; k<10; ++k)
{
a.setRandom();
b.setRandom();
c.setRandom();
d.setRandom();
s = ei_random<Scalar>();
timer.start();
for (int i=0; i<nrepeats; ++i)
{
d += (a + b) * c;
a *= 0.99;
d += (a + b) * c;
a *= 0.99;
d += (a + b) * c;
a *= 0.99;
d += (a + b) * c;
a *= 0.99;
d += (a + b) * c;
a *= 0.99;
}
timer.stop();
consume(d.data());
}
std::cout << " d += (a + b) * c :" << timer.value() << "\n";
nrepeats /= 2;
timer.reset();
for (int k=0; k<10; ++k)
{
a.setRandom();
b.setRandom();
c.setRandom();
d.setRandom();
s = ei_random<Scalar>();
timer.start();
for (int i=0; i<nrepeats; ++i)
{
d += (s * a) * c;
a *= 0.99;
d += (s * a) * c;
a *= 0.99;
d += (s * a) * c;
a *= 0.99;
d += (s * a) * c;
a *= 0.99;
d += (s * a) * c;
a *= 0.99;
}
timer.stop();
consume(d.data());
}
std::cout << " d += (s * a) * c :" << timer.value() << "\n";
}
int main(int argc, char *argv[])
{
bench<float,2,2,2>();
bench<float,2,2,4>();
bench<float,4,2,4>();
bench<float,6,2,6>();
bench<float,8,2,8>();
bench<float,2,3,2>();
bench<float,2,3,4>();
bench<float,4,3,4>();
bench<float,6,3,6>();
bench<float,8,3,8>();
bench<float,2,4,2>();
bench<float,2,4,4>();
bench<float,4,4,4>();
bench<float,6,4,6>();
bench<float,8,4,8>();
return 0;
}
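// Intentionally empty: passing d.data() here is meant to keep the compiler
// from optimizing the benchmarked products away.
void consume(void* ptr)
{
(void)ptr;
}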