Re: [eigen] generic unrollers |
[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]
>> , for the a+b and 2*a cases I'll >> write an exhaustive benchmark... If there is no obvious reason to eval a+b >> for a 2x2 product then it might be better to not eval since this allows the >> user to perform fine tuning for his specific case that is not possible if >> we do (abusive?) evaluation. > > It's great if you do a benchmark, I don't see any other way of moving forward! > here you go (see attached files). So M,N,K denotes the size of the matrix product: MxN = MxK * KxN. I benchmarked both (a+b)*c and (2*a)*c, with 4 different conditions: the current one with "<", the same with "<=", never evaluate, and evaluate if N>1 (e.i. if a coeff is read at least twice). I compiled with gcc-4.2, -O3 -DNDEBUG using float and vectorization enabled. So for this benchmark it is quite clear that, as expected, "<=" works much better than the current "<". But surprisingly, N>1, which implies the evaluation of (2*a) with N==2 works even slightly better ! This is probably because the compiler can cache the temporaries into the registers (I have a 64bits CPU, so 16 SSE registers). In that case counting for the extra loads and stores is wrong. So we could try this one: r*SC <= (r-1) * RC which basically means let's forget the extra store and evaluate even if it does not look really better (equality). In practice this should give better results (at least for gcc-4.2 with a lot of floating point registers). Gael.
Attachment:
EigenCostModel.ods
Description: application/vnd.oasis.opendocument.spreadsheet
// g++ -O3 -DNDEBUG -DMATSIZE=<x> benchmark.cpp -o benchmark && time ./benchmark #include <Eigen/Core> #include <Eigen/Array> #include "BenchTimer.h" using namespace std; using namespace Eigen; // USING_PART_OF_NAMESPACE_EIGEN void consume(void*); template<typename Scalar, int M, int N, int K> void bench(void) { Matrix<Scalar,M,K> a, b; Matrix<Scalar,K,N> c; Matrix<Scalar,M,N> d; a.setRandom(); b.setRandom(); c.setRandom(); d.setRandom(); Scalar s = ei_random<Scalar>(); int nrepeats = 1000000 / (M*N*K); std::cout << typeid(Scalar).name() << " " << M << " x " << N << " x " << K << ":\n"; BenchTimer timer; timer.reset(); for (uint k=0; k<10; ++k) { a.setRandom(); b.setRandom(); c.setRandom(); d.setRandom(); s = ei_random<Scalar>(); timer.start(); for (uint i=0; i<nrepeats; ++i) { d += (a + b) * c; a *= 0.99; d += (a + b) * c; a *= 0.99; d += (a + b) * c; a *= 0.99; d += (a + b) * c; a *= 0.99; d += (a + b) * c; a *= 0.99; } timer.stop(); consume(d.data()); } std::cout << " d += (a + b) * c :" << timer.value() << "\n"; nrepeats /= 2; timer.reset(); for (uint k=0; k<10; ++k) { a.setRandom(); b.setRandom(); c.setRandom(); d.setRandom(); s = ei_random<Scalar>(); timer.start(); for (uint i=0; i<nrepeats; ++i) { d += (s * a) * c; a *= 0.99; d += (s * a) * c; a *= 0.99; d += (s * a) * c; a *= 0.99; d += (s * a) * c; a *= 0.99; d += (s * a) * c; a *= 0.99; } timer.stop(); consume(d.data()); } std::cout << " d += (s * a) * c :" << timer.value() << "\n"; } int main(int argc, char *argv[]) { bench<float,2,2,2>(); bench<float,2,2,4>(); bench<float,4,2,4>(); bench<float,6,2,6>(); bench<float,8,2,8>(); bench<float,2,3,2>(); bench<float,2,3,4>(); bench<float,4,3,4>(); bench<float,6,3,6>(); bench<float,8,3,8>(); bench<float,2,4,2>(); bench<float,2,4,4>(); bench<float,4,4,4>(); bench<float,6,4,6>(); bench<float,8,4,8>(); return 0; } void consume(void* ptr) { }
Mail converted by MHonArc 2.6.19+ | http://listengine.tuxfamily.org/ |