Re: [eigen] generic unrollers
Hi,

Thanks a lot for the numbers! I tried to run the benchmark but was missing the BenchTimer.h file.

As you say, it would be very interesting to also test r*SC <= (r-1) * RC, and also to test without vectorization. We obtained different results in the past, so I'd like to be able to run the test myself (I have a 32bit core1).

It's very interesting that evaluating as soon as N>1 gives the best performance. I had noticed, too, that it was often better to evaluate more rather than less; this realization is what can make Eigen much faster than other ET-based libs, as they tend to go blindly for lazy evaluation.

If we go for eval-as-soon-as-N>1, then we should just split ei_nested into two structs not taking any N argument.

Cheers,
Benoit

On Monday 16 June 2008 00:07:22 Gael Guennebaud wrote:
> >> , for the a+b and 2*a cases I'll
> >> write an exhaustive benchmark... If there is no obvious reason to eval
> >> a+b for a 2x2 product then it might be better to not eval since this
> >> allows the user to perform fine tuning for his specific case that is not
> >> possible if we do (abusive?) evaluation.
> >
> > It's great if you do a benchmark, I don't see any other way of moving
> > forward!
>
> here you go (see attached files). So M,N,K denotes the size of the
> matrix product:
>
> MxN = MxK * KxN.
>
> I benchmarked both (a+b)*c and (2*a)*c, with 4 different conditions:
> the current one with "<", the same with "<=", never evaluate, and
> evaluate if N>1 (i.e. if a coeff is read at least twice). I compiled
> with gcc-4.2, -O3 -DNDEBUG using float and vectorization enabled.
>
> So for this benchmark it is quite clear that, as expected, "<=" works
> much better than the current "<". But surprisingly, N>1, which
> implies the evaluation of (2*a) with N==2, works even slightly better!
> This is probably because the compiler can cache the temporaries into
> the registers (I have a 64-bit CPU, so 16 SSE registers). In that case
> counting for the extra loads and stores is wrong. So we could try this
> one:
>
> r*SC <= (r-1) * RC
>
> which basically means let's forget the extra store and evaluate even
> if it does not look really better (equality). In practice this should
> give better results (at least for gcc-4.2 with a lot of floating point
> registers).
>
> Gael.
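
To make the two candidate rules discussed above concrete, here is a minimal C++ sketch. It is not the actual ei_nested code. The function names are hypothetical, and the meanings of the symbols are assumptions, since this message does not define them: r is taken to be the number of times each coefficient of the nested expression is read, SC the cost of reading one stored scalar from a temporary, and RC the cost of reading one coefficient of the unevaluated expression. The cost values in the checks are illustrative only.

```cpp
// Sketch of the two "evaluate into a temporary?" rules from this thread.
// All names and cost values here are hypothetical illustrations.

// Proposed rule: r*SC <= (r-1)*RC.
// Assumed meanings: r  = reads per coefficient of the nested expression,
//                   SC = cost of reading a scalar from the temporary,
//                   RC = cost of reading a coefficient of the expression.
constexpr bool eval_if_proposed(int r, int SC, int RC) {
    return r * SC <= (r - 1) * RC;
}

// Simpler rule that did best in the benchmark: evaluate as soon as a
// coefficient is read at least twice (written "N>1" in the thread).
constexpr bool eval_if_read_twice(int r) {
    return r > 1;
}

// Illustrative costs (assumptions): SC = 1, RC = 2 for 2*a, RC = 3 for a+b.
static_assert(!eval_if_proposed(1, 1, 3), "a single read never pays for a temporary");
static_assert(eval_if_proposed(2, 1, 2), "2*a read twice: equality, so evaluate");
static_assert(eval_if_proposed(2, 1, 3), "a+b read twice: evaluate");
static_assert(eval_if_read_twice(2) && !eval_if_read_twice(1), "N>1 rule");
```

With these illustrative costs, both rules agree for (2*a) and (a+b) read twice. They would differ only for an expression whose RC is no larger than SC, where the proposed rule keeps the expression lazy and the N>1 rule still evaluates it.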