Re: [eigen] generic unrollers
OK, looking back at the numbers, that makes perfect sense. Let's keep <=.

However, your numbers suggest that we are slightly underestimating the cost of evaluating a scalar-multiple operation. It would be worth checking: perhaps we forgot to take the load into account in that case, and perhaps fixing that would finish giving <= the same performance as "n>1" in all remaining cases (2x2x4 etc).

Cheers,
Benoit

On Wednesday 18 June 2008 00:37:58 Gael Guennebaud wrote:
> Seeing your numbers, it seems the perfect formula should take into
> account the register pressure: without vectorization and only 8
> floating point registers the pressure is much higher and the compiler
> cannot cache the temporary there, so the evaluation becomes considerably
> more expensive. With the new ei_assign_traits we can know in ei_nested
> if the vectorization will occur or not, and the compiler defines some
> architecture tokens (_i386, _x86_64, etc.) such that we can know the
> number of registers... so theoretically we can do very sophisticated
> tests, but this sounds a bit overkill to me :) The formula with "<="
> sounds like a really good compromise to me.
>
> Gael.
>
> On Tue, Jun 17, 2008 at 11:42 PM, Benoît Jacob <jacob@xxxxxxxxxxxxxxx> wrote:
> > Hi,
> >
> > attached are my own measurements. Intel Core 1 (32bit), g++ 4.3.0,
> > compiled with "-O3 -DNDEBUG", that is without vectorization, to make
> > things more fun since you already measured with vectorization.
> >
> > As you can see, here, the overall winner is "<=" but not by a big margin.
> > It looks like we have yet to find the definitive formula...
> >
> > Cheers,
> > Benoit
> >
> > On Monday 16 June 2008 00:07:22 Gael Guennebaud wrote:
> >> >> , for the a+b and 2*a cases I'll
> >> >> write an exhaustive benchmark... If there is no obvious reason to
> >> >> eval a+b for a 2x2 product then it might be better to not eval since
> >> >> this allows the user to perform fine tuning for his specific case
> >> >> that is not possible if we do (abusive?) evaluation.
> >> >
> >> > It's great if you do a benchmark, I don't see any other way of moving
> >> > forward!
> >>
> >> here you go (see attached files). So M,N,K denote the sizes of the
> >> matrix product:
> >>
> >> MxN = MxK * KxN.
> >>
> >> I benchmarked both (a+b)*c and (2*a)*c, with 4 different conditions:
> >> the current one with "<", the same with "<=", never evaluate, and
> >> evaluate if N>1 (i.e. if a coeff is read at least twice). I compiled
> >> with gcc-4.2, -O3 -DNDEBUG using float and vectorization enabled.
> >>
> >> So for this benchmark it is quite clear that, as expected, "<=" works
> >> much better than the current "<". But surprisingly, N>1, which
> >> implies the evaluation of (2*a) with N==2, works even slightly better!
> >> This is probably because the compiler can cache the temporaries in
> >> the registers (I have a 64-bit CPU, so 16 SSE registers). In that case
> >> accounting for the extra loads and stores is wrong. So we could try this
> >> one:
> >>
> >> r*SC <= (r-1) * RC
> >>
> >> which basically means let's forget the extra store and evaluate even
> >> if it does not look really better (equality). In practice this should
> >> give better results (at least for gcc-4.2 with a lot of floating point
> >> registers).
> >>
> >> Gael.
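
For readers following the thread, the rule r*SC <= (r-1)*RC and the "N>1" rule it is compared against can be written down as a minimal C++ sketch. None of this is taken from Eigen's sources: the function names and the unit costs are hypothetical, and the reading of the symbols (r as the number of times a coefficient is read, SC as the cost of loading a scalar from an evaluated temporary, RC as the cost of computing one coefficient of the unevaluated expression) is an assumption, since the thread does not define them explicitly.

```cpp
// Sketch of the nesting heuristic discussed above. All names are hypothetical
// and are not Eigen's API.
//
// Assumed meaning of the symbols in  r*SC <= (r-1)*RC :
//   r  : number of times each coefficient of the nested expression is read
//   SC : cost of loading one scalar from an evaluated temporary
//   RC : cost of computing one coefficient of the unevaluated expression

// Proposed rule: evaluate into a temporary when r*SC <= (r-1)*RC.
constexpr bool evaluate_proposed(int r, int sc, int rc) {
  return r * sc <= (r - 1) * rc;
}

// Alternative rule from the benchmark: evaluate if a coefficient is read
// at least twice.
constexpr bool evaluate_if_read_twice(int r) {
  return r > 1;
}

// Illustrative unit costs (assumptions): a load costs 1, an add or a
// multiply costs 1.
constexpr int kScalarReadCost = 1;
constexpr int kSumCost = 3;     // a+b: two loads and one addition
constexpr int kScaledCost = 2;  // 2*a: one load and one multiplication

// A coefficient read once is never worth a temporary.
static_assert(!evaluate_proposed(1, kScalarReadCost, kScaledCost),
              "single read: keep the expression lazy");
// 2*a read twice hits the equality case, so "<=" evaluates where "<" would not.
static_assert(evaluate_proposed(2, kScalarReadCost, kScaledCost),
              "2*a read twice: evaluate");
static_assert(evaluate_proposed(2, kScalarReadCost, kSumCost),
              "a+b read twice: evaluate");
```

Under these assumed costs the proposed rule gives the same decision as "N>1" whenever RC is at least twice SC, which is consistent with (2*a) at N==2 being the equality case mentioned above. With different cost estimates the two rules can diverge.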