Re: [eigen] generic unrollers

OK, looking back at the numbers, that makes perfect sense. Let's keep <=.
One thing, though: your numbers suggest that we are slightly underestimating
the cost of evaluating a scalar-multiple operation. It would be worth checking;
perhaps we forgot to take the load into account in that case, and fixing that
might give <= the same performance as "n>1" in all remaining cases
(2x2x4 etc.).
Cheers,
Benoit

On Wednesday 18 June 2008 00:37:58 Gael Guennebaud wrote:
> Seeing your numbers, it seems the perfect formula should take register
> pressure into account: without vectorization and with only 8
> floating point registers, the pressure is much higher and the compiler
> cannot cache the temporary there, so the evaluation becomes considerably
> more expensive. With the new ei_assign_traits we can know in ei_nested
> whether vectorization will occur, and the compiler defines
> architecture macros (__i386__, __x86_64__, etc.) from which we can infer
> the number of registers... so theoretically we can do very sophisticated
> tests, but this sounds a bit overkill to me :) The formula with "<="
> sounds like a really good compromise to me.
>
> Gael.
>
> On Tue, Jun 17, 2008 at 11:42 PM, Benoît Jacob <jacob@xxxxxxxxxxxxxxx> wrote:
> > Hi,
> >
> > attached are my own measurements. Intel Core 1 (32bit), g++ 4.3.0,
> > compiled with "-O3 -DNDEBUG", that is without vectorization, to make
> > things more fun since you already measured with vectorization.
> >
> > As you can see, here, the overall winner is "<=" but not by a big margin.
> > It looks like we have yet to find the definitive formula...
> >
> > Cheers,
> > Benoit
> >
> > On Monday 16 June 2008 00:07:22 Gael Guennebaud wrote:
> >> >> , for the a+b and 2*a cases I'll
> >> >> write an exhaustive benchmark... If there is no obvious reason to
> >> >> eval a+b for a 2x2 product then it might be better to not eval since
> >> >> this allows the user to perform fine tuning for his specific case
> >> >> that is not possible if we do (abusive?) evaluation.
> >> >
> >> > It's great if you do a benchmark, I don't see any other way of moving
> >> > forward!
> >>
> >> here you go (see attached files). M, N, K denote the sizes of the
> >> matrix product:
> >>
> >>     MxN = MxK * KxN.
> >>
> >> I benchmarked both (a+b)*c and (2*a)*c with 4 different conditions:
> >> the current one with "<", the same with "<=", never evaluate, and
> >> evaluate if N>1 (i.e. if a coeff is read at least twice). I compiled
> >> with gcc-4.2, -O3 -DNDEBUG, using float and with vectorization enabled.
> >>
> >> So for this benchmark it is quite clear that, as expected, "<=" works
> >> much better than the current "<". But surprisingly, N>1, which
> >> implies the evaluation of (2*a) with N==2, works even slightly better!
> >> This is probably because the compiler can cache the temporaries in
> >> registers (I have a 64-bit CPU, so 16 SSE registers). In that case
> >> accounting for the extra loads and stores is wrong. So we could try this
> >> one:
> >>
> >>     r*SC <= (r-1) * RC
> >>
> >> which basically means: let's forget the extra store and evaluate even
> >> if it does not look strictly better (equality). In practice this should
> >> give better results (at least for gcc-4.2 with a lot of floating point
> >> registers).
> >>
> >> Gael.
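
For readers following the thread, the proposed test r*SC <= (r-1)*RC can be sketched as a compile-time predicate. This is only an illustration, not the actual ei_nested code. The function names and the cost values are hypothetical, and the reading of the symbols is an assumption, since the message does not spell them out: r is taken as the number of times each coefficient of the nested expression is read, SC as the cost of reading a scalar back from an evaluated temporary, and RC as the cost of computing one coefficient of the unevaluated expression.

```cpp
// Sketch only: hypothetical names, not Eigen's implementation.
// Assumed meaning of the symbols (not defined in the thread):
//   r  : number of times each coefficient of the nested expression is read
//   sc : cost of reading one scalar back from an evaluated temporary
//   rc : cost of computing one coefficient of the unevaluated expression

// Proposed rule: evaluate into a temporary when r*SC <= (r-1)*RC.
constexpr bool eval_if_less_or_equal(int r, int sc, int rc) {
    return r * sc <= (r - 1) * rc;
}

// Same rule with a strict comparison, to show what the equality case changes.
constexpr bool eval_if_strictly_less(int r, int sc, int rc) {
    return r * sc < (r - 1) * rc;
}

// Illustrative cost units (assumptions, not measured or taken from Eigen).
constexpr int kScalarReadCost = 1;
constexpr int kSumCoeffCost = 3;     // a+b: two loads and one add
constexpr int kScaledCoeffCost = 2;  // 2*a: one load and one multiply

// A coefficient read only once is never worth evaluating: r*SC > 0 = (r-1)*RC.
static_assert(!eval_if_less_or_equal(1, kScalarReadCost, kSumCoeffCost),
              "r == 1 must not trigger evaluation");

// Read twice: a+b is evaluated (2 <= 3).
static_assert(eval_if_less_or_equal(2, kScalarReadCost, kSumCoeffCost),
              "a+b read twice is evaluated");

// Read twice: 2*a hits the equality case (2 <= 2), so "<=" evaluates it
// while the strict "<" variant does not.
static_assert(eval_if_less_or_equal(2, kScalarReadCost, kScaledCoeffCost),
              "2*a read twice is evaluated with <=");
static_assert(!eval_if_strictly_less(2, kScalarReadCost, kScaledCoeffCost),
              "2*a read twice is not evaluated with <");
```

With these assumed costs the rule evaluates exactly when a coefficient is read more than once, which is consistent with the "N>1" condition behaving well in the benchmark above.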




