Re: [eigen] generic unrollers

To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: Re: [eigen] generic unrollers
From: Benoît Jacob <jacob@xxxxxxxxxxxxxxx>
Date: Mon, 16 Jun 2008 07:11:55 +0200

Hi,

Thanks a lot for the numbers!

I tried to run it, but the BenchTimer.h file was missing. As you say, it would
be very interesting to also test r*SC <= (r-1) * RC, and to test without
vectorization. We obtained different results in the past, so I'd like to be
able to run the test myself (I have a 32-bit core1).

It's very interesting that evaluating as soon as N>1 gives best performance; I 
had noticed, too, that it was often better to evaluate more rather than less; 
this realization is what can make Eigen much faster than other ET-based libs 
as they tend to go blindly for lazy evaluation. If we go for 
eval-as-soon-as-N>1 then we should just split ei_nested into two structs not 
taking any N argument.
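
As a rough illustration of that split, here is a minimal sketch. It is not
Eigen code: Matrix2f, Sum2f, nested_read_once and nested_read_many are
hypothetical stand-ins, and reading "two structs" as "one that keeps the
expression lazy, one that always evaluates" is an assumption.

```cpp
#include <cassert>
#include <type_traits>

// Toy stand-ins for a plain matrix and for the expression a+b (not Eigen types).
struct Matrix2f { float data[4]; };
struct Sum2f {
  const Matrix2f& lhs;
  const Matrix2f& rhs;
  typedef Matrix2f EvalType;  // the type this expression evaluates into
};

// Hypothetical struct for operands whose coefficients are read once:
// keep the expression lazy and nest it by const reference.
template <typename T>
struct nested_read_once { typedef const T& type; };

// Hypothetical struct for operands whose coefficients are read more than once:
// always evaluate into a temporary of the plain type.
template <typename T>
struct nested_read_many { typedef typename T::EvalType type; };

static_assert(std::is_same<nested_read_once<Sum2f>::type, const Sum2f&>::value,
              "read once: stays a lazy reference");
static_assert(std::is_same<nested_read_many<Sum2f>::type, Matrix2f>::value,
              "read many: evaluated into a temporary");
```

With this shape the caller picks the struct that matches how often it reads
each coefficient, so no N parameter has to be threaded through.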

Cheers,
Benoit

On Monday 16 June 2008 00:07:22 Gael Guennebaud wrote:
> >> , for the a+b and 2*a cases I'll
> >> write an exhaustive benchmark... If there is no obvious reason to eval
> >> a+b for a 2x2 product then it might be better to not eval since this
> >> allows the user to perform fine tuning for his specific case that is not
> >> possible if we do (abusive?) evaluation.
> >
> > It's great if you do a benchmark, I don't see any other way of moving
> > forward!
>
> here you go (see attached files). So M,N,K denotes the size of the
> matrix product:
>
> MxN  = MxK  *   KxN.
>
> I benchmarked both (a+b)*c and (2*a)*c, with 4 different conditions:
> the current one with "<", the same with "<=", never evaluate, and
> evaluate if N>1 (i.e. if a coeff is read at least twice). I compiled
> with gcc-4.2, -O3 -DNDEBUG using float and vectorization enabled.
>
> So for this benchmark it is quite clear that, as expected, "<=" works
> much better than the current "<".  But surprisingly, N>1, which
> implies the evaluation of (2*a) with N==2, works even slightly better!
> This is probably because the compiler can cache the temporaries in
> registers (I have a 64-bit CPU, so 16 SSE registers). In that case
> accounting for the extra loads and stores is wrong. So we could try this
> one:
>
>     r*SC <= (r-1) * RC
>
> which basically means let's forget the extra store and evaluate even
> if it does not look really better (equality).  In practice this should
> give better results (at least for gcc-4.2 with a lot of floating point
> registers).
>
> Gael.
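
To make the four benchmarked conditions and the proposed one concrete, here is
a small sketch of them as compile-time predicates. All function names are
hypothetical. Here r is the number of times each coefficient of the nested
expression is read, SC the cost of reading one scalar, and RC the cost of
computing one coefficient of the unevaluated expression. Two things are
assumptions rather than facts from this thread: that the current rule has the
form (r+1)*SC < (r-1)*RC (inferred from "forget the extra store"), and the
unit costs SC=1, RC=2 for 2*a, RC=3 for a+b.

```cpp
#include <cassert>

// Assumed form of the current rule: the +1 charges the extra store.
constexpr bool eval_current(int r, int sc, int rc)   { return (r + 1) * sc <  (r - 1) * rc; }
// Same rule with "<=" instead of "<".
constexpr bool eval_less_eq(int r, int sc, int rc)   { return (r + 1) * sc <= (r - 1) * rc; }
// Never evaluate the nested expression.
constexpr bool eval_never(int, int, int)             { return false; }
// Evaluate as soon as a coefficient is read at least twice (N>1).
constexpr bool eval_if_reused(int r, int, int)       { return r > 1; }
// Proposed rule r*SC <= (r-1)*RC: the extra store is no longer charged.
constexpr bool eval_no_store(int r, int sc, int rc)  { return r * sc <= (r - 1) * rc; }

// Assumed unit costs, for illustration only.
constexpr int SC = 1;      // one scalar read
constexpr int RC_2a = 2;   // 2*a: one read + one multiply
constexpr int RC_sum = 3;  // a+b: two reads + one add

// 2*a read twice: only the N>1 rule and the proposed rule evaluate it.
static_assert(!eval_current(2, SC, RC_2a), "current rule keeps 2*a lazy");
static_assert(!eval_less_eq(2, SC, RC_2a), "<= rule keeps 2*a lazy");
static_assert(eval_if_reused(2, SC, RC_2a), "N>1 rule evaluates 2*a");
static_assert(eval_no_store(2, SC, RC_2a), "proposed rule evaluates on equality");

// a+b read twice: "<" and "<=" disagree because both sides are equal.
static_assert(!eval_current(2, SC, RC_sum), "current rule keeps a+b lazy");
static_assert(eval_less_eq(2, SC, RC_sum), "<= rule evaluates a+b");
```

Under these assumed costs, the proposed rule agrees with N>1 for both
expressions at r=2, and every rule keeps an operand that is read only once
lazy.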
