Re: [eigen] generic unrollers

On Wednesday 11 June 2008 12:31:55 Gael Guennebaud wrote:
> I don't know about the size of the different opcodes, but that's probably
> too much to aim for such a refinement...

OK sure

> What is important to me is that
> EIGEN_UNROLLING_LIMIT remains intuitive.

Sure. While we are at it (sorry for the digression), I still believe that
EIGEN_TUNE_FOR_L2_CACHE_SIZE is not intuitive enough. I'd rather have an
EIGEN_CPU_CACHE_IN_BYTES. The user would set it to the actual size of his
cache. We could default to 1048576 (slightly conservative). Then we could
have a second define, EIGEN_FIT_MATRICES_IN_CPU_CACHE, indicating how many
matrices we want to fit. Default to 4, I guess. You could then compute
whatever constant is useful to you at compile time, even using sqrt() if
needed (or if that's a problem, we can make a rough approximation of sqrt
that will be good enough).
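
To make this concrete, here is a rough sketch of the kind of compile-time
computation I have in mind (everything beyond the two defines is hypothetical,
and the integer sqrt below is just the "rough approximation" mentioned above):

  #ifndef EIGEN_CPU_CACHE_IN_BYTES
  #define EIGEN_CPU_CACHE_IN_BYTES 1048576        // conservative default: 1MB
  #endif
  #ifndef EIGEN_FIT_MATRICES_IN_CPU_CACHE
  #define EIGEN_FIT_MATRICES_IN_CPU_CACHE 4       // how many matrices should fit
  #endif

  // rough compile-time integer sqrt: largest k such that k*k <= N
  template<int N, int K, bool Done> struct ei_isqrt_helper;
  template<int N, int K> struct ei_isqrt_helper<N, K, true>
  { enum { ret = K }; };
  template<int N, int K> struct ei_isqrt_helper<N, K, false>
  { enum { ret = ei_isqrt_helper<N, K+1, ((K+2)*(K+2) > N)>::ret }; };
  template<int N> struct ei_isqrt
  { enum { ret = ei_isqrt_helper<N, 1, (4 > N)>::ret }; };

  // largest n such that EIGEN_FIT_MATRICES_IN_CPU_CACHE square n x n
  // matrices of Scalar fit in the CPU cache
  template<typename Scalar> struct ei_cache_friendly_dim
  {
    enum { bytes_per_matrix = EIGEN_CPU_CACHE_IN_BYTES
                              / EIGEN_FIT_MATRICES_IN_CPU_CACHE,
           ret = ei_isqrt<bytes_per_matrix / int(sizeof(Scalar))>::ret };
  };

(With the defaults this gives ei_cache_friendly_dim<float>::ret == 256,
i.e. four 256x256 float matrices per 1MB of cache.)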


> What I can say is that the number of instructions is indeed divided by 4: I
> checked benchmark.cpp with 4x4 float matrices and no unrolling limits: 84
> lines vs 276 => /3.28. Yes, for the product it's not /4 because there are
> r*c*n/4 additional instructions to move one scalar to the 4 components of a
> packet. If we neglect them we get exactly the /4 factor (well, not exactly,
> because there is the compiler in-between).

Thanks for having done the measurements.
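(Sanity check on those numbers: 276/84 is about 3.3, and if one subtracts the
r*c*n/4 = 16 broadcast instructions from the 84, the ratio becomes
276/68, roughly 4.06, so the /4 factor is indeed there once the broadcasts
are ignored.)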

> So the best would be to compute two cost values: one for the vectorized
> path, and one for the classic path. This would also allow taking into account
> the cost of unaligned reads/stores when they occur, or whatever other
> refinement, but that'd be overkill IMO.

Yes... let's keep it simple!

> > If we evaluate R, the total cost becomes:
> >
> > c*n*RC + r*c*n*(LC+SC+MC+AC)
>
> hm... you forgot the store when you do the evaluation; then it is:
>
> c*n*(RC+SC) + r*c*n*(LC+SC+MC+AC)

Ah, now we are counting the stores. Sure makes more sense since we were 
already counting the loads.

>
> and then you exactly get the current formula:
>
> (1)     (r+1)*SC < (r-1) * RC

OK so that's where it comes from. Thanks.
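
(Writing out the steps for the record, under the assumption that the lazy
cost is r*c*n*(RC+SC+MC+AC), i.e. each of the r*c*n accesses reads the
expression at cost RC instead of LC, and taking LC = SC, evaluation wins when:

  c*n*(RC+SC) + r*c*n*(LC+SC+MC+AC) < r*c*n*(RC+SC+MC+AC)
  c*n*(RC+SC) + r*c*n*LC            < r*c*n*RC          (cancel common terms)
  RC + SC + r*LC                    < r*RC              (divide by c*n)
  RC + (r+1)*SC                     < r*RC              (LC = SC)
  (r+1)*SC                          < (r-1)*RC

which is exactly (1).)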

>
> This formula also neglects the temporary creation overhead, but let's assume
> it's negligible.

Yes, it only exists at all in the dynamic-size case (where it corresponds to
the cost of the malloc) and/or if it causes cache / register allocation
trouble (which is too hard to quantify).

> So now we can wonder whether this theoretical formula works well in practice
> or whether by tweaking it we can get better results, for instance:
>
> (2)    (r+1)*SC <= (r-1) * RC
>
> mimics the previous formula when we set SC=1 and RC>1. Another option is to
> neglect the store:
>
> (3)    r*SC < (r-1) * RC
>
> So if we assume SC=1, then (2) and (3) are exactly the same (oh!, as you
> said later, good)

Yes, we are agreeing here. In fact now that I see the rationale for (1) I 
understand (agree with you) that (2) is best.
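(Indeed: with SC=1, (2) reads r+1 <= (r-1)*RC and (3) reads r < (r-1)*RC,
which are the same condition as long as the costs are integers.)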

> Now, let's summarize what happens with a 2x2 matrix product with respect to
> the lhs expression (SC=MC=AC=1, and cost(sin)=2):
>
>               (1)       (2) or (3)
> a+b           lazy      eval
> 2*a           lazy      lazy
> sin(a)        lazy      eval
>
> and for a 3x3 product:
>
>               (1)       (2) or (3)
> a+b           eval      eval
> 2*a           lazy      eval
> sin(a)        lazy      eval
>
> So definitely (1) does not look so good; in particular it implies setting a
> larger cost for sine (that might make sense)

Yes, I would prefer that, cost(sin)=2 seems far too optimistic!

> , for the a+b and 2*a cases I'll
> write an exhaustive benchmark... If there is no obvious reason to eval a+b
> for a 2x2 product, then it might be better not to eval, since this allows the
> user to perform fine tuning for his specific case, which is not possible if
> we do (abusive?) evaluation.

It's great if you do a benchmark; I don't see any other way of moving forward!

> > OK I understand; this probably makes a lot of sense for expressions that
> > have
> > the Like1DArrayBit (by the way, we should use that bit here to do a
> > single for loop instead of two for loops).
>
> I actually use it for the vectorized assignment, but the problem is that you
> have to unpack the 1D index into 2D coordinates... unless we add _1Dcoeff(int)
> members to all "1DArray-able" expressions...

Ah OK, I see. Perhaps that's too much trouble.
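
(Just so we mean the same thing: the unpacking you describe, assuming
column-major storage and a hypothetical helper, would be something like

  // recover (row, col) from a 1D index, column-major storage assumed;
  // costs one integer division and one multiplication per coefficient
  inline void ei_unpack_index(int index, int rows, int& row, int& col)
  {
    col = index / rows;
    row = index - col * rows;
  }

so each coefficient access pays an integer division, which indeed looks not
worth it compared to keeping the two nested loops.)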

Benoit
