Re: [eigen] generic unrollers |

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]

*To*: eigen@xxxxxxxxxxxxxxxxxxx*Subject*: Re: [eigen] generic unrollers*From*: Benoît Jacob <jacob@xxxxxxxxxxxxxxx>*Date*: Wed, 11 Jun 2008 18:24:22 +0200

On Wednesday 11 June 2008 12:31:55 Gael Guennebaud wrote: > I don't know about the size of the different opcodes but that's probably > too much to reach such a refinement... OK sure > That is important to me is that > EIGEN_UNROLLING_LIMIT remains intuitive. Sure. While we are at it (sorry for the digression), I still believe that the EIGEN_TUNE_FOR_L2_CACHE_SIZE is not intuitive enough. I'd rather have a EIGEN_CPU_CACHE_IN_BYTES. The user would set it to the actual size of his cache. We could default to 1048576 (be slightly conservative). Then we could have a second define EIGEN_FIT_MATRICES_IN_CPU_CACHE indicating how many matrices we want to fit. Default to 4, I guess. You could then compute whatever constant it useful to you at compile time, even using sqrt() if needed (or if that's a problem we can make a rough approximation of sqrt that will be good enough). > What can I say is that the number of instruction is indeed divided by 4: I > checked benchmark.cpp with 4x4 float matrices and no unrolling limits: 84 > lines vs 276 => /3.28. Yes for the product it's not /4 because there are > r*c*n/4 additional instructions to move one scalar to the 4 components of a > packet. If we neglect them we exactly get the /4 factor (well not exactly > because there is the compiler in-between). Thanks for having done the measurements. > So the best would be to compute two cost value: one for the vectorized > path, and one the classic. This could also allows to take into account for > the cost of unaligned read/store if they occur in the instruction or > whatever else refinement, but that'd be overkill IMO. Yes... let's keep it simple! > > If we evaluate R, the total cost becomes: > > > > c*n*RC + r*c*n*(LC+SC+MC+AC) > > hm... you forgot the store when you do the evaluation, then it is: > > c*n*(RC+SC) + r*c*n*(LC+SC+MC+AC) Ah, now we are counting the stores. Sure makes more sense since we were already counting the loads. > > and then you exactly get the current formula: > > (1) (r+1)*SC < (r-1) * RC OK so that's where it comes from. Thanks. > > This formula also neglect the temporary creation overhead, but let's assume > it's negligible. Yes, it only exists at all in dynamic size case (it then corresponds to the cost of the malloc) and/or if it causes cache / register allocation trouble (which is too hard to quantify). > So now we can wonder if this theoretical formula works well in practice or > if by tweaking it we can get better result, for instance: > > (2) (r+1)*SC <= (r-1) * RC > > mimic the previous formula when we set SC=1 and RC>1. Other option is > neglecting the store: > > (3) r*SC < (r-1) * RC > > So if we assumes SC=1, then (2) and (3) are exactly the same (oh !, as you > said later, good) Yes, we are agreeing here. In fact now that I see the rationale for (1) I understand (agree with you) that (2) is best. > Now, let's summarize what happens with a 2x2 matrix product wrt to the lhs > expression (SC=MC=AC=1, and cost(sin)=2): > > (1) (2) or (3) > a+b lazy eval > 2*a lazy lazy > sin(a) lazy eval > > and for a 3x3 product: > > (1) (2) or (3) > a+b eval eval > 2*a lazy eval > sin(a) lazy eval > > So definitely (1) does not look so good, in particular it implies to set a > larger cost to sine (that might make sense) Yes, I would prefer that, cost(sin)=2 seems far too optimistic! > , for the a+b and 2*a cases I'll > write an exhaustive benchmark... If there is no obvious reason to eval a+b > for a 2x2 product then it might be better to not eval since this allows the > user to perform fine tuning for his specific case that is not possible if > we do (abusive?) evaluation. It's great if you do a benchmark, I don't see any other way of moving forward! > > OK I understand; this probably makes a lot of sense for expressions that > > have > > the Like1DArrayBit (by the way, we should use that bit here to do a > > single for loop instead of two for loops). > > I actually use it for the vectorized assignment, but the pb is that you > have to unpack the 1D index to 2D coordinates...unless we add _1Dcoeff(int) > members to all "1DArray-able" expressions... ah ok, i see. Perhaps that's too much trouble. Benoit

**Attachment:
signature.asc**

**Follow-Ups**:**Re: [eigen] generic unrollers***From:*Gael Guennebaud

**References**:**[eigen] generic unrollers***From:*Benoît Jacob

**Re: [eigen] generic unrollers***From:*Benoît Jacob

**Re: [eigen] generic unrollers***From:*Gael Guennebaud

**Messages sorted by:**[ date | thread ]- Prev by Date:
**Re: [eigen] generic unrollers** - Next by Date:
**Re: [eigen] generic unrollers** - Previous by thread:
**Re: [eigen] generic unrollers** - Next by thread:
**Re: [eigen] generic unrollers**

Mail converted by MHonArc 2.6.19+ | http://listengine.tuxfamily.org/ |