Re: [eigen] generic unrollers


*To*: eigen@xxxxxxxxxxxxxxxxxxx
*Subject*: Re: [eigen] generic unrollers
*From*: "Gael Guennebaud" <gael.guennebaud@xxxxxxxxx>
*Date*: Wed, 11 Jun 2008 12:31:55 +0200

On Wed, Jun 11, 2008 at 11:06 AM, Benoît Jacob <jacob@xxxxxxxxxxxxxxx> wrote:


First of all: I realize that I sound like a bastard in English :) at least I know I sound less friendly in English than in French, so just don't worry about that :)

hm... I guess the same applies to me, so no problem :)

On Tuesday 10 June 2008 20:28:38 Gael Guennebaud wrote:

> On Tue, Jun 10, 2008 at 5:33 PM, Benoît Jacob <jacob@xxxxxxxxxxxxxxx> wrote:
> > On Monday 09 June 2008 16:15:33 Gael Guennebaud wrote:
> > > First, the unrolling decision should take into account the
> > > vectorization since for float it reduces the instruction count by a
> > > factor four, so the unrolling cost should be around: (4*4)*2 flops +
> > > 2*16 reads = 96, which is just below 100.
> >
> > Ah OK, good to know.
>
> but we still have to divide by the size of the packet...

Yes I understand! That's funny, this is a case where size and cost differ: size gets divided by 4 theoretically, while the speed is typically multiplied by 2.

I'm still fine with dividing by the size of the packet; at least that's a natural choice. But is the code size really divided by that, or is that just a theoretical maximum that's never reached? For example, vector registers are 128-bit, so the load/store operations on them probably take more bytes, no? And what about the opcodes, probably they all have at least 2 bytes? Though that may also be the case for x87 opcodes, I don't know. So, somehow I was expecting 2x-3x to be more realistic than 4x.

I don't know about the sizes of the different opcodes, but that level of refinement is probably too much to aim for... What is important to me is that EIGEN_UNROLLING_LIMIT remains intuitive.

What I can say is that the number of instructions is indeed divided by roughly 4: I checked benchmark.cpp with 4x4 float matrices and no unrolling limit: 84 lines vs 276, i.e. /3.28. For the product it's not exactly /4 because there are r*c*n/4 additional instructions to broadcast one scalar to the 4 components of a packet. If we neglect those, we get exactly the /4 factor (well, not exactly, because there is the compiler in between).

So the best would be to compute two cost values: one for the vectorized path, and one for the classic path. This would also allow taking into account the cost of unaligned loads/stores if they occur, or whatever other refinement, but that'd be overkill IMO.

> pop pop pop.... you're damn right, the cost of basic operations is already
> 1, I was really sure about the 2.....

OK, after sending the email I told myself that it had to be just that...

> However, that's weird, because that means currently in (a+b)*Matrix(?,2),
> (a+b) should not be evaluated, while in my experiments I got similar
> performance to explicit evaluation (and removing the condition in
> ei_nested was indeed slower). So I'm puzzled.

I'm puzzled too, because now I believe that our current formula is wrong and so should give bad performance in 2x2 matrix products, so I don't understand why it still works. My best guess is that the compiler was able to optimize that code. After all, a 2x2 matrix is only 4 scalars, and the compiler is used to optimizing such computations by introducing temporaries.

Now let's find the correct formula.

Suppose we are doing a product of an expression L with r rows and n cols, by an expression R with n rows and c cols.

Take the following notations:

LC = L::CoeffReadCost

RC = R::CoeffReadCost

SC = NumTraits<Scalar>::ReadCost

MC = NumTraits<Scalar>::MulCost

AC = NumTraits<Scalar>::AddCost

We want to minimize the total cost of evaluating the product, i.e. all of its coefficients.

If we don't evaluate L or R, then the total cost is:

r*c*n*(LC+RC+MC+AC)

Here I am doing a small approximation: there are only n-1 adds per computed coeff in the product, but I round that to n. In the worst case (n=2, all costs equal to 1) this replaces 7*r*c by 8*r*c, which is not too bad!

If we evaluate R, the total cost becomes:

c*n*RC + r*c*n*(LC+SC+MC+AC)

hm... you forgot the store when you do the evaluation; then it is:

c*n*(RC+SC) + r*c*n*(LC+SC+MC+AC)

and then you get exactly the current formula:

(1) (r+1)*SC < (r-1)*RC

This formula also neglects the temporary creation overhead, but let's assume that's negligible.

So now we can wonder whether this theoretical formula works well in practice, or whether by tweaking it we can get better results, for instance:

(2) (r+1)*SC <= (r-1)*RC

which mimics the previous formula when we set SC=1 and RC>1. Another option is neglecting the store:

(3) r*SC < (r-1)*RC

So if we assume SC=1, then (2) and (3) are exactly the same, since over integers r < x is equivalent to r+1 <= x (oh!, as you said later, good).

Now, let's summarize what happens with a 2x2 matrix product wrt the lhs expression (SC=MC=AC=1, and cost(sin)=2):

         (1)     (2) or (3)
a+b      lazy    eval
2*a      lazy    lazy
sin(a)   lazy    eval

and for a 3x3 product:

         (1)     (2) or (3)
a+b      eval    eval
2*a      lazy    eval
sin(a)   lazy    eval

So definitely (1) does not look so good; in particular it implies setting a larger cost for sine (which might make sense). For the a+b and 2*a cases I'll write an exhaustive benchmark... If there is no obvious reason to evaluate a+b for a 2x2 product, then it might be better not to evaluate, since that allows the user to perform fine tuning for his specific case, which is not possible if we do (abusive?) evaluation.

so it still matters whether RC is 2 or >2. The case RC=2 happens with cwise unary ops, e.g. (2*a)*b, etc...

right, assuming we go for (2) or (3), this allows us to intuitively (experimentally?) fine-tune nested evaluation per unary operator, using a cost of 1 or 2, without affecting the unrolling too much (of course, if the number of instructions of the operator is >1, use that!)

> my point was that maybe there exists a very smart way to write a loop
> peeling unroller that would naturally handle all the situations, that's it!
> but I have not thought so much about it....

OK I understand; this probably makes a lot of sense for expressions that have the Like1DArrayBit (by the way, we should use that bit here to do a single for loop instead of two for loops).

I actually use it for the vectorized assignment, but the problem is that you have to unpack the 1D index into 2D coordinates... unless we add _1Dcoeff(int) members to all "1DArray-able" expressions...

One more idea while I am at it.

The current ei_corrected_flags stuff means that a Matrix type is different from its own Eval type. That's bad: probably more code gets instantiated uselessly because of that. I think Matrix should have a Flags parameter defaulting to ei_corrected_flags<...> so that Matrix<...> and its own Eval type are guaranteed to be the same type. What's your opinion?

agree.

Gael.

**Follow-Ups**:
- Re: [eigen] generic unrollers, *From:* Benoît Jacob
- Re: [eigen] generic unrollers, *From:* Benoît Jacob

**References**:
- [eigen] generic unrollers, *From:* Benoît Jacob
- Re: [eigen] generic unrollers, *From:* Benoît Jacob
- Re: [eigen] generic unrollers, *From:* Gael Guennebaud
- Re: [eigen] generic unrollers, *From:* Benoît Jacob


Mail converted by MHonArc 2.6.19+ | http://listengine.tuxfamily.org/