Re: [eigen] generic unrollers





On Wed, Jun 11, 2008 at 11:06 AM, Benoît Jacob <jacob@xxxxxxxxxxxxxxx> wrote:
First of all: I realize that I sound like a bastard in English :) at least I
know I sound less friendly in English than in French, so just don't worry
about that :)

hm... I guess the same applies to me, so no pb :)
 
On Tuesday 10 June 2008 20:28:38 Gael Guennebaud wrote:
> On Tue, Jun 10, 2008 at 5:33 PM, Benoît Jacob <jacob@xxxxxxxxxxxxxxx> wrote:
> > On Monday 09 June 2008 16:15:33 Gael Guennebaud wrote:
> > > First, the unrolling decision should take into account the
> > > vectorization since for float it reduces the instruction count by a
> > > factor of four, so the unrolling cost should be around: (4*4)*2 flops +
> > > 2*16 reads = 96, which is just below 100.
> >
> > Ah OK, good to know.
>
> but we still have to divide by the size of the packet...

Yes I understand! That's funny, this is a case where size and cost differ:
size gets divided by 4 theoretically, while the speed is typically multiplied
by 2.

I'm still fine with dividing by the size of the packet, at least that's a
natural choice. But is the code size really divided by that, or is that just a
theoretical maximum that's never reached? For example, vector registers are
128 bits, so the load/store operations on them probably take more bytes, no?
And what about the opcodes? They probably all have at least 2 bytes, though
that may also be the case for x87 opcodes, I don't know. So somehow I was
expecting 2x-3x to be more realistic than 4x.

I don't know about the size of the different opcodes, but that's probably too much of a refinement to aim for... What is important to me is that EIGEN_UNROLLING_LIMIT remains intuitive.
What I can say is that the number of instructions is indeed divided by about 4: I checked benchmark.cpp with 4x4 float matrices and no unrolling limits: 84 lines vs 276 => /3.28. For the product it's not exactly /4 because there are r*c*n/4 additional instructions to move one scalar into the 4 components of a packet. If we neglect them we get exactly the /4 factor (well, not exactly, because there is the compiler in between).

So the best would be to compute two cost values: one for the vectorized path and one for the classic path. This would also allow taking into account the cost of unaligned reads/stores if they occur, or whatever other refinement, but that'd be overkill IMO.
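
For illustration, here is a rough sketch of what "two cost values" could look like (hypothetical names and a simplified formula, not the actual Eigen code): each expression exposes one compile-time cost per evaluation path, and the unroller compares the cost of the path it is actually going to take against EIGEN_UNROLLING_LIMIT.

#ifndef EIGEN_UNROLLING_LIMIT
#define EIGEN_UNROLLING_LIMIT 100
#endif

// one compile-time cost per evaluation path (sketch only)
template<int CoeffReadCost, int PacketSize>
struct example_costs
{
  enum {
    ScalarCost     = CoeffReadCost,              // classic coeff-by-coeff path
    VectorizedCost = CoeffReadCost / PacketSize  // idealized /PacketSize reduction
  };
};

// unrolling decision for a 4x4 float product: 16 result coeffs, packets of
// 4 floats, roughly 8 ops per coeff (4 muls + 4 adds)
enum {
  UnrollVectorized =
    (example_costs<8, 4>::VectorizedCost * 16) <= EIGEN_UNROLLING_LIMIT  // 32 <= 100
};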

> pop pop pop.... you're damn right, the cost of basic operations is already
> 1, I was really sure about the 2.....

OK, after sending the email I told myself that it had to be just that...

> However,
> that's weird, because that means that currently in (a+b)*Matrix(?,2), (a+b)
> should not be evaluated, while in my experiments I got performance similar
> to explicit evaluation (and removing the condition in ei_nested was
> indeed slower). So I'm puzzled.

I'm puzzled too, because now I believe that our current formula is wrong and
so should give bad performance in 2x2 matrix products, so I don't understand
why it still works. My best guess is that the compiler was able to optimize
that code. After all, a 2x2 matrix is only 4 scalars, and the compiler is used
to optimizing such computations by introducing temporaries.

Now let's find the correct formula.

Suppose we are doing a product of an expression L with r rows and n cols, by
an expression R with n rows and c cols.

Take the following notations:
LC  = L::CoeffReadCost
RC  = R::CoeffReadCost
SC  = NumTraits<Scalar>::ReadCost
MC  = NumTraits<Scalar>::MulCost
AC  = NumTraits<Scalar>::AddCost

We want to minimize the total cost of evaluating the product, i.e. all of its
coefficients.

If we don't evaluate L or R, then the total cost is:
r*c*n*(LC+RC+MC+AC) 


here I am doing a small approximation: there are only n-1 Adds per computed
coeff in the product, but I round that to n. In the worst case (n=2, all
costs equal to 1) this replaces 7*r*c by 8*r*c which is not too bad!

If we evaluate R, the total cost becomes:

c*n*RC + r*c*n*(LC+SC+MC+AC)

hm... you forgot the store when you do the evaluation, then it is:

c*n*(RC+SC) + r*c*n*(LC+SC+MC+AC)

and then you exactly get the current formula:

(1)     (r+1)*SC < (r-1) * RC
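
Spelling out the algebra: evaluating R wins when

    c*n*(RC+SC) + r*c*n*(LC+SC+MC+AC)  <  r*c*n*(LC+RC+MC+AC)

and dividing both sides by c*n and cancelling the common LC+MC+AC term gives

    RC + SC + r*SC  <  r*RC
    (r+1)*SC        <  (r-1)*RC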

This formula also neglects the temporary creation overhead, but let's assume it's negligible.
 
So now we can wonder whether this theoretical formula works well in practice or whether by tweaking it we can get better results, for instance:

(2)    (r+1)*SC <= (r-1) * RC

which mimics the previous formula when we set SC=1 and RC>1. Another option is to neglect the store:

(3)    r*SC < (r-1) * RC

So if we assume SC=1, then (2) and (3) are exactly the same (oh!, as you said later, good).
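
To see why: with integer costs and SC=1, (3) reads r < (r-1)*RC, and for integers r < (r-1)*RC holds exactly when r+1 <= (r-1)*RC, which is (2).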

Now, let's summarize what happens with a 2x2 matrix product w.r.t. the lhs expression (SC=MC=AC=1, and cost(sin)=2):

lhs          (1)       (2) or (3)
a+b          lazy      eval
2*a          lazy      lazy
sin(a)       lazy      eval

and for a 3x3 product:

lhs          (1)       (2) or (3)
a+b          eval      eval
2*a          lazy      eval
sin(a)       lazy      eval
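
To make the tables concrete, assuming a and b are plain matrices (coefficient read cost 1): for a+b, RC = 1+1+AC = 3, so in the 2x2 case (r=2) formula (1) reads 3 < 3, which is false (lazy), while (2) reads 3 <= 3 (eval); in the 3x3 case (r=3) both read 4 < 6 resp. 4 <= 6 (eval). For 2*a, RC = MC+1 = 2: with r=2 both 3 < 2 and 3 <= 2 fail (lazy), and with r=3 we get 4 < 4 (lazy) versus 4 <= 4 (eval).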

So definitely (1) does not look so good; in particular it implies setting a larger cost for sine (which might make sense). For the a+b and 2*a cases I'll write an exhaustive benchmark... If there is no obvious reason to evaluate a+b for a 2x2 product, then it might be better not to evaluate, since that lets the user do fine tuning for his specific case, which is not possible if we do (abusive?) evaluation.

so it still matters whether RC is 2 or >2. The case RC=2 happens with
cwise unary ops, e.g. (2*a)*b, etc...

right, assuming we go for (2) or (3), this lets us intuitively (experimentally?) fine-tune nested evaluation per unary operator using a cost of 1 or 2 without affecting the unrolling too much (of course, if the number of instructions of the operator is > 1, use that!)
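
For illustration, a schematic sketch of what this per-operator tuning amounts to (hypothetical names, not the actual Eigen traits):

struct scalar_multiple_op { enum { Cost = 1 }; };  // 2*a: cheap, stays lazy
struct scalar_sin_op      { enum { Cost = 2 }; };  // sin(a): bumped to 2 to trigger evaluation

// CoeffReadCost of a unary expression = read cost of the nested expression
// + cost of applying the functor
template<typename Op, int NestedReadCost>
struct unary_read_cost { enum { value = NestedReadCost + Op::Cost }; };

// e.g. reading one coefficient of sin(a), with a a plain matrix (read cost 1):
enum { SinReadCost = unary_read_cost<scalar_sin_op, 1>::value };  // == 3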

> my point was that maybe there exists a very smart way to write a loop
> peeling unroller that would naturally handle all the situations, that's it!
> but I have not thought so much about it....

OK I understand; this probably makes a lot of sense for expressions that have
the Like1DArrayBit (by the way, we should use that bit here to do a single
for loop instead of two for loops).

I actually use it for the vectorized assignment, but the problem is that you have to unpack the 1D index into 2D coordinates... unless we add _1Dcoeff(int) members to all "1DArray-able" expressions...
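
For the record, a simplified illustration (not the actual assignment code) of the div/mod overhead that a hypothetical _1Dcoeff(int) member would avoid:

// single 1D loop over all coefficients of a column-major destination
template<typename Dst, typename Src>
void assign_1d_loop(Dst& dst, const Src& src)
{
  const int rows = dst.rows();
  const int size = dst.rows() * dst.cols();
  for (int index = 0; index < size; ++index)
  {
    const int col = index / rows;   // unpack the linear index...
    const int row = index % rows;   // ...into 2D coordinates
    dst.coeffRef(row, col) = src.coeff(row, col);
  }
}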


One more idea while I'm at it.

The current ei_corrected_flags stuff means that a Matrix type is different
from its own Eval type. That's bad: probably more code gets instantiated
uselessly because of that. I think Matrix should have a Flags parameter
defaulting to ei_corrected_flags<...> so that Matrix<..> and its own Eval
type are guaranteed to be the same type. What's your opinion?

agree.
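
A rough sketch of the idea (the parameter lists below are illustrative only, not the real signatures):

// illustrative stand-in for the real ei_corrected_flags helper
template<typename Scalar, int Rows, int Cols>
struct ei_corrected_flags_sketch { enum { ret = 0 /* the corrected default flags */ }; };

// if the flags are a template parameter with the corrected flags as default,
// then the Eval type of Matrix<...> is Matrix<...> itself, so no extra type
// gets instantiated
template<typename Scalar, int Rows, int Cols,
         unsigned int Flags = ei_corrected_flags_sketch<Scalar, Rows, Cols>::ret>
class Matrix
{
public:
  typedef Matrix<Scalar, Rows, Cols, Flags> Eval;  // same type as *this
  // ...
};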

Gael.


