Re: [eigen] generic unrollers

[ Thread Index | Date Index | More Archives ]

First of all: I realize that I sound like a bastard in English :) at least I 
know I sound less friendly in English than in French, so just don't worry 
about that :)

On Tuesday 10 June 2008 20:28:38 Gael Guennebaud wrote:
> On Tue, Jun 10, 2008 at 5:33 PM, Benoît Jacob <jacob@xxxxxxxxxxxxxxx> wrote:
> > On Monday 09 June 2008 16:15:33 Gael Guennebaud wrote:
> > > First, the unrolling decision should take into account the
> > > vectorization since for float it reduces the instruction count by a
> > > factor four, so the unrolling cost should be around: (4*4)*2 flops +
> > > 2*16 reads = 96 that is just bellow 100.
> >
> > Ah OK, good to know.
> but we still have to divide by the size of the packet...

Yes I understand! That's funny, this is a case where size and cost differ: 
size gets divided by 4 theoretically, while the speed is typically multiplied 
by 2.

I'm still fine for dividing by the size of the packet, at least that's a 
natural choice. But is the code size really divided by that or is just a 
theoretical maximum that's never reached? For example vectorized registers 
are 128 bit so the load/store operations on them probably take more bytes, 
no? And what about the opcodes, probably they all have at least 2 bytes? 
Though, that may also be the case of x87 opcodes, I don't know. So, somehow I 
was expecting 2x-3x to be more realistic than 4x. 

> pop pop pop.... you're damn right, the cost of basic operations is already
> 1, I was really sure about the 2.....

OK, after sending the email I told me that it had to be just that...

> However,
> that's weird, because that means currently in (a+b)*Matrix(?,2), (a+b)
> should not be evaluated while in my experiments I got similar performance
> than with explicit evaluation (and removing the condition in ei_nested was
> indeed slower). So I'm puzzled.

I'm puzzled too, because now I believe that our current formula is wrong and 
so should give bad performance in 2x2 matrix products, so I don't understand 
why it still works. My best guess is that the compiler was able to optimize 
that code. After all a 2x2 matrix is only 4 scalars, the compiler is used to 
optimize such computations by introducing temporaries.

Now let's find the correct formula.

Suppose we are doing a product of an expression L with r rows and n cols, by 
an expression R with n rows and c cols.

Take the following notations:
LC  = L::CoeffReadCost
RC  = R::CoeffReadCost
SC  = NumTraits<Scalar>::ReadCost
MC  = NumTraits<Scalar>::MulCost
AC  = NumTraits<Scalar>::AddCost

We want to minimize the total cost of evaluating the product: all coefficients 
of it.

If we don't evaluate L or R, then the total cost is:

here I am doing a small approximation: there are only n-1 Adds per computed 
coeff in the product, but I round that to n. In the worst case (n=2, all 
costs equal to 1) this replaces 7*r*c by 8*r*c which is not too bad!

If we evaluate R, the total cost becomes:

c*n*RC + r*c*n*(LC+SC+MC+AC)

So this is smaller than the total-cost-without-evaluating-R if and only if:

RC + r*SC < r*RC

in other words:

r*SC < (r-1) * RC

In the case of a 2x2 matrix product, this amounts to 2*SC<RC, which is true in 
the case of (x+y)*z because then RC=3 (one add, two reads). So this formula 
should work fine for your example.

By comparison, our current formula in ei_nested is:

(r+1) * SC < (r-1) * RC

I don't quite remember what was the rationale for replacing r by r+1; I seem 
to remember that initially I did r so I don't remember why we replaced it by 

So, revert to r?

> So I'll try again, and see what's the best: the current "<" or the
> proposed: "<="

In this particular example replacing < by <= has the same effect as replacing 
r+1 by r ; in the general case I believe that replacing r+1 by r is the more 
natural thing to do.

> Anyway, at least my discussion help to see

Now I think this discussion helped to find a real bug in ei_nested so that's 
very useful!

> the effect of the choice of the 
> cost value with respect to the evaluation of nested arguments. basically,
> setting the cost to something greater than the cost of a basic operation
> will have the same effect on nesting whatever the value, so no need to
> bother.

Indeed, the exact RC almost doesn't matter, it is almost enough to know 
whether it's 1 or >1.

But in the case r=2 it's a little different, with the corrected formula,
r*SC < (r-1) * RC

for r=2 it becomes

2*SC < RC

so typically,

2 < RC

so it still matters whether or not RC is 2 or >2. The case RC=2 happens with 
cwise unary ops e.g. (2*a)*b, etc...

With the current ei_nested formula things were different as the barrier was 

> my point was that maybe there exist a very smart way to right a loop
> peeling unroller that would naturally handle all the situation, that's it !
> but I have not though so much about it....

OK I understand; this probably makes a lot of sense for expressions that have 
the Like1DArrayBit (by the way, we should use that bit here to do a single 
for loop instead of two for loops).

For general matrix expressions we have to do two nested traversals so i think 
it's natural to look first at unrolling the inner one, then the outer one. 
Yes, if unrolling is not done then one can then consider peeling.

> well, the additional template parameter will be a functor, so one more
> indirection during the compilation... but we'll see, given the fact some
> unrollers already take a functor as arguments.... but don't interpret it
> wrong, I'm not at all against this idea :)

OK I'll play with that and tell you.

One more idea while I am there.

The current ei_corrected_flags stuff means that a Matrix type is different 
from its own Eval type. That's bad, probably more code gets instantiated 
uselessly because of that. I think Matrix should have Flags parameters 
defaulting to ei_corrected_flags<...> so that Matrix<..> and its own Eval 
type are guaranteed to be the same type. What's your opinion?


Attachment: signature.asc
Description: This is a digitally signed message part.

Mail converted by MHonArc 2.6.19+