Re: [eigen] generic unrollers


On Monday 09 June 2008 16:15:33 Gael Guennebaud wrote:
> First, the unrolling decision should take into account the vectorization
> since for float it reduces the instruction count by a factor four, so the
> unrolling cost should be around: (4*4)*2 flops + 2*16 reads = 96 that is
> just below 100.

Ah OK, good to know.

> Secondly, this also makes me think again that maybe it could be enough
> to approximately count the number of instructions rather than a pseudo
> evaluation cost. Indeed, this makes the unrolling decision more accurate 
> (unless I'm missing something) and simpler to control intuitively.

In fact it is a dilemma. For a perfect solution we would need to know both the 
code size and the computation cost. Indeed, unrolling is trading code size 
for a speed improvement; and the relative speed improvement is bigger if the 
computation cost is smaller.

Since in NumTraits we basically set all costs to 1, it means that we 
implicitly say that code-size is proportional to cost anyway. It is still 
useful IMO to call that Cost, and assign a higher value to operations such as 
sqrt().

> On the other 
> hand, the cost model is also used to decide on the automatic evaluation of
> the arguments via ei_nested<>. In the following I'll try to prove that the
> same behavior for ei_nested can be achieved using a simplified model where
> the cost is actually the number of instructions.
> ei_nested uses the reading cost of a nested coeff C and the number of
> evaluations N of the same coefficients to make the decision to evaluate or
> not the argument. The condition looks like this:
>    (N+1) * R < (N-1) * C
> where R is the cost to read or write a scalar. Let's assume R=1 and the
> cost of an arithmetic operation is >= 2, then it is obvious that if the
> nested expression contains at least one arithmetic operation and N>=3, the
> expression will always be evaluated.
> The tricky case is when N=2, that leads to the condition:
>     3*R < C
> This means that any binary expression will always be evaluated (for N>=2)
> since its cost is 2R + cost_of_the_op, and cost_of_the_op is always >= 2.
> The cost of a unary operation is only R + cost_of_the_op, and so such an
> expression will be evaluated if and only if the operation has a cost
> greater than that of a basic arithmetic operation.
> After a few experiments it appears that this empirical model works quite
> well, for instance
>    (m1+m2)*Matrix(?,2)
> will lead to an evaluation, which is good, while:
>    (2*m1)*Matrix(?,2)
> won't, which appears to be the right choice.
> This behavior can be reproduced by replacing the condition by:
>    (N+1)  <= (N-1) * C'
> where C' is the number of instructions needed to evaluate the coefficient:
> the differences are that the cost of a basic arithmetic operation is 1
> instead of 2

Here I don't understand: looking at NumTraits.h you already set the cost of 
basic operations to 1, not 2.

Anyway I don't understand how the formula that you propose is simpler than 
what we are currently doing. Removing the multiplication by R is a tiny 
simplification, and C' will be computed recursively in much the same way as C 
currently is... so I don't understand what the benefit of this change would be.
The benefit of the current situation is that the formula
    (N+1) * R < (N-1) * C
has a clear meaning: "is it more expensive to evaluate once and then read 
the result N times, or to read the expression itself N times?"

> and "<" is replaced by "<=". This simplified model also 
> assumes that a complex operation will be associated with more than a single
> instruction (like sin, cos, pow).

So, C' is really a measure of cost, not of code size -- so C' is really the 
same thing as C currently is... or am I getting something wrong? Indeed since 
the costs of arithmetic ops in NumTraits are set to 1 currently, for 
expressions with standard arithmetic ops the current C is just the flops 
count. Where code-size and cost differ is with complex ops such as sqrt() 
where code-size=1 and cost=big, currently C is big, and if I understand you 
well you are proposing C'=big too, so that's the same thing!

> We still have to consider the case of std::complex but I don't expect any
> trouble for them.

Indeed: the costs currently specified in NumTraits explicitly compute the 
costs of complex in terms of the costs of the real type. This makes sense 
both code-size-wise and cost-wise.

> Third remark: instead of the three options:
>  1 - full unrolling
>  2 - inner loop unrolling
>  3 - no unrolling (at least no explicit unrolling)
> what about trying to support loop peeling (partial unrolling of dynamic
> loops), which ultimately could lead to full unrolling. Of course this is
> not so easy, but the 3-options solution is not very simple either (it must
> take into account the largest/smallest dimensions, etc.) and is quite a bit
> more limited.
> IMO this is worth trying.

Sure, but I regard loop peeling as a separate problem from inner loop 
unrolling. I'd first go for choosing one of the three options 1,2,3 you list 
above, and then, for those loops that are not yet explicitly unrolled (in 
cases 2,3, which is always the case for dynamic sizes), consider loop peeling.

> > Since this is making the unrollers even more complicated, and since we
> > now have several different places in eigen with such unrollers (Assign.h,
> > Redux.h, Part.h, Visitor.h) I think it's now more than time to move to
> > generic unrollers. These would take as template argument a struct
> > providing the method to apply to each coeff, etc.
> >
> > Is it OK if I do it?
> would be awesome! Though I'm a bit worried about the impact on compilation
> time... But it's definitely worth trying!

My rationale is that the unrollers already take many template parameters, so 
one more parameter is not too bad! But yes the only way to tell is to 
actually measure it.


