Re: [eigen] generic unrollers

On Mon, Jun 9, 2008 at 8:50 AM, Benoît Jacob <jacob@xxxxxxxxxxxxxxx> wrote:

I measured the impact of reducing the unrolling limit from 400 to 100, and it
does have a real cost as 4x4 matrix multiplication is no longer unrolled.
Yet, I admit that 400 was too much. The solution is to -- at last -- do
partial loop unrolling i.e. unroll the inner direction only.

ah that's right 100 is too low for a 4x4 matrix product :(
Anyway, I have a couple of remarks about this issue:

First, the unrolling decision should take into account the vectorization since for float it reduces the instruction count by a factor four, so the unrolling cost should be around: (4*4)*2 flops + 2*16 reads = 96 that is just bellow 100.

Secondly, this also makes me thinking again that maybe it could be enough to approximately count the number of instructions rather than a pseudo evaluation cost. Indeed, this makes the unrolling decision more accurate (unless I miss something) and simpler to control intuitively. On the other hand, the cost model is also used to decide of the automatic evaluation of the arguments via ei_nested<>. In the following I'll try to prove that the same behavior for ei_nested can be achieved using a simplified model where the cost is actually the number of instructions.

ei_nested uses the reading cost of a nested coeff C and the number of evaluations N of the same coefficients to make the decision to evaluate or not the argument. The condition looks like this:
   (N+1) * R < (N-1) * C
where R is the cost to read or write a scalar. Let's assume R=1 and the cost of an arithmetic operation is >= 2, then it is obvious that if the nested _expression_ contains at least one arithmetic operation and N>=3 the _expression_ will we always evaluated.
The tricky case is when N=2, that leads to the condition:
    3*R < C
This means that any binary _expression_ will be always evaluated (for N>=2) since its cost is 2R + cost_of_ the_op and cost_of_the_op is always >= 2.
The cost of a unary operation is only R + cost_of_ the_op and so such _expression_ will be evaluated if and only if the operation has a cost greater than a basic arithmetic operation.

After a few experiments it appears that this empirical model works quite well, for instance
   (m1+m2)*Matrix(?,2)
will leads to an evaluation that is good, while:
   (2*m1)*Matrix(?,2)
won't, that appears to be the good choice.

This behavior can be reproduced by replacing the condition by:
   (N+1) <= (N-1) * C'
where C' is the number of instructions needed to evaluate the coefficient: the differences are that the cost of a basic arithmetic operation is 1 instead of 2 and "<" is replaced by "<=". This simplified model also assumes that a complex operation will be associated more than a single instruction (like sin, cos, pow). To summarize, what really matters is that a basic arithmetic operation has the cost 1 (single instruction) while other has a cost > 1, where the value plays only a role to the unrolling process.

We still have to consider the case of std::complex but I don't expect any trouble for them.

Third remark: instead of the three options:
1 - full unrolling
2 - inner loop unrolling
3 - no unrolling (at least no explicit unrolling)
what about trying to support loop peeling (partial unrolling of dynamic loops) which ultimately could leads to a full unrolling. Of course this is not so easy, but the 3 options solution is not very simple too (must take into account the largest/smallest dimensions, etc.) and quite more limited. IMO this is worth trying.

Since this is making the unrollers even more complicated, and since we now
have several different places in eigen with such unrollers (Assign.h,
Redux.h, Part.h, Visitor.h) I think it's now more than time to move to
generic unrollers. These would take as template argument a struct providing
the method to apply to each coeff, etc.

OK that I do it?

would be awesome ! though I'm a bit worried about the impact on compilation time.... But it's definitely worth trying !

Gael.