|Re: [eigen] generic unrollers|
[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]
First of all: I realize that I sound like a bastard in English :) at least I know I sound less friendly in English than in French, so just don't worry about that :) On Tuesday 10 June 2008 20:28:38 Gael Guennebaud wrote: > On Tue, Jun 10, 2008 at 5:33 PM, Benoît Jacob <jacob@xxxxxxxxxxxxxxx> wrote: > > On Monday 09 June 2008 16:15:33 Gael Guennebaud wrote: > > > First, the unrolling decision should take into account the > > > vectorization since for float it reduces the instruction count by a > > > factor four, so the unrolling cost should be around: (4*4)*2 flops + > > > 2*16 reads = 96 that is just bellow 100. > > > > Ah OK, good to know. > > but we still have to divide by the size of the packet... Yes I understand! That's funny, this is a case where size and cost differ: size gets divided by 4 theoretically, while the speed is typically multiplied by 2. I'm still fine for dividing by the size of the packet, at least that's a natural choice. But is the code size really divided by that or is just a theoretical maximum that's never reached? For example vectorized registers are 128 bit so the load/store operations on them probably take more bytes, no? And what about the opcodes, probably they all have at least 2 bytes? Though, that may also be the case of x87 opcodes, I don't know. So, somehow I was expecting 2x-3x to be more realistic than 4x. > pop pop pop.... you're damn right, the cost of basic operations is already > 1, I was really sure about the 2..... OK, after sending the email I told me that it had to be just that... > However, > that's weird, because that means currently in (a+b)*Matrix(?,2), (a+b) > should not be evaluated while in my experiments I got similar performance > than with explicit evaluation (and removing the condition in ei_nested was > indeed slower). So I'm puzzled. I'm puzzled too, because now I believe that our current formula is wrong and so should give bad performance in 2x2 matrix products, so I don't understand why it still works. My best guess is that the compiler was able to optimize that code. After all a 2x2 matrix is only 4 scalars, the compiler is used to optimize such computations by introducing temporaries. Now let's find the correct formula. Suppose we are doing a product of an expression L with r rows and n cols, by an expression R with n rows and c cols. Take the following notations: LC = L::CoeffReadCost RC = R::CoeffReadCost SC = NumTraits<Scalar>::ReadCost MC = NumTraits<Scalar>::MulCost AC = NumTraits<Scalar>::AddCost We want to minimize the total cost of evaluating the product: all coefficients of it. If we don't evaluate L or R, then the total cost is: r*c*n*(LC+RC+MC+AC) here I am doing a small approximation: there are only n-1 Adds per computed coeff in the product, but I round that to n. In the worst case (n=2, all costs equal to 1) this replaces 7*r*c by 8*r*c which is not too bad! If we evaluate R, the total cost becomes: c*n*RC + r*c*n*(LC+SC+MC+AC) So this is smaller than the total-cost-without-evaluating-R if and only if: RC + r*SC < r*RC in other words: r*SC < (r-1) * RC In the case of a 2x2 matrix product, this amounts to 2*SC<RC, which is true in the case of (x+y)*z because then RC=3 (one add, two reads). So this formula should work fine for your example. By comparison, our current formula in ei_nested is: (r+1) * SC < (r-1) * RC I don't quite remember what was the rationale for replacing r by r+1; I seem to remember that initially I did r so I don't remember why we replaced it by r+1. So, revert to r? > > So I'll try again, and see what's the best: the current "<" or the > proposed: "<=" In this particular example replacing < by <= has the same effect as replacing r+1 by r ; in the general case I believe that replacing r+1 by r is the more natural thing to do. > > Anyway, at least my discussion help to see Now I think this discussion helped to find a real bug in ei_nested so that's very useful! > the effect of the choice of the > cost value with respect to the evaluation of nested arguments. basically, > setting the cost to something greater than the cost of a basic operation > will have the same effect on nesting whatever the value, so no need to > bother. Indeed, the exact RC almost doesn't matter, it is almost enough to know whether it's 1 or >1. But in the case r=2 it's a little different, with the corrected formula, r*SC < (r-1) * RC for r=2 it becomes 2*SC < RC so typically, 2 < RC so it still matters whether or not RC is 2 or >2. The case RC=2 happens with cwise unary ops e.g. (2*a)*b, etc... With the current ei_nested formula things were different as the barrier was 3... > my point was that maybe there exist a very smart way to right a loop > peeling unroller that would naturally handle all the situation, that's it ! > but I have not though so much about it.... OK I understand; this probably makes a lot of sense for expressions that have the Like1DArrayBit (by the way, we should use that bit here to do a single for loop instead of two for loops). For general matrix expressions we have to do two nested traversals so i think it's natural to look first at unrolling the inner one, then the outer one. Yes, if unrolling is not done then one can then consider peeling. > well, the additional template parameter will be a functor, so one more > indirection during the compilation... but we'll see, given the fact some > unrollers already take a functor as arguments.... but don't interpret it > wrong, I'm not at all against this idea :) OK I'll play with that and tell you. One more idea while I am there. The current ei_corrected_flags stuff means that a Matrix type is different from its own Eval type. That's bad, probably more code gets instantiated uselessly because of that. I think Matrix should have Flags parameters defaulting to ei_corrected_flags<...> so that Matrix<..> and its own Eval type are guaranteed to be the same type. What's your opinion? Benoit
Description: This is a digitally signed message part.
|Mail converted by MHonArc 2.6.19+||http://listengine.tuxfamily.org/|