On Fri, Feb 21, 2014 at 5:48 PM, Christoph Hertzberg <chtz@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
> very nice work!
>
> I did not walk through the source so far, so (some of) my
> questions/suggestions/comments might actually be trivial.
>
>
> About compound assignment operators (p14):
> Will this be extendable/specializable for FMA-Support? This could save
> operations for array expressions such as A+=B*C; (and we already have a
> pmadd packet function).

This could be done by specializing:

  Assignment<Dst,CwiseBinaryOp<scalar_prod_op,B,C>, add_assign_op, Dense2Dense>

as well as:

  evaluator<CwiseBinaryOp<scalar_sum_op,A,CwiseBinaryOp<scalar_prod_op,B,C>>>

for A+B*C.

> Regarding Temporaries (all pseudo-codes should be extendable to
> packet-code):
>
> p12: [ (a+b)*c ]
> Assume a, b are JxN and c is NxK with very large N but small J and K,
> especially min(J,K)==1. I think storing (a+b) to a temporary could/should
> theoretically be avoided (pseudo-code):
>>
>> for(int i=0; i<N; ++i) {
>>   // each element of (a+b) and c is accessed only once,
>>   // at the cost of accessing Result multiple times
>>   Result += (a+b).col(i) * c.row(i);
>> }

This is already the case if c is a vector. If a and b are row vectors but c
is a matrix, then a+b is evaluated into a temporary. In theory this temporary
is not needed, but we have to find a tradeoff between the genericity of the
product kernels and the number of their instantiations: product kernels are
heavyweight. We could evaluate (a+b) in small chunks, though...

> p20: [ (A+B).normalized() ]
> Theoretically, the norm could be accumulated while the expression is
> evaluated into the temporary, saving one walk through the vector.
> Furthermore, for this example, usually the result vector could be used to
> store the temporary (saving cache-accesses).
> Pseudo-code:
>>
>> norm2 = 0;
>> for(int i=0; i<N; ++i) {
>>   temp = A(i)+B(i);
>>   res(i) = temp;
>>   norm2 += temp*temp;
>> }
>> normInv = 1/sqrt(norm2);
>> for(int i=0; i<N; ++i) {
>>   res(i) *= normInv;
>> }

This is typically something that could be possible with the new generic
"kernels", but I don't know if that's worth the effort.

> Finally, regarding vectorization of (partially) unaligned objects:
> I think it would be nice, if we could somehow determine the coeff-read/write
> cost and the packet-read/write cost of the src and dst and decide at
> compile-time which path is more efficient (c.f. bug 256). I'm aware that's a
> bit wishful thinking, since for many expressions the costs are hard to
> determine and they also strongly depend on the target architecture (if we
> trust Agner's instruction timings [1] there is no difference between aligned
> and unaligned loads/stores anymore on IvyBridge/SandyBridge/Haswell,
> compared to a factor 2 or 4 difference on Wolfdale/Nehalem).

Yes, that's a planned feature, and actually a needed one once we have to
determine which packet size is best when multiple sizes are possible.

> Nothing of the above is a mandatory optimization at the moment, but it would
> be nice if all/most of them will be implementable without major
> refactorings.

Sure!

gael
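
For reference, the fused (A+B).normalized() loop from the quoted pseudo-code
could look roughly like this in plain C++. This is only a sketch under
assumptions: it uses std::vector<double> rather than Eigen types, the function
name normalized_sum is made up for illustration and is not an Eigen API, and
it assumes both inputs have the same size and that their sum is not the zero
vector.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Eigen): returns (a + b) / ||a + b||.
// The sum is written straight into the result while the squared norm is
// accumulated, so a and b are each read once and no extra temporary is
// allocated.
std::vector<double> normalized_sum(const std::vector<double>& a,
                                   const std::vector<double>& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<double> res(n);
    double norm2 = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double t = a[i] + b[i];
        res[i] = t;        // the result vector doubles as the temporary
        norm2 += t * t;    // norm accumulated during the same pass
    }
    // Assumption: norm2 > 0, i.e. a + b is not the zero vector.
    const double norm_inv = 1.0 / std::sqrt(norm2);
    for (double& x : res) {
        x *= norm_inv;     // second pass only touches the result
    }
    return res;
}
```

Calling normalized_sum({1.0, 2.0}, {2.0, 2.0}) normalizes the vector (3, 4).
Compared to evaluating A+B into a temporary, computing its norm and then
scaling, this saves one full walk through the data.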

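A rough plain C++ sketch of the temporary-free (a+b)*c evaluation discussed
for p12, written as an accumulation over the long dimension N. The names
(sum_times, the flat column-major std::vector<double> storage) are
assumptions chosen for illustration and not how Eigen's product kernels are
implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Eigen). Matrices are stored column-major
// in flat vectors: a and b are J x N, c is N x K, the result is J x K.
// Computes (a + b) * c without materializing a + b.
std::vector<double> sum_times(const std::vector<double>& a,
                              const std::vector<double>& b,
                              const std::vector<double>& c,
                              std::size_t J, std::size_t N, std::size_t K) {
    assert(a.size() == J * N && b.size() == J * N && c.size() == N * K);
    std::vector<double> result(J * K, 0.0);
    for (std::size_t i = 0; i < N; ++i) {          // walk the long dimension
        for (std::size_t j = 0; j < J; ++j) {
            // (a+b)(j,i) is formed exactly once and never stored.
            const double s = a[j + i * J] + b[j + i * J];
            for (std::size_t k = 0; k < K; ++k) {
                // c(i,k) is read J times, i.e. once when J == 1.
                result[j + k * J] += s * c[i + k * N];
            }
        }
    }
    return result;
}
```

With J == 1 this matches the quoted pseudo-code: every element of a, b and c
is read once, while the small result is updated N times. For larger J and K
the repeated reads of c are the kind of cost a tuned product kernel is
designed to avoid, which is where the tradeoff mentioned in the reply comes
in.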
