Hi,
very nice work!

I have not walked through the source yet, so (some of) my
questions/suggestions/comments might actually be trivial.

About compound assignment operators (p14):

Will this be extensible/specializable for FMA support? This could save
operations for array expressions such as A += B*C; (and we already have a
pmadd packet function).
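
To make the A += B*C case concrete, here is a rough scalar sketch in plain C++ (std::vector instead of Eigen types; the function name is made up and is not Eigen API). It only illustrates the fused form that a specialized compound assignment could map to, with std::fma standing in for a packet-level pmadd:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Eigen: computes A += B*C coefficient-wise.
// std::fma evaluates b*c + a with a single rounding; whether it becomes a
// single hardware instruction depends on the target and compiler flags.
void fused_multiply_add_assign(std::vector<double>& A,
                               const std::vector<double>& B,
                               const std::vector<double>& C) {
  assert(A.size() == B.size() && A.size() == C.size());
  for (std::size_t i = 0; i < A.size(); ++i) {
    A[i] = std::fma(B[i], C[i], A[i]);  // one fused op instead of mul + add
  }
}
```

Without hardware FMA, std::fma may fall back to a slow library routine, so a real implementation would presumably select this path only when the target supports it.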

Regarding temporaries (all pseudo-code should be extensible to
packet code):

p12: [ (a+b)*c ]

Assume a, b are JxN and c is NxK with very large N but small J and K,
especially min(J,K)==1. I think storing (a+b) to a temporary
could/should theoretically be avoided (pseudo-code):

Result.setZero();
for(int i=0; i<N; ++i) {
  // each element of (a+b) and c is accessed only once,
  // at the cost of accessing Result multiple times
  Result += (a+b).col(i) * c.row(i);
}
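
For reference, a self-contained version of the same idea in plain C++. The Mat struct and the function name are hypothetical and only serve this sketch; nothing here is Eigen's actual evaluator:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal row-major matrix, only for this sketch.
struct Mat {
  std::size_t rows, cols;
  std::vector<double> data;
  Mat(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
  double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Computes (a+b)*c as a sum of N rank-1 updates without storing (a+b).
Mat sum_times(const Mat& a, const Mat& b, const Mat& c) {
  assert(a.rows == b.rows && a.cols == b.cols && a.cols == c.rows);
  const std::size_t J = a.rows, N = a.cols, K = c.cols;
  Mat result(J, K);  // zero-initialized by the constructor
  for (std::size_t i = 0; i < N; ++i) {
    for (std::size_t j = 0; j < J; ++j) {
      const double s = a(j, i) + b(j, i);  // each element of (a+b) is formed once
      for (std::size_t k = 0; k < K; ++k) {
        result(j, k) += s * c(i, k);  // row i of c is reused for all J rows
      }
    }
  }
  return result;
}
```

With J == 1 every element of c is also read exactly once. For larger J, row i of c is re-read J times, but it should still be in cache since K is small.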

p20: [ (A+B).normalized() ]

Theoretically, the norm could be accumulated while the expression is
evaluated into the temporary, saving one pass over the vector.
Furthermore, for this example, the result vector could usually be used
to store the temporary (saving cache accesses).
Pseudo-code:

norm2 = 0;
for(int i=0; i<N; ++i) {
  temp = A(i)+B(i);
  res(i) = temp;
  norm2 += temp*temp;
}
normInv = 1/sqrt(norm2);
for(int i=0; i<N; ++i) {
  res(i) *= normInv;
}

Of course, for a direct-access source, we do not need this
temp mechanism at all, but can directly accumulate the norm from the source
and evaluate the scaled source into the dst (that's exactly what your code
would do, I assume).
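
In plain C++, the direct-access case I have in mind would look roughly like this (a sketch only; the function name is made up, and std::vector stands in for a directly readable source):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: dst = src / ||src|| for a source that can be read directly.
void normalized_into(const std::vector<double>& src, std::vector<double>& dst) {
  double norm2 = 0.0;
  for (double x : src) {
    norm2 += x * x;  // pass 1: accumulate the squared norm straight from src
  }
  assert(norm2 > 0.0);  // this sketch does not handle the zero vector
  const double normInv = 1.0 / std::sqrt(norm2);
  dst.resize(src.size());
  for (std::size_t i = 0; i < src.size(); ++i) {
    dst[i] = src[i] * normInv;  // pass 2: write the scaled source, no temporary
  }
}
```

Both passes read src, but nothing is stored in between, so the only writes are the final ones to dst.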

Finally, regarding vectorization of (partially) unaligned objects:

I think it would be nice if we could somehow determine the
coeff-read/write cost and the packet-read/write cost of the src and dst
and decide at compile time which path is more efficient (cf. bug 256).
I'm aware that's a bit of wishful thinking, since for many expressions the
costs are hard to determine and they also strongly depend on the target
architecture (if we trust Agner's instruction timings [1], there is no
longer any difference between aligned and unaligned loads/stores on
IvyBridge/SandyBridge/Haswell, compared to a factor of 2 or 4 on
Wolfdale/Nehalem).
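
Just to illustrate the kind of compile-time decision I mean, here is a toy sketch. All names and all cost numbers are hypothetical placeholders (neither Eigen's actual traits nor measured timings), and a real cost model would certainly need more inputs:

```cpp
#include <cassert>

// Hypothetical per-expression access costs in arbitrary units.
struct AccessCost {
  int coeff_read, coeff_write;    // cost per scalar access
  int packet_read, packet_write;  // cost per packet access
};

enum class Path { Scalar, Packet };

// Compares one packet read plus one packet write against handling the same
// packet_size coefficients one by one, and picks the cheaper path.
constexpr Path choose_path(AccessCost src, AccessCost dst, int packet_size) {
  return (src.packet_read + dst.packet_write <
          packet_size * (src.coeff_read + dst.coeff_write))
             ? Path::Packet
             : Path::Scalar;
}

// Made-up cost tables: cheap packet access vs. expensive unaligned packet access.
constexpr AccessCost kCheapPacket{1, 1, 1, 1};
constexpr AccessCost kSlowUnaligned{1, 1, 10, 10};

// The decision is available at compile time.
static_assert(choose_path(kCheapPacket, kCheapPacket, 4) == Path::Packet,
              "packet path expected when packet access is cheap");
static_assert(choose_path(kSlowUnaligned, kSlowUnaligned, 4) == Path::Scalar,
              "scalar path expected when packet access is expensive");
```

The result could then select a template specialization of the assignment loop, with the cost tables supplied per architecture.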

None of the above is a mandatory optimization at the moment, but it
would be nice if all/most of them were implementable without major
refactoring.

I hope I'll find the time to look through your source and give more
constructive comments as well :)

[1] http://www.agner.org/optimize/instruction_tables.pdf
--
----------------------------------------------
Dipl.-Inf., Dipl.-Math. Christoph Hertzberg
Cartesium 0.049
Universität Bremen
Enrique-Schmidt-Straße 5
28359 Bremen
Tel: +49 (421) 218-64252
----------------------------------------------