Re: [eigen] news on the refactoring of the expression template mechanism
Hi,
very nice work!
I have not walked through the source yet, so some of my
questions, suggestions and comments may turn out to be trivial.
About compound assignment operators (p14):
Will this be extendable or specializable for FMA support? That could save
operations for array expressions such as A+=B*C; (and we already have a
pmadd packet function).
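
To illustrate the kind of saving meant here, a minimal sketch in plain C++ (std::vector instead of Eigen's evaluators or packet code; the function name multiply_accumulate is made up for this example). It assumes a coefficient-wise A += B*C and maps each element to one std::fma call:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: coefficient-wise a[i] += b[i] * c[i].
// std::fma computes b*c + a with a single rounding; on hardware with
// FMA instructions the compiler can emit one instruction per element.
void multiply_accumulate(std::vector<double>& a,
                         const std::vector<double>& b,
                         const std::vector<double>& c) {
    assert(a.size() == b.size() && a.size() == c.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = std::fma(b[i], c[i], a[i]);
    }
}
```

On targets without hardware FMA, std::fma may be emulated in software and be slower than a separate multiply and add, so a real implementation would presumably enable this path only when the instruction is available.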
Regarding temporaries (all pseudo-code below should be extendable to
packet code):
p12: [ (a+b)*c ]
Assume a, b are JxN and c is NxK with very large N but small J and K,
especially min(J,K)==1. I think storing (a+b) to a temporary
could/should theoretically be avoided (pseudo-code):
for(int i=0; i<N; ++i) {
Result += (a+b).col(i) * c.row(i); // each element of (a+b) and c is accessed only once at the cost of accessing Result multiple times
}
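
The loop above can be sketched in plain C++ as follows. This is only an illustration under assumptions: the row-major Matrix struct and the function sum_times are hypothetical stand-ins, not Eigen types. Each coefficient of (a+b) is formed once and used immediately, so no JxN temporary is stored:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major dense matrix, used only for this sketch.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Computes (a + b) * c as a sum of N rank-1 updates, without a temporary for a + b.
Matrix sum_times(const Matrix& a, const Matrix& b, const Matrix& c) {
    assert(a.rows == b.rows && a.cols == b.cols && a.cols == c.rows);
    Matrix result(a.rows, c.cols);  // small J x K accumulator, zero-initialized
    for (std::size_t i = 0; i < a.cols; ++i) {        // long dimension N
        for (std::size_t j = 0; j < a.rows; ++j) {
            const double s = a(j, i) + b(j, i);       // coefficient of (a+b), computed once
            for (std::size_t k = 0; k < c.cols; ++k) {
                result(j, k) += s * c(i, k);          // result is revisited N times
            }
        }
    }
    return result;
}
```

With min(J,K)==1 the accumulator is a single row or column, so revisiting it N times should be cheap compared to writing and re-reading a JxN temporary.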
p20: [ (A+B).normalized() ]
Theoretically, the norm could be accumulated while the expression is
evaluated into the temporary, saving one walk through the vector.
Furthermore, for this example, usually the result vector could be used
to store the temporary (saving cache-accesses).
Pseudo-code:
norm2 = 0;
for(int i=0; i<N; ++i) {
  temp = A(i)+B(i);
  res(i) = temp;
  norm2 += temp*temp;
}
normInv = 1/sqrt(norm2);
for(int i=0; i<N; ++i) {
  res(i) *= normInv;
}
Of course, for a direct-access source, we do not need this
temp-mechanism at all, but directly accumulate the norm from the source
and evaluate the scaled source to the dst (that's exactly what your code
would do, I assume).
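
For reference, a compilable version of the pseudo-code above, again as a sketch on std::vector rather than on Eigen expressions. The name normalized_sum is made up, and the guard against a zero norm is my own assumption, not something the pseudo-code specifies:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns (A + B) / ||A + B||, using the result
// vector itself as the temporary and accumulating the squared norm
// during the first pass.
std::vector<double> normalized_sum(const std::vector<double>& A,
                                   const std::vector<double>& B) {
    assert(A.size() == B.size());
    std::vector<double> res(A.size());
    double norm2 = 0.0;
    for (std::size_t i = 0; i < A.size(); ++i) {
        const double temp = A[i] + B[i];
        res[i] = temp;          // store the unscaled sum in the destination
        norm2 += temp * temp;   // accumulate the squared norm in the same pass
    }
    if (norm2 > 0.0) {          // assumption: leave an all-zero vector untouched
        const double normInv = 1.0 / std::sqrt(norm2);
        for (std::size_t i = 0; i < res.size(); ++i) {
            res[i] *= normInv;  // second pass only touches the destination
        }
    }
    return res;
}
```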
Finally, regarding vectorization of (partially) unaligned objects:
I think it would be nice if we could somehow determine the
coeff-read/write cost and the packet-read/write cost of the src and dst
and decide at compile time which path is more efficient (cf. bug 256).
I'm aware that this is a bit of wishful thinking, since for many expressions the
costs are hard to determine and they also depend strongly on the target
architecture (if we trust Agner's instruction timings [1], there is no
longer any difference between aligned and unaligned loads/stores on
IvyBridge/SandyBridge/Haswell, compared to a factor of 2 or 4 on
Wolfdale/Nehalem).
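
A rough sketch of what such a compile-time decision could look like. Everything here is hypothetical: the trait structs, the cost numbers (abstract units, not measured timings), the packet size and the choose_path function are invented for illustration and are not part of Eigen:

```cpp
#include <cassert>

// Hypothetical cost traits for an expression; the numbers are made up.
struct AlignedExpr {
    static constexpr int coeff_read_cost = 1;
    static constexpr int coeff_write_cost = 1;
    static constexpr int packet_read_cost = 1;
    static constexpr int packet_write_cost = 1;
};
struct CostlyUnalignedExpr {
    static constexpr int coeff_read_cost = 1;
    static constexpr int coeff_write_cost = 1;
    static constexpr int packet_read_cost = 8;   // e.g. slow unaligned loads
    static constexpr int packet_write_cost = 8;  // e.g. slow unaligned stores
};

constexpr int kPacketSize = 4;  // assumed number of coefficients per packet

enum class Path { Scalar, Packet };

// Picks the cheaper path for copying one packet's worth of coefficients
// from Src to Dst, entirely at compile time.
template <typename Src, typename Dst>
constexpr Path choose_path() {
    return (Src::packet_read_cost + Dst::packet_write_cost) <
                   kPacketSize * (Src::coeff_read_cost + Dst::coeff_write_cost)
               ? Path::Packet
               : Path::Scalar;
}

static_assert(choose_path<AlignedExpr, AlignedExpr>() == Path::Packet,
              "cheap packet access should select the packet path");
```

The hard part remains the one mentioned above: filling in realistic, architecture-dependent costs for non-trivial expressions.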
None of the above is a mandatory optimization at the moment, but it
would be nice if all or most of them could be implemented later without
major refactoring.
I hope I'll find the time to look through your source and give more
constructive comments, as well :)
[1] http://www.agner.org/optimize/instruction_tables.pdf
--
----------------------------------------------
Dipl.-Inf., Dipl.-Math. Christoph Hertzberg
Cartesium 0.049
Universität Bremen
Enrique-Schmidt-Straße 5
28359 Bremen
Tel: +49 (421) 218-64252
----------------------------------------------