Re: [eigen] news on the refactoring of the expression template mechanism


Hi,

very nice work!

I have not walked through the source yet, so some of my questions/suggestions/comments might actually be trivial.


About compound assignment operators (p14):
Will this be extendable/specializable for FMA support? This could save operations for array expressions such as A += B*C; (and we already have a pmadd packet function).
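
For illustration, the per-packet kernel of such a specialized assignment could look roughly like this, using the existing packet primitives (only a sketch of the idea with made-up function/variable names, not the actual assignment code; pmadd only lowers to a real fma instruction on targets that provide one):

  #include <Eigen/Core>
  // dst += lhs * rhs, one packet at a time; assumes aligned data and
  // n being a multiple of the packet size
  void fma_plus_assign(float* dst, const float* lhs, const float* rhs, int n)
  {
    using namespace Eigen::internal;
    typedef packet_traits<float>::type Packet;
    const int PS = packet_traits<float>::size;
    for(int i=0; i<n; i+=PS) {
      Packet a = pload<Packet>(dst + i);
      Packet b = pload<Packet>(lhs + i);
      Packet c = pload<Packet>(rhs + i);
      pstore(dst + i, pmadd(b, c, a)); // b*c + a, ideally one fused multiply-add
    }
  }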


Regarding temporaries (all pseudo-code below should be extendable to packet code):

p12: [ (a+b)*c ]
Assume a, b are JxN and c is NxK with very large N but small J and K, in particular min(J,K)==1. I think storing (a+b) into a temporary could/should theoretically be avoided (pseudo-code):
  Result.setZero();
  for(int i=0; i<N; ++i) {
    // each element of (a+b) and c is accessed only once,
    // at the cost of accessing Result multiple times
    Result += (a+b).col(i) * c.row(i);
  }
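
Just to spell out the same idea with the current API (assuming double matrices of the sizes above; this is of course what one can already write by hand today):

  MatrixXd Result = MatrixXd::Zero(J, K);
  for(int i=0; i<N; ++i) {
    // only a J-vector of (a+b) is formed per iteration, never the full JxN temporary
    Result.noalias() += (a.col(i) + b.col(i)) * c.row(i);
  }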

p20: [ (A+B).normalized() ]
Theoretically, the norm could be accumulated while the expression is evaluated into the temporary, saving one pass through the vector. Furthermore, for this example the result vector can usually be used to store the temporary (saving cache accesses).
Pseudo-code:
  norm2 = 0;
  for(int i=0; i<N; ++i) {
    temp = A(i) + B(i);
    res(i) = temp;
    norm2 += temp*temp;
  }
  normInv = 1/sqrt(norm2);
  for(int i=0; i<N; ++i) {
    res(i) *= normInv;
  }

Of course, for a direct-access source we do not need this temp mechanism at all, but can directly accumulate the norm from the source and evaluate the scaled source into the destination (that's exactly what your code would do, I assume).


Finally, regarding vectorization of (partially) unaligned objects:
I think it would be nice if we could somehow determine the coeff read/write cost and the packet read/write cost of the src and dst and decide at compile time which path is more efficient (cf. bug 256). I'm aware that this is a bit of wishful thinking, since for many expressions the costs are hard to determine, and they also depend strongly on the target architecture (if we trust Agner's instruction timings [1], there is no difference between aligned and unaligned loads/stores anymore on IvyBridge/SandyBridge/Haswell, compared to a factor of 2 or 4 difference on Wolfdale/Nehalem).
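
As a toy illustration of the kind of compile-time decision I have in mind (all names and cost numbers below are made up for the example, they are not existing Eigen traits):

  enum { ScalarTraversal, UnalignedPacketTraversal };

  // pick the traversal from per-access costs known at compile time:
  // the scalar path pays CoeffCost per coefficient, the packet path amortizes
  // UnalignedPacketCost over PacketSize coefficients
  template<int CoeffCost, int UnalignedPacketCost, int PacketSize>
  struct select_traversal
  {
    enum {
      value = (UnalignedPacketCost <= CoeffCost * PacketSize)
              ? UnalignedPacketTraversal : ScalarTraversal
    };
  };

  // e.g. unaligned packet access costing 2, scalar access costing 1, 4 floats per packet:
  //   select_traversal<1, 2, 4>::value == UnalignedPacketTraversal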



None of the above is a mandatory optimization at the moment, but it would be nice if all/most of them could later be implemented without major refactoring.




I hope I'll find the time to look through your source and give more constructive comments, as well :)



[1] http://www.agner.org/optimize/instruction_tables.pdf



--
----------------------------------------------
Dipl.-Inf., Dipl.-Math. Christoph Hertzberg
Cartesium 0.049
Universität Bremen
Enrique-Schmidt-Straße 5
28359 Bremen

Tel: +49 (421) 218-64252
----------------------------------------------


