Re: [eigen] news on the refactoring of the expression template mechanism


Hi,

very nice work!

I have not walked through the source yet, so some of my questions/suggestions/comments might actually be trivial.


About compound assignment operators (p14):
Will this be extendable/specializable for FMA support? This could save operations for array expressions such as A += B*C; (and we already have a pmadd packet function).
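
For illustration, the per-packet kernel of such a specialized assignment could look roughly like this, using the existing packet primitives (only a sketch of the idea with made-up function/variable names, not the actual assignment code; pmadd only lowers to a real fma instruction on targets that provide one):

  #include <Eigen/Core>
  // dst += lhs * rhs, one packet at a time; assumes aligned data and
  // n being a multiple of the packet size
  void fma_plus_assign(float* dst, const float* lhs, const float* rhs, int n)
  {
    using namespace Eigen::internal;
    typedef packet_traits<float>::type Packet;
    const int PS = packet_traits<float>::size;
    for(int i=0; i<n; i+=PS) {
      Packet a = pload<Packet>(dst + i);
      Packet b = pload<Packet>(lhs + i);
      Packet c = pload<Packet>(rhs + i);
      pstore(dst + i, pmadd(b, c, a)); // b*c + a, ideally one fused multiply-add
    }
  }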


Regarding temporaries (all pseudo-code below should be extendable to packet code):

p12: [ (a+b)*c ]
Assume a, b are JxN and c is NxK with very large N but small J and K, in particular min(J,K)==1. I think storing (a+b) into a temporary could/should theoretically be avoided (pseudo-code):
  Result.setZero();
  for(int i=0; i<N; ++i) {
    // each element of (a+b) and c is accessed only once,
    // at the cost of accessing Result multiple times
    Result += (a+b).col(i) * c.row(i);
  }
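
Just to spell out the same idea with the current API (assuming double matrices of the sizes above; this is of course what one can already write by hand today):

  MatrixXd Result = MatrixXd::Zero(J, K);
  for(int i=0; i<N; ++i) {
    // only a J-vector of (a+b) is formed per iteration, never the full JxN temporary
    Result.noalias() += (a.col(i) + b.col(i)) * c.row(i);
  }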

p20: [ (A+B).normalized() ]
Theoretically, the norm could be accumulated while the expression is evaluated into the temporary, saving one pass through the vector. Furthermore, for this example the result vector can usually be used to store the temporary (saving cache accesses).
Pseudo-code:
  norm2 = 0;
  for(int i=0; i<N; ++i) {
    temp = A(i) + B(i);
    res(i) = temp;
    norm2 += temp*temp;
  }
  normInv = 1/sqrt(norm2);
  for(int i=0; i<N; ++i) {
    res(i) *= normInv;
  }

Of course, for a direct-access source we do not need this temp mechanism at all, but can directly accumulate the norm from the source and evaluate the scaled source into the destination (that's exactly what your code would do, I assume).


Finally, regarding vectorization of (partially) unaligned objects:
I think it would be nice if we could somehow determine the coeff read/write cost and the packet read/write cost of the src and dst and decide at compile time which path is more efficient (cf. bug 256). I'm aware that this is a bit of wishful thinking, since for many expressions the costs are hard to determine, and they also depend strongly on the target architecture (if we trust Agner's instruction timings [1], there is no difference between aligned and unaligned loads/stores anymore on IvyBridge/SandyBridge/Haswell, compared to a factor of 2 or 4 difference on Wolfdale/Nehalem).
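
As a toy illustration of the kind of compile-time decision I have in mind (all names and cost numbers below are made up for the example, they are not existing Eigen traits):

  enum { ScalarTraversal, UnalignedPacketTraversal };

  // pick the traversal from per-access costs known at compile time:
  // the scalar path pays CoeffCost per coefficient, the packet path amortizes
  // UnalignedPacketCost over PacketSize coefficients
  template<int CoeffCost, int UnalignedPacketCost, int PacketSize>
  struct select_traversal
  {
    enum {
      value = (UnalignedPacketCost <= CoeffCost * PacketSize)
              ? UnalignedPacketTraversal : ScalarTraversal
    };
  };

  // e.g. unaligned packet access costing 2, scalar access costing 1, 4 floats per packet:
  //   select_traversal<1, 2, 4>::value == UnalignedPacketTraversal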



None of the above is a mandatory optimization at the moment, but it would be nice if all/most of them could later be implemented without major refactoring.




I hope I'll find the time to look through your source and give more constructive comments, as well :)



[1] http://www.agner.org/optimize/instruction_tables.pdf



--
----------------------------------------------
Dipl.-Inf., Dipl.-Math. Christoph Hertzberg
Cartesium 0.049
Universität Bremen
Enrique-Schmidt-Straße 5
28359 Bremen

Tel: +49 (421) 218-64252
----------------------------------------------


