Re: [eigen] news on the refactoring of the expression template mechanism



On Fri, Feb 21, 2014 at 5:48 PM, Christoph Hertzberg
<chtz@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
> very nice work!
>
> I did not walk through the source so far, so (some of) my
> questions/suggestions/comments might actually be trivial.
>
>
> About compound assignment operators (p14):
> Will this be extendable/specializable for FMA-Support? This could save
> operations for array expressions such as A+=B*C; (and we already have a
> pmadd packet function).

This could be done by specializing:

Assignment<Dst,CwiseBinaryOp<scalar_prod_op,B,C>, add_assign_op, Dense2Dense>

as well as:

 evaluator<CwiseBinaryOp<scalar_sum_op,A,CwiseBinaryOp<scalar_prod_op,B,C>>>

for A+B*C.


> Regarding Temporaries (all pseudo-codes should be extendable to
> packet-code):
>
> p12: [ (a+b)*c ]
> Assume a, b are JxN and c is NxK with very large N but small J and K,
> especially min(J,K)==1. I think storing (a+b) to a temporary could/should
> theoretically be avoided (pseudo-code):
>>
>>   for(int i=0; i<N; ++i) {
>>     Result += (a+b).col(i) * c.row(i); // each element of (a+b) and c is
>> accessed only once at the cost of accessing Result multiple times
>>   }

This is already the case if c is a vector. If a and b are row vectors
but c is a matrix, then a+b is evaluated into a temporary. In theory
that temporary could be avoided, but we have to find a tradeoff
between the genericity of the product kernels and the number of
instantiations, because product kernels are heavyweight. We could,
however, evaluate (a+b) in small chunks...

> p20: [ (A+B).normalized() ]
> Theoretically, the norm could be accumulated while the expression is
> evaluated into the temporary, saving one walk through the vector.
> Furthermore, for this example, usually the result vector could be used to
> store the temporary (saving cache-accesses).
> Pseudo-code:
>>
>>   norm2 = 0;
>>   for(int i=0; i<N; ++i) {
>>     temp = A(i)+B(i);
>>     res(i) = temp;
>>     norm2 += temp*temp;
>>   }
>>   normInv = 1/sqrt(norm2);
>>   for(int i=0; i<N; ++i) {
>>     res(i) *= normInv;
>>   }

This is typically something that could become possible with the new
generic "kernels", but I don't know whether it's worth the effort.

> Finally, regarding vectorization of (partially) unaligned objects:
> I think it would be nice, if we could somehow determine the coeff-read/write
> cost and the packet-read/write cost of the src and dst and decide at
> compile-time which path is more efficient (c.f. bug 256). I'm aware that's a
> bit wishful thinking, since for many expressions the costs are hard to
> determine and they also strongly depend on the target architecture (if we
> trust Agner's instruction timings [1] there is no difference between aligned
> and unaligned loads/stores anymore on IvyBridge/SandyBridge/Haswell,
> compared to a factor 2 or 4 difference on Wolfdale/Nehalem).

Yes, that's a planned feature, and actually a needed one: we will have
to determine which packet size is best when multiple sizes are
possible.


> Nothing of the above is a mandatory optimization at the moment, but it would
> be nice if all/most of them will be implementable without major
> refactorings.

Sure!

gael


