- To: eigen <eigen@xxxxxxxxxxxxxxxxxxx>
- Subject: Re: [eigen] news on the refactoring of the expression template mechanism
- From: Gael Guennebaud <gael.guennebaud@xxxxxxxxx>
- Date: Fri, 21 Feb 2014 23:49:28 +0100
On Fri, Feb 21, 2014 at 5:48 PM, Christoph Hertzberg
<chtz@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
> very nice work!
>
> I did not walk through the source so far, so (some of) my
> questions/suggestions/comments might actually be trivial.
>
>
> About compound assignment operators (p14):
> Will this be extendable/specializable for FMA-Support? This could save
> operations for array expressions such as A+=B*C; (and we already have a
> pmadd packet function).
This could be done by specializing:
Assignment<Dst, CwiseBinaryOp<scalar_prod_op,B,C>, add_assign_op, Dense2Dense>
for A += B*C, as well as:
evaluator<CwiseBinaryOp<scalar_sum_op, A, CwiseBinaryOp<scalar_prod_op,B,C>>>
for A + B*C.
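To make the benefit of such a specialization concrete, here is a minimal sketch in plain C++ (not Eigen's actual internals; the function names are hypothetical): the generic path evaluates B*C into a temporary before adding it, while the specialized path fuses the two into a single multiply-add loop, which at the packet level would map to one pmadd per packet.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Generic path: evaluate B*C into a temporary, then add it to A.
void add_assign_generic(std::vector<double>& a,
                        const std::vector<double>& b,
                        const std::vector<double>& c) {
    std::vector<double> tmp(b.size());               // temporary for B*C
    for (std::size_t i = 0; i < b.size(); ++i) tmp[i] = b[i] * c[i];
    for (std::size_t i = 0; i < a.size(); ++i) a[i] += tmp[i];
}

// Specialized path: one fused loop, the shape a pmadd-based kernel would take.
void add_assign_fused(std::vector<double>& a,
                      const std::vector<double>& b,
                      const std::vector<double>& c) {
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] += b[i] * c[i];                         // FMA-friendly: no temporary
}
```

Both paths produce the same result; the fused one saves the temporary and one pass over memory.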
> Regarding Temporaries (all pseudo-codes should be extendable to
> packet-code):
>
> p12: [ (a+b)*c ]
> Assume a, b are JxN and c is NxK with very large N but small J and K,
> especially min(J,K)==1. I think storing (a+b) to a temporary could/should
> theoretically be avoided (pseudo-code):
>>
>> for(int i=0; i<N; ++i) {
>> Result += (a+b).col(i) * c.row(i); // each element of (a+b) and c is
>> accessed only once at the cost of accessing Result multiple times
>> }
This is already the case if c is a vector. If a and b are row vectors
but c is a matrix, then a+b is evaluated into a temporary. In theory
this is not needed, but we have to find a tradeoff between the
genericity of the product kernels and the number of their
instantiations, since product kernels are heavyweight. We could,
however, evaluate (a+b) in small chunks...
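The rank-1-update scheme in the quoted pseudo-code can be sketched in plain C++ as follows (assumed row-major layout, not Eigen code): Result = (a+b)*c is accumulated as a sum of N rank-1 updates, so each element of a, b, and c is read exactly once and a+b is never materialized, at the cost of revisiting Result N times.

```cpp
#include <cassert>
#include <vector>

// a and b are JxN, c is NxK, result is JxK, all row-major and
// result zero-initialized by the caller.
void product_sum_rank1(const std::vector<double>& a,
                       const std::vector<double>& b,
                       const std::vector<double>& c,
                       std::vector<double>& result,
                       int J, int N, int K) {
    for (int i = 0; i < N; ++i)                        // shared dimension
        for (int j = 0; j < J; ++j) {
            double s = a[j*N + i] + b[j*N + i];        // (a+b) element, computed once
            for (int k = 0; k < K; ++k)
                result[j*K + k] += s * c[i*K + k];     // rank-1 update of Result
        }
}
```

This pays off when J and K are small relative to N, since Result then fits in cache while a, b, and c are streamed through once.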
> p20: [ (A+B).normalized() ]
> Theoretically, the norm could be accumulated while the expression is
> evaluated into the temporary, saving one walk through the vector.
> Furthermore, for this example, usually the result vector could be used to
> store the temporary (saving cache-accesses).
> Pseudo-code:
>>
>> norm2 = 0;
>> for(int i=0; i<N; ++i) {
>> temp = A(i)+B(i);
>> res(i) = temp;
>> norm2 += temp*temp;
>> }
>> normInv = 1/sqrt(norm2)
>> for(int i=0; i<N; ++i) {
>> res(i) *= normInv;
>> }
This is typically something that could become possible with the new
generic "kernels", but I don't know if it's worth the effort.
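For reference, the fused kernel described in the quoted pseudo-code looks like this in plain C++ (a hypothetical sketch, not Eigen's implementation): res = A + B is written directly into the destination while the squared norm is accumulated in the same pass, then res is rescaled in place, so no separate temporary vector is needed.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Compute res = (a + b).normalized() in two passes, using the result
// storage itself as the temporary for a + b.
void add_and_normalize(const std::vector<double>& a,
                       const std::vector<double>& b,
                       std::vector<double>& res) {
    double norm2 = 0.0;
    for (std::size_t i = 0; i < res.size(); ++i) {
        double t = a[i] + b[i];
        res[i] = t;                    // store into the final destination
        norm2 += t * t;                // accumulate the norm in the same pass
    }
    double normInv = 1.0 / std::sqrt(norm2);
    for (std::size_t i = 0; i < res.size(); ++i)
        res[i] *= normInv;             // second pass touches only res
}
```

Compared with the generic temporary-based evaluation, a+b is read once instead of twice and only the destination is traversed a second time.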
> Finally, regarding vectorization of (partially) unaligned objects:
> I think it would be nice, if we could somehow determine the coeff-read/write
> cost and the packet-read/write cost of the src and dst and decide at
> compile-time which path is more efficient (c.f. bug 256). I'm aware that's a
> bit wishful thinking, since for many expressions the costs are hard to
> determine and they also strongly depend on the target architecture (if we
> trust Agner's instruction timings [1] there is no difference between aligned
> and unaligned loads/stores anymore on IvyBridge/SandyBridge/Haswell,
> compared to a factor 2 or 4 difference on Wolfdale/Nehalem).
Yes, that's a planned feature, and actually a needed one once we have
to determine which packet size is best when multiple sizes are
possible.
> Nothing of the above is a mandatory optimization at the moment, but it would
> be nice if all/most of them will be implementable without major
> refactorings.
Sure!
gael