Re: [eigen] vectorization of complex

[ Thread Index | Date Index | More Archives ]

2010/7/20 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
> Some news from this fork.
> Actually, achieving reasonably good performance required much more
> rework of the product kernel that I thought. The result is that we now
> have reasonably good performance for complex-complex matrix products,
> but also high performance for mixed real-complex and complex-real
> products. Since I doubt MKL supports mixed real-complex product
> routines, I believe we are by far the fastest library for these tasks
> :)
> Explanation:
> Let me first recall that at the lowest blocking level, our product
> kernel picks one small vertical panel of the RHS (e.g., 128xNR) and
> one small horizontal panel of the LHS (e.g., MRx128). Then it computes
> these MRxNR "dot products" at once while accumulating into registers
> (the memory writes occur only at the end of this accumulation). The
> values of MR and NR depends on the number of available registers
> (e.g., for floats with 16 SSE registers we take MR=8 and NR=4). The
> current small vertical panel of the RHS is reused many time for
> several horizontal panels of the LHS. These dot products are
> vectorized by computing packet * scalar products (e.g., [a_i+0 a_i+1
> a_i+2 a_i+3] * b_j) that requires to unpack each rhs scalar b_j, e.g.:
> [a_i+0 a_i+1 a_i+2 a_i+3] * [b_j b_j b_j b_j]. This unpacking is
> costly and therefore it is amortized by pre-unpacking the small
> vertical panel of the RHS into a temporary buffer.
> Let's recall the complex-complex product formula:
> (ar + i ai) * (br + i bi) = (ar*br - ai*bi) + i(ar*bi + ai*br)
> For complex<double>, we store one scalar per SSE register ([ar,ai],
> [br,bi]), and the full product can be vectorized by doing only 2 MADD
> (mul+add), but 3 swizzling and one "bitwise or" operation to negate
> some coefficients. This make it far to be optimal. In particular
> swizzling is extremely costly. Moreover all these "complex" operations
> requires many additional registers to store the temporaries.
> So, the trick is to observe that a sum of N complex-complex products
> can be computed with only 2*N MADD and a single one swizzling
> operation by computing the sum of each of the four real-real products
> separately. This requires 2 registers for the accumulation instead of
> one:
> [sum(ar_k*br_k), sum(ai_k*br_k)]
> [sum(ar_k*bi_k), sum(ai_k*bi_k)]

Wow, this is really smart.

I think that all 200 people here must be very admirative and thankful
for all your products work! At least I am.


> (As I do for real, I also pre-unpack [br_k,bi_k] as
> [br_k,br_k],[bi_k,bi_k] into a small temporary buffer to amortize the
> cost of this unpacking).
> At the end of the register level accumulation, i.e., before
> accumulating this sum into the destination matrix, we can easily
> re-assemble these two registers to get the final result. Another
> advantage is that we don't have to distinguish between the optional
> conjugation states of the lhs and rhs during the accumulation, but
> only at the final re-assembly stage.
> This required a generalization of the product kernel that also allowed
> me to vectorize real * complex products. This later is still not as
> optimal as the complex*real case because it still requires N swizzling
> on the LHS (recall the vectorization is not symmetric wrt the left and
> right hand sides) Nevertheless, it is still faster than
> complex-complex products, and faster than transposing the product to
> be in the complex*real case.
> Note that I still need more testing and cleaning before merging....
> cheers,
> gael

Mail converted by MHonArc 2.6.19+