Re: [eigen] vectorization of complex

[ Thread Index | Date Index | More Archives ]

Some news from this fork.

Actually, achieving reasonably good performance required much more
rework of the product kernel that I thought. The result is that we now
have reasonably good performance for complex-complex matrix products,
but also high performance for mixed real-complex and complex-real
products. Since I doubt MKL supports mixed real-complex product
routines, I believe we are by far the fastest library for these tasks


Let me first recall that at the lowest blocking level, our product
kernel picks one small vertical panel of the RHS (e.g., 128xNR) and
one small horizontal panel of the LHS (e.g., MRx128). Then it computes
these MRxNR "dot products" at once while accumulating into registers
(the memory writes occur only at the end of this accumulation). The
values of MR and NR depends on the number of available registers
(e.g., for floats with 16 SSE registers we take MR=8 and NR=4). The
current small vertical panel of the RHS is reused many time for
several horizontal panels of the LHS. These dot products are
vectorized by computing packet * scalar products (e.g., [a_i+0 a_i+1
a_i+2 a_i+3] * b_j) that requires to unpack each rhs scalar b_j, e.g.:
[a_i+0 a_i+1 a_i+2 a_i+3] * [b_j b_j b_j b_j]. This unpacking is
costly and therefore it is amortized by pre-unpacking the small
vertical panel of the RHS into a temporary buffer.

Let's recall the complex-complex product formula:

(ar + i ai) * (br + i bi) = (ar*br - ai*bi) + i(ar*bi + ai*br)

For complex<double>, we store one scalar per SSE register ([ar,ai],
[br,bi]), and the full product can be vectorized by doing only 2 MADD
(mul+add), but 3 swizzling and one "bitwise or" operation to negate
some coefficients. This make it far to be optimal. In particular
swizzling is extremely costly. Moreover all these "complex" operations
requires many additional registers to store the temporaries.

So, the trick is to observe that a sum of N complex-complex products
can be computed with only 2*N MADD and a single one swizzling
operation by computing the sum of each of the four real-real products
separately. This requires 2 registers for the accumulation instead of

[sum(ar_k*br_k), sum(ai_k*br_k)]
[sum(ar_k*bi_k), sum(ai_k*bi_k)]

(As I do for real, I also pre-unpack [br_k,bi_k] as
[br_k,br_k],[bi_k,bi_k] into a small temporary buffer to amortize the
cost of this unpacking).

At the end of the register level accumulation, i.e., before
accumulating this sum into the destination matrix, we can easily
re-assemble these two registers to get the final result. Another
advantage is that we don't have to distinguish between the optional
conjugation states of the lhs and rhs during the accumulation, but
only at the final re-assembly stage.

This required a generalization of the product kernel that also allowed
me to vectorize real * complex products. This later is still not as
optimal as the complex*real case because it still requires N swizzling
on the LHS (recall the vectorization is not symmetric wrt the left and
right hand sides) Nevertheless, it is still faster than
complex-complex products, and faster than transposing the product to
be in the complex*real case.

Note that I still need more testing and cleaning before merging....



Mail converted by MHonArc 2.6.19+