Re: [eigen] vectorization of complex


- *To*: eigen@xxxxxxxxxxxxxxxxxxx
- *Subject*: Re: [eigen] vectorization of complex
- *From*: Gael Guennebaud <gael.guennebaud@xxxxxxxxx>
- *Date*: Tue, 20 Jul 2010 11:47:07 +0200

Some news from this fork. Actually, achieving reasonably good performance required much more rework of the product kernel than I thought. The result is that we now have reasonably good performance for complex-complex matrix products, but also high performance for mixed real-complex and complex-real products. Since I doubt MKL supports mixed real-complex product routines, I believe we are by far the fastest library for these tasks :)

Explanation:

Let me first recall that at the lowest blocking level, our product kernel picks one small vertical panel of the RHS (e.g., 128xNR) and one small horizontal panel of the LHS (e.g., MRx128). Then it computes these MRxNR "dot products" at once while accumulating into registers (the memory writes occur only at the end of this accumulation). The values of MR and NR depend on the number of available registers (e.g., for floats with 16 SSE registers we take MR=8 and NR=4). The current small vertical panel of the RHS is reused many times, once for each of several horizontal panels of the LHS.

These dot products are vectorized by computing packet * scalar products (e.g., [a_i+0 a_i+1 a_i+2 a_i+3] * b_j), which requires unpacking (broadcasting) each RHS scalar b_j into a full packet, e.g.: [a_i+0 a_i+1 a_i+2 a_i+3] * [b_j b_j b_j b_j]. This unpacking is costly, and therefore it is amortized by pre-unpacking the small vertical panel of the RHS into a temporary buffer.

Let's recall the complex-complex product formula:

    (ar + i ai) * (br + i bi) = (ar*br - ai*bi) + i(ar*bi + ai*br)

For complex<double>, we store one scalar per SSE register ([ar,ai], [br,bi]), and the full product can be vectorized with only 2 MADD (mul+add), but it also needs 3 swizzling operations and one "bitwise or" operation to negate some coefficients. This makes it far from optimal. In particular, swizzling is extremely costly. Moreover, all these "complex" operations require many additional registers to store the temporaries.
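To make the real-valued register blocking concrete, here is a minimal portable sketch of such a micro-kernel. It is not Eigen's actual kernel: the names `Packet`, `pmadd`, `unpack_rhs` and `micro_kernel`, as well as the packed memory layouts, are assumptions chosen for illustration, and a plain `std::array` of 4 floats stands in for an SSE register. It uses the MR=8, NR=4 values quoted above.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Stand-in for one SSE register holding 4 floats (illustrative only).
constexpr int PacketSize = 4;
using Packet = std::array<float, PacketSize>;

constexpr int MR = 8;  // rows of the register block (2 packets)
constexpr int NR = 4;  // columns of the register block

// acc += a * b, lane by lane: the "MADD" of the text.
inline void pmadd(const Packet& a, const Packet& b, Packet& acc) {
    for (int l = 0; l < PacketSize; ++l) acc[l] += a[l] * b[l];
}

// Pre-unpack the RHS panel once: each scalar b becomes the packet [b b b b].
// The cost is amortized because the panel is reused for many LHS panels.
std::vector<Packet> unpack_rhs(const std::vector<float>& rhs) {
    std::vector<Packet> out(rhs.size());
    for (std::size_t i = 0; i < rhs.size(); ++i) out[i].fill(rhs[i]);
    return out;
}

// acc[j][p] holds rows p*PacketSize .. p*PacketSize+3 of column j.
using Block = std::array<std::array<Packet, MR / PacketSize>, NR>;

// Assumed layouts: lhs stores MR consecutive values per depth index k,
// rhs_unpacked stores NR broadcast packets per depth index k.
Block micro_kernel(const std::vector<float>& lhs,
                   const std::vector<Packet>& rhs_unpacked,
                   std::size_t depth) {
    Block acc{};  // all accumulators start at zero and stay "in registers"
    for (std::size_t k = 0; k < depth; ++k) {
        for (int p = 0; p < MR / PacketSize; ++p) {
            Packet a;
            for (int l = 0; l < PacketSize; ++l)
                a[l] = lhs[k * MR + p * PacketSize + l];
            for (int j = 0; j < NR; ++j)
                pmadd(a, rhs_unpacked[k * NR + j], acc[j][p]);
        }
    }
    return acc;  // written back to memory only after the whole accumulation
}
```

With these sizes the block needs 8 accumulator packets, which leaves room within 16 registers for the LHS and RHS operands.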
So, the trick is to observe that a sum of N complex-complex products can be computed with only 2*N MADD and a single swizzling operation, by computing the sum of each of the four real-real products separately. This requires 2 registers for the accumulation instead of one:

    [sum(ar_k*br_k), sum(ai_k*br_k)]
    [sum(ar_k*bi_k), sum(ai_k*bi_k)]

(As I do for real, I also pre-unpack [br_k,bi_k] as [br_k,br_k],[bi_k,bi_k] into a small temporary buffer to amortize the cost of this unpacking.)

At the end of the register-level accumulation, i.e., before accumulating this sum into the destination matrix, we can easily re-assemble these two registers to get the final result. Another advantage is that we don't have to distinguish between the optional conjugation states of the LHS and RHS during the accumulation, but only at the final re-assembly stage.

This required a generalization of the product kernel, which also allowed me to vectorize real * complex products. The latter is still not as optimal as the complex*real case because it still requires N swizzling operations on the LHS (recall that the vectorization is not symmetric with respect to the left- and right-hand sides). Nevertheless, it is still faster than complex-complex products, and faster than transposing the product to be in the complex*real case.

Note that I still need more testing and cleaning before merging...

cheers,
gael
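The split-accumulation trick described in this message can be sketched in portable C++ as follows. This is only an illustration of the arithmetic, not the code from the fork: `Packet2d` and `split_accumulate_dot` are hypothetical names, a two-double struct stands in for an SSE register, and the `conj_a`/`conj_b` flags are an assumed way of expressing the optional conjugation states.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Stand-in for one SSE register holding 2 doubles (illustrative only).
struct Packet2d { double lo, hi; };

// Computes sum_k op(a_k) * op(b_k), where op is identity or conjugation.
// Assumes a and b have the same length.
std::complex<double> split_accumulate_dot(
        const std::vector<std::complex<double>>& a,
        const std::vector<std::complex<double>>& b,
        bool conj_a = false, bool conj_b = false) {
    Packet2d acc_r{0.0, 0.0};  // [sum(ar*br), sum(ai*br)]
    Packet2d acc_i{0.0, 0.0};  // [sum(ar*bi), sum(ai*bi)]

    for (std::size_t k = 0; k < a.size(); ++k) {
        const double ar = a[k].real(), ai = a[k].imag();
        const double br = b[k].real(), bi = b[k].imag();
        // First MADD: [ar, ai] * [br, br] (RHS pre-unpacked as [br, br]).
        acc_r.lo += ar * br;
        acc_r.hi += ai * br;
        // Second MADD: [ar, ai] * [bi, bi] (RHS pre-unpacked as [bi, bi]).
        acc_i.lo += ar * bi;
        acc_i.hi += ai * bi;
    }

    // Re-assembly, done once per accumulated block. Conceptually this is one
    // swap of acc_i's two lanes plus sign fixes; conjugation only changes
    // the signs applied here, never the accumulation loop above.
    const double sa = conj_a ? -1.0 : 1.0;
    const double sb = conj_b ? -1.0 : 1.0;
    const double re = acc_r.lo - sa * sb * acc_i.hi;
    const double im = sa * acc_r.hi + sb * acc_i.lo;
    return {re, im};
}
```

For example, with a = {1+2i, 3-i} and b = {0.5+i, -2+4i}, the unconjugated sum of products is -3.5+16i. The loop body contains exactly two multiply-add steps per product and no lane shuffling, which is the point of the trick.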

