Re: [eigen] Array of complex numbers |

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]

*To*: eigen <eigen@xxxxxxxxxxxxxxxxxxx>*Subject*: Re: [eigen] Array of complex numbers*From*: Gael Guennebaud <gael.guennebaud@xxxxxxxxx>*Date*: Fri, 11 Jan 2019 11:13:00 +0100*Dkim-signature*: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to; bh=Ecn0/YiOOqYLFFrZF7+pI4ofc6mnc12QavyyCGuiLiM=; b=GoK8TQ82vQ0GQLQD8f5a1l8uLL0gGB4KKzl0SYxot27BRjyAOHIEZxNGMxaH+Iz7v8 xYi8Bt3LBGUpZmSCM0HIswbl4nIqeb7CRSXs9tAI+AsAFgQpUhnElM7gt7kcqsblnlrj hsdE2Du37ufENGWzuKeva+TbnCLFAWpdREXbvzI1zHCVdHJ4MsNaAglMcc4NKVhY1JHp OrhtXoLSYByyQX7lmppGVMMF+Mp7iY8/wUwhL2aBWCGXhn/xTF3ccVJY7M4RVd2HkOYP gh/zl5NPxsv3jeUNwG9S8cwmqglT3Imo/k8bOe71lMWkpaULICSY6GKNHY5Srf6NnlTv H38A==

On Fri, Jan 11, 2019 at 12:47 AM Francois Fayard <fayard@xxxxxxxxxxxxx> wrote:

Thanks,I’ve managed to make it work and, for the scalar product, your code is as fast as my code with a structure of arrays. That’s well done.That’s too bad that we cannot get the full speed for linear combination with complex numbers. As you have said, these optimizations are only useful when we are not limited by the bandwidth, which is when the vectors are small.

With clang I get same speed because it manage to move constant code out of the loop. Sadly both GCC and ICC keeps them in the loop. For instance, here is the asm generated by ICC for:

..B1.21: # Preds ..B1.19 ..B1.21

vmovups ymm2, YMMWORD PTR [r14+rdx*8] #317.127

vbroadcastsd ymm1, QWORD PTR [8+rsp] #83.37

vmovsldup ymm0, ymm1 #66.31

vmovshdup ymm3, ymm1 #67.31

vpermilps ymm4, ymm2, 177 #67.56

vmulps ymm5, ymm0, ymm2 #66.17

vmulps ymm6, ymm3, ymm4 #67.17

vaddsubps ymm7, ymm5, ymm6 #68.19

vmovups YMMWORD PTR [rdi+rdx*8], ymm7 #354.130

add rdx, 4 #419.57

cmp rdx, rsi #419.45

jl ..B1.21 # Prob 82% #419.45

Here ymm1, ymm0, and ymm3 could be moved out of the loop. There is no aliasing restriction because ymm0 comes from 'c' which is a local variable. If I use float instead of complex<float>, the "broadcast" is indeed moved out of the loop by any compiler, e.g. with ICC:

vshufps xmm1, xmm0, xmm0, 0 #129.86

vinsertf128 ymm1, ymm1, xmm1, 1 #129.86

..B1.21: # Preds ..B1.21 ..B1.20

vmulps ymm2, ymm1, YMMWORD PTR [r15+rax*4] #173.103

vmovups YMMWORD PTR [rdi+rax*4], ymm2 #354.130

add rax, 8 #419.57

cmp rax, rsi #419.45

jl ...B1.21 # Prob 82% #419.45

Doing this explicitly in Eigen's code in a generic manner is not trivial at all. If someone has an idea on how to help the compilers doing a proper job, please share it!

Gaël.

Thanks,François FayardOn 10 Jan 2019, at 23:09, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:for pzero, you need the devel branch of Eigen, or just replace it by pset1<Packet>(0).For linear combination with complex coeffs I don't think it's possible to get close SoA for such small (i.e. in-cache) arrays.gaelOn Thu, Jan 10, 2019 at 11:00 PM Francois Fayard <fayard@xxxxxxxxxxxxx> wrote:Thanks,

You are right about the operations. Obviously, for the “linear combination”, I am interested in the case where rk are complex numbers.

I have tried to compile your code that implements the scalar product, but I don’t have any definition for pzero? Where is it?

François

> On 10 Jan 2019, at 22:25, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:

>

> Hi,

>

> just to be sure I fully understood the operations you're considering, let's assume you have:

>

> VectorXcf A1(300), A2(300), ...;

> complex<float> c1, c2, ...;

> float r1, r2, ...;

>

> then "Scalar product" means:

>

> A1.dot(A2)

>

> and "Linear combination" means:

>

> r1*A1 + r2*A2 + ...

>

> If this is right, then linear combination with real coefficients should already be 100% optimal and the only problematic operation could be dot products because of the complex*complex product overhead. I think those can be optimized while keeping an AoS layout by accumulating within multiple packets that we combine only at the end of the reduction instead of doing and expensive combination for every product. This is similar to what we do in GEMM.

>

> I gave it a shot, and I get similar speed than with a SoA layout. See attached file. This is a proof of concept : AVX only, no conjugation, n%16==0, etc.

>

> I think this can be integrated within Eigen by generalizing a bit the current reduction code to allow for custom accumulation packets.

>

> cheers,

> gael

> <bench_cplx_dotprod.cpp>

**References**:**[eigen] Array of complex numbers***From:*Francois Fayard

**Re: [eigen] Array of complex numbers***From:*Gael Guennebaud

**Re: [eigen] Array of complex numbers***From:*Francois Fayard

**Re: [eigen] Array of complex numbers***From:*Gael Guennebaud

**Re: [eigen] Array of complex numbers***From:*Francois Fayard

**Messages sorted by:**[ date | thread ]- Prev by Date:
**Re: [eigen] Array of complex numbers** - Next by Date:
**Re: [eigen] Array of complex numbers** - Previous by thread:
**Re: [eigen] Array of complex numbers** - Next by thread:
**Re: [eigen] Array of complex numbers**

Mail converted by MHonArc 2.6.19+ | http://listengine.tuxfamily.org/ |