Re: [eigen] NEON PacketMath.h questions |

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]

*To*: Konstantinos Margaritis <konstantinos.margaritis@xxxxxxxxxxx>*Subject*: Re: [eigen] NEON PacketMath.h questions*From*: Josh Bleecher Snyder <josharian@xxxxxxxxx>*Date*: Wed, 4 Jul 2012 06:50:46 -0700*Cc*: eigen@xxxxxxxxxxxxxxxxxxx*Dkim-signature*: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type; bh=bimgnkwLmvm33G2IzF1IRA/casAEglSxJXhUEhx/I3c=; b=eoifyTXO/9dpGzpZ9rLw8U9LUunJidGPOyneDoPlENRSkxlpjfVuwvZfP8b3GTayBi BUslleHMeN1fmn+pQWb4tAmTsvv40F0qdEYptahpCIZApasA2zpdgo9VRrEVpqAr9ted uBBsJseeYqMxPCD0V3Fft3RcwEMtNe72FxADH/WbL76lSZz6V4DqmBkAGokPhib5S4Uf /jxcj+ONGxWKYQZiWjDffLoalokT60biK6LLSzrYbuRl3JdDGgmkCiNOWmAkYiCbHlIB v0+96BtbuR0YtubEQn4W2O7riYkw02Y4DACn6olXuK07sE/TgaVppbXW+4/GG2J9FIY9 CkgA==

>> (1) >> >> pmadd is currently implemented as: >> >> template<> EIGEN_STRONG_INLINE Packet4i pmadd(const Packet4i& a, const >> Packet4i& b, const Packet4i& c) { return padd(pmul(a,b), c); } >> >> However, NEON has a fused multiply accumulate instruction, which is >> both faster and more accurate. This replacement uses the corresponding >> intrinsic for it: >> >> template<> EIGEN_STRONG_INLINE Packet4f pmadd(const Packet4f& a, const >> Packet4f& b, const Packet4f& c) { return vmlaq_f32(c, a, b); } >> template<> EIGEN_STRONG_INLINE Packet4i pmadd(const Packet4i& a, const >> Packet4i& b, const Packet4i& c) { return vmlaq_s32(c, a, b); } >> >> Unfortunately, both gcc and clang generate vmul + vadd instructions >> out of this intrinsic (even when real accumulation is possible), >> rather than a single vmla instruction as requested. Still, it's the >> Right Thing To Do, I think, and compiler support will hopefully >> improve. >> >> One could also force the use of the vmla instruction via inline >> assembly. This would require distinguishing multiply+add (return a >> value) from multiply+accumulate (modify one of the inputs). >> Multiply+accumulate is extremely common in gemm/gemv and probably >> worth optimizing for. I'm willing to take a crack at adding such a >> method (pmacc? pmadd_inplace? pfma?), although I might need a few >> pointers -- I'm new to both C++ templating and Eigen -- and would >> probably need help with SSE/AltiVec. So before I start on it: Is there >> interest in such an addition? > > I actually fixed that in a commit I forgot to push, thanks for the > reminder! Sweet. Got any other good unpushed commits? :) >> (2) >> >> The various NEON predux implementations go by way of a local variable. >> I think that's unnecessary and over-complicated, and it doesn't get >> compiled away (at least by gcc at -O3). For example, it appears that: >> >> template<> EIGEN_STRONG_INLINE float predux<Packet4f>(const Packet4f& >> a) { >> float32x2_t a_lo, a_hi, sum; >> float s[2]; >> >> a_lo = vget_low_f32(a); >> a_hi = vget_high_f32(a); >> sum = vpadd_f32(a_lo, a_hi); >> sum = vpadd_f32(sum, sum); >> vst1_f32(s, sum); >> >> return s[0]; >> } >> >> could just be: >> >> template<> EIGEN_STRONG_INLINE float predux<Packet4f>(const Packet4f& >> a) { >> float32x2_t a_lo, a_hi, sum; >> a_lo = vget_low_f32(a); >> a_hi = vget_high_f32(a); >> sum = vpadd_f32(a_lo, a_hi); >> sum = vpadd_f32(sum, sum); >> return vget_lane_f32(sum, 0); >> } >> >> >> Assembly excerpt for the former: >> >> add r1, sp, #216 >> vpadd.f32 d20, d20, d21 >> vpadd.f32 d18, d18, d19 >> vpadd.f32 d16, d16, d17 >> vpadd.f32 d21, d22, d22 >> vpadd.f32 d19, d20, d20 >> vpadd.f32 d17, d18, d18 >> vpadd.f32 d16, d16, d16 >> vst1.32 {d21}, [r1] >> vldr.32 s8, [sp, #216] >> vst1.32 {d19}, [r1] >> vadd.f32 d0, d4, d0 >> vldr.32 s10, [sp, #216] >> vst1.32 {d17}, [r1] >> vadd.f32 d3, d5, d3 >> vldr.32 s8, [sp, #216] >> vst1.32 {d16}, [r1] >> vadd.f32 d2, d4, d2 >> vldr.32 s10, [sp, #216] >> vadd.f32 d1, d5, d1 >> >> And for the latter -- contains no gratuitous loads and stores: >> >> vpadd.f32 d17, d18, d19 >> vpadd.f32 d18, d20, d21 >> vpadd.f32 d20, d22, d23 >> vpadd.f32 d4, d16, d16 >> vpadd.f32 d5, d17, d17 >> vpadd.f32 d6, d18, d18 >> vpadd.f32 d7, d20, d20 >> vadd.f32 d1, d4, d1 >> vadd.f32 d2, d5, d2 >> vadd.f32 d3, d6, d3 >> vadd.f32 d0, d7, d0 >> >> My tests using those two implementations show them yielding identical >> results. >> >> Questions: Is there something I don't know about that makes the >> current way safer or more correct? And if not, is there interest in a >> patch fixing all of these? > > Thanks for the tips, I went and fixed all of NEON PackatMath.h methods > with your suggestions! Will commit in a moment. That was fast! Thanks; glad it was of use. >> (3) >> >> Is there a good way to run the existing Eigen unit tests on an iOS or >> Android device? (E.g. is there an Xcode-friendly test wrapper? Is >> there interest in one?) I don't have -- or intend to acquire -- a >> beagle board, so testing NEON changes is currently a bit manual and ad >> hoc. > > Unfortunately I do not own a Mac or an iPhone to test those, my work is > strictly on ARM devices with Linux or Android. But I don't think anyone > would mind iOS specific patches. Cool. I'll see whether I can cook up something simple and unobtrusive... -josh

**References**:**[eigen] NEON PacketMath.h questions***From:*Josh Bleecher Snyder

**Re: [eigen] NEON PacketMath.h questions***From:*Konstantinos Margaritis

**Messages sorted by:**[ date | thread ]- Prev by Date:
**[eigen] Buy Ciails and Viarga online!** - Next by Date:
**Re: [eigen] NEON PacketMath.h questions** - Previous by thread:
**Re: [eigen] NEON PacketMath.h questions** - Next by thread:
**Re: [eigen] NEON PacketMath.h questions**

Mail converted by MHonArc 2.6.19+ | http://listengine.tuxfamily.org/ |