Re: [eigen] NEON PacketMath.h questions |
[ Thread Index |
Date Index
| More lists.tuxfamily.org/eigen Archives
]
- To: Konstantinos Margaritis <konstantinos.margaritis@xxxxxxxxxxx>
- Subject: Re: [eigen] NEON PacketMath.h questions
- From: Josh Bleecher Snyder <josharian@xxxxxxxxx>
- Date: Wed, 4 Jul 2012 06:50:46 -0700
- Cc: eigen@xxxxxxxxxxxxxxxxxxx
- Dkim-signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type; bh=bimgnkwLmvm33G2IzF1IRA/casAEglSxJXhUEhx/I3c=; b=eoifyTXO/9dpGzpZ9rLw8U9LUunJidGPOyneDoPlENRSkxlpjfVuwvZfP8b3GTayBi BUslleHMeN1fmn+pQWb4tAmTsvv40F0qdEYptahpCIZApasA2zpdgo9VRrEVpqAr9ted uBBsJseeYqMxPCD0V3Fft3RcwEMtNe72FxADH/WbL76lSZz6V4DqmBkAGokPhib5S4Uf /jxcj+ONGxWKYQZiWjDffLoalokT60biK6LLSzrYbuRl3JdDGgmkCiNOWmAkYiCbHlIB v0+96BtbuR0YtubEQn4W2O7riYkw02Y4DACn6olXuK07sE/TgaVppbXW+4/GG2J9FIY9 CkgA==
>> (1)
>>
>> pmadd is currently implemented as:
>>
>> template<> EIGEN_STRONG_INLINE Packet4i pmadd(const Packet4i& a, const
>> Packet4i& b, const Packet4i& c) { return padd(pmul(a,b), c); }
>>
>> However, NEON has a fused multiply accumulate instruction, which is
>> both faster and more accurate. This replacement uses the corresponding
>> intrinsic for it:
>>
>> template<> EIGEN_STRONG_INLINE Packet4f pmadd(const Packet4f& a, const
>> Packet4f& b, const Packet4f& c) { return vmlaq_f32(c, a, b); }
>> template<> EIGEN_STRONG_INLINE Packet4i pmadd(const Packet4i& a, const
>> Packet4i& b, const Packet4i& c) { return vmlaq_s32(c, a, b); }
>>
>> Unfortunately, both gcc and clang generate vmul + vadd instructions
>> out of this intrinsic (even when real accumulation is possible),
>> rather than a single vmla instruction as requested. Still, it's the
>> Right Thing To Do, I think, and compiler support will hopefully
>> improve.
>>
>> One could also force the use of the vmla instruction via inline
>> assembly. This would require distinguishing multiply+add (return a
>> value) from multiply+accumulate (modify one of the inputs).
>> Multiply+accumulate is extremely common in gemm/gemv and probably
>> worth optimizing for. I'm willing to take a crack at adding such a
>> method (pmacc? pmadd_inplace? pfma?), although I might need a few
>> pointers -- I'm new to both C++ templating and Eigen -- and would
>> probably need help with SSE/AltiVec. So before I start on it: Is there
>> interest in such an addition?
>
> I actually fixed that in a commit I forgot to push, thanks for the
> reminder!
Sweet. Got any other good unpushed commits? :)
>> (2)
>>
>> The various NEON predux implementations go by way of a local variable.
>> I think that's unnecessary and over-complicated, and it doesn't get
>> compiled away (at least by gcc at -O3). For example, it appears that:
>>
>> template<> EIGEN_STRONG_INLINE float predux<Packet4f>(const Packet4f&
>> a) {
>> float32x2_t a_lo, a_hi, sum;
>> float s[2];
>>
>> a_lo = vget_low_f32(a);
>> a_hi = vget_high_f32(a);
>> sum = vpadd_f32(a_lo, a_hi);
>> sum = vpadd_f32(sum, sum);
>> vst1_f32(s, sum);
>>
>> return s[0];
>> }
>>
>> could just be:
>>
>> template<> EIGEN_STRONG_INLINE float predux<Packet4f>(const Packet4f&
>> a) {
>> float32x2_t a_lo, a_hi, sum;
>> a_lo = vget_low_f32(a);
>> a_hi = vget_high_f32(a);
>> sum = vpadd_f32(a_lo, a_hi);
>> sum = vpadd_f32(sum, sum);
>> return vget_lane_f32(sum, 0);
>> }
>>
>>
>> Assembly excerpt for the former:
>>
>> add r1, sp, #216
>> vpadd.f32 d20, d20, d21
>> vpadd.f32 d18, d18, d19
>> vpadd.f32 d16, d16, d17
>> vpadd.f32 d21, d22, d22
>> vpadd.f32 d19, d20, d20
>> vpadd.f32 d17, d18, d18
>> vpadd.f32 d16, d16, d16
>> vst1.32 {d21}, [r1]
>> vldr.32 s8, [sp, #216]
>> vst1.32 {d19}, [r1]
>> vadd.f32 d0, d4, d0
>> vldr.32 s10, [sp, #216]
>> vst1.32 {d17}, [r1]
>> vadd.f32 d3, d5, d3
>> vldr.32 s8, [sp, #216]
>> vst1.32 {d16}, [r1]
>> vadd.f32 d2, d4, d2
>> vldr.32 s10, [sp, #216]
>> vadd.f32 d1, d5, d1
>>
>> And for the latter -- contains no gratuitous loads and stores:
>>
>> vpadd.f32 d17, d18, d19
>> vpadd.f32 d18, d20, d21
>> vpadd.f32 d20, d22, d23
>> vpadd.f32 d4, d16, d16
>> vpadd.f32 d5, d17, d17
>> vpadd.f32 d6, d18, d18
>> vpadd.f32 d7, d20, d20
>> vadd.f32 d1, d4, d1
>> vadd.f32 d2, d5, d2
>> vadd.f32 d3, d6, d3
>> vadd.f32 d0, d7, d0
>>
>> My tests using those two implementations show them yielding identical
>> results.
>>
>> Questions: Is there something I don't know about that makes the
>> current way safer or more correct? And if not, is there interest in a
>> patch fixing all of these?
>
> Thanks for the tips, I went and fixed all of NEON PackatMath.h methods
> with your suggestions! Will commit in a moment.
That was fast! Thanks; glad it was of use.
>> (3)
>>
>> Is there a good way to run the existing Eigen unit tests on an iOS or
>> Android device? (E.g. is there an Xcode-friendly test wrapper? Is
>> there interest in one?) I don't have -- or intend to acquire -- a
>> beagle board, so testing NEON changes is currently a bit manual and ad
>> hoc.
>
> Unfortunately I do not own a Mac or an iPhone to test those, my work is
> strictly on ARM devices with Linux or Android. But I don't think anyone
> would mind iOS specific patches.
Cool. I'll see whether I can cook up something simple and unobtrusive...
-josh