Re: [eigen] NEON PacketMath.h questions

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]


>> (1)
>>
>> pmadd is currently implemented as:
>>
>> template<> EIGEN_STRONG_INLINE Packet4i pmadd(const Packet4i& a, const
>> Packet4i& b, const Packet4i& c) { return padd(pmul(a,b), c); }
>>
>> However, NEON has a fused multiply accumulate instruction, which is
>> both faster and more accurate. This replacement uses the corresponding
>> intrinsic for it:
>>
>> template<> EIGEN_STRONG_INLINE Packet4f pmadd(const Packet4f& a, const
>> Packet4f& b, const Packet4f& c) { return vmlaq_f32(c, a, b); }
>> template<> EIGEN_STRONG_INLINE Packet4i pmadd(const Packet4i& a, const
>> Packet4i& b, const Packet4i& c) { return vmlaq_s32(c, a, b); }
>>
>> Unfortunately, both gcc and clang generate vmul + vadd instructions
>> out of this intrinsic (even when real accumulation is possible),
>> rather than a single vmla instruction as requested. Still, it's the
>> Right Thing To Do, I think, and compiler support will hopefully
>> improve.
>>
>> One could also force the use of the vmla instruction via inline
>> assembly. This would require distinguishing multiply+add (return a
>> value) from multiply+accumulate (modify one of the inputs).
>> Multiply+accumulate is extremely common in gemm/gemv and probably
>> worth optimizing for. I'm willing to take a crack at adding such a
>> method (pmacc? pmadd_inplace? pfma?), although I might need a few
>> pointers -- I'm new to both C++ templating and Eigen -- and would
>> probably need help with SSE/AltiVec. So before I start on it: Is there
>> interest in such an addition?
>
> I actually fixed that in a commit I forgot to push, thanks for the
> reminder!

Sweet. Got any other good unpushed commits? :)


>> (2)
>>
>> The various NEON predux implementations go by way of a local variable.
>> I think that's unnecessary and over-complicated, and it doesn't get
>> compiled away (at least by gcc at -O3). For example, it appears that:
>>
>> template<> EIGEN_STRONG_INLINE float predux<Packet4f>(const Packet4f&
>> a) {
>>   float32x2_t a_lo, a_hi, sum;
>>   float s[2];
>>
>>   a_lo = vget_low_f32(a);
>>   a_hi = vget_high_f32(a);
>>   sum = vpadd_f32(a_lo, a_hi);
>>   sum = vpadd_f32(sum, sum);
>>   vst1_f32(s, sum);
>>
>>   return s[0];
>> }
>>
>> could just be:
>>
>> template<> EIGEN_STRONG_INLINE float predux<Packet4f>(const Packet4f&
>> a) {
>>   float32x2_t a_lo, a_hi, sum;
>>   a_lo = vget_low_f32(a);
>>   a_hi = vget_high_f32(a);
>>   sum = vpadd_f32(a_lo, a_hi);
>>   sum = vpadd_f32(sum, sum);
>>   return vget_lane_f32(sum, 0);
>> }
>>
>>
>> Assembly excerpt for the former:
>>
>>       add     r1, sp, #216
>>       vpadd.f32       d20, d20, d21
>>       vpadd.f32       d18, d18, d19
>>       vpadd.f32       d16, d16, d17
>>       vpadd.f32       d21, d22, d22
>>       vpadd.f32       d19, d20, d20
>>       vpadd.f32       d17, d18, d18
>>       vpadd.f32       d16, d16, d16
>>       vst1.32 {d21}, [r1]
>>       vldr.32 s8, [sp, #216]
>>       vst1.32 {d19}, [r1]
>>       vadd.f32        d0, d4, d0
>>       vldr.32 s10, [sp, #216]
>>       vst1.32 {d17}, [r1]
>>       vadd.f32        d3, d5, d3
>>       vldr.32 s8, [sp, #216]
>>       vst1.32 {d16}, [r1]
>>       vadd.f32        d2, d4, d2
>>       vldr.32 s10, [sp, #216]
>>       vadd.f32        d1, d5, d1
>>
>> And for the latter -- contains no gratuitous loads and stores:
>>
>>       vpadd.f32       d17, d18, d19
>>       vpadd.f32       d18, d20, d21
>>       vpadd.f32       d20, d22, d23
>>       vpadd.f32       d4, d16, d16
>>       vpadd.f32       d5, d17, d17
>>       vpadd.f32       d6, d18, d18
>>       vpadd.f32       d7, d20, d20
>>       vadd.f32        d1, d4, d1
>>       vadd.f32        d2, d5, d2
>>       vadd.f32        d3, d6, d3
>>       vadd.f32        d0, d7, d0
>>
>> My tests using those two implementations show them yielding identical
>> results.
>>
>> Questions: Is there something I don't know about that makes the
>> current way safer or more correct? And if not, is there interest in a
>> patch fixing all of these?
>
> Thanks for the tips, I went and fixed all of NEON PackatMath.h methods
> with your suggestions! Will commit in a moment.

That was fast! Thanks; glad it was of use.



>> (3)
>>
>> Is there a good way to run the existing Eigen unit tests on an iOS or
>> Android device? (E.g. is there an Xcode-friendly test wrapper? Is
>> there interest in one?) I don't have -- or intend to acquire -- a
>> beagle board, so testing NEON changes is currently a bit manual and ad
>> hoc.
>
> Unfortunately I do not own a Mac or an iPhone to test those, my work is
> strictly on ARM devices with Linux or Android. But I don't think anyone
> would mind iOS specific patches.

Cool. I'll see whether I can cook up something simple and unobtrusive...


-josh



Mail converted by MHonArc 2.6.19+ http://listengine.tuxfamily.org/