Re: [eigen] NEON PacketMath.h questions

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]

To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: Re: [eigen] NEON PacketMath.h questions
From: Konstantinos Margaritis <konstantinos.margaritis@xxxxxxxxxxx>
Date: Wed, 4 Jul 2012 11:10:06 +0300
Cc: Josh Bleecher Snyder <josharian@xxxxxxxxx>

On Tue, 3 Jul 2012 15:05:38 -0700
Josh Bleecher Snyder <josharian@xxxxxxxxx> wrote:

> Hi,
> 
> I've been poking through the NEON PacketMath.h. I have three questions
> / suggestions.
> 
> 
> (1)
> 
> pmadd is currently implemented as:
> 
> template<> EIGEN_STRONG_INLINE Packet4i pmadd(const Packet4i& a, const
> Packet4i& b, const Packet4i& c) { return padd(pmul(a,b), c); }
> 
> However, NEON has a fused multiply accumulate instruction, which is
> both faster and more accurate. This replacement uses the corresponding
> intrinsic for it:
> 
> template<> EIGEN_STRONG_INLINE Packet4f pmadd(const Packet4f& a, const
> Packet4f& b, const Packet4f& c) { return vmlaq_f32(c, a, b); }
> template<> EIGEN_STRONG_INLINE Packet4i pmadd(const Packet4i& a, const
> Packet4i& b, const Packet4i& c) { return vmlaq_s32(c, a, b); }
> 
> Unfortunately, both gcc and clang generate vmul + vadd instructions
> out of this intrinsic (even when real accumulation is possible),
> rather than a single vmla instruction as requested. Still, it's the
> Right Thing To Do, I think, and compiler support will hopefully
> improve.
> 
> One could also force the use of the vmla instruction via inline
> assembly. This would require distinguishing multiply+add (return a
> value) from multiply+accumulate (modify one of the inputs).
> Multiply+accumulate is extremely common in gemm/gemv and probably
> worth optimizing for. I'm willing to take a crack at adding such a
> method (pmacc? pmadd_inplace? pfma?), although I might need a few
> pointers -- I'm new to both C++ templating and Eigen -- and would
> probably need help with SSE/AltiVec. So before I start on it: Is there
> interest in such an addition?

I actually fixed that in a commit I forgot to push, thanks for the
reminder!
> 
> (2)
> 
> The various NEON predux implementations go by way of a local variable.
> I think that's unnecessary and over-complicated, and it doesn't get
> compiled away (at least by gcc at -O3). For example, it appears that:
> 
> template<> EIGEN_STRONG_INLINE float predux<Packet4f>(const Packet4f&
> a) {
>   float32x2_t a_lo, a_hi, sum;
>   float s[2];
> 
>   a_lo = vget_low_f32(a);
>   a_hi = vget_high_f32(a);
>   sum = vpadd_f32(a_lo, a_hi);
>   sum = vpadd_f32(sum, sum);
>   vst1_f32(s, sum);
> 
>   return s[0];
> }
> 
> could just be:
> 
> template<> EIGEN_STRONG_INLINE float predux<Packet4f>(const Packet4f&
> a) {
>   float32x2_t a_lo, a_hi, sum;
>   a_lo = vget_low_f32(a);
>   a_hi = vget_high_f32(a);
>   sum = vpadd_f32(a_lo, a_hi);
>   sum = vpadd_f32(sum, sum);
>   return vget_lane_f32(sum, 0);
> }
> 
> 
> Assembly excerpt for the former:
> 
> 	add	r1, sp, #216
> 	vpadd.f32	d20, d20, d21
> 	vpadd.f32	d18, d18, d19
> 	vpadd.f32	d16, d16, d17
> 	vpadd.f32	d21, d22, d22
> 	vpadd.f32	d19, d20, d20
> 	vpadd.f32	d17, d18, d18
> 	vpadd.f32	d16, d16, d16
> 	vst1.32	{d21}, [r1]
> 	vldr.32	s8, [sp, #216]
> 	vst1.32	{d19}, [r1]
> 	vadd.f32	d0, d4, d0
> 	vldr.32	s10, [sp, #216]
> 	vst1.32	{d17}, [r1]
> 	vadd.f32	d3, d5, d3
> 	vldr.32	s8, [sp, #216]
> 	vst1.32	{d16}, [r1]
> 	vadd.f32	d2, d4, d2
> 	vldr.32	s10, [sp, #216]
> 	vadd.f32	d1, d5, d1
> 
> And for the latter -- contains no gratuitous loads and stores:
> 
> 	vpadd.f32	d17, d18, d19
> 	vpadd.f32	d18, d20, d21
> 	vpadd.f32	d20, d22, d23
> 	vpadd.f32	d4, d16, d16
> 	vpadd.f32	d5, d17, d17
> 	vpadd.f32	d6, d18, d18
> 	vpadd.f32	d7, d20, d20
> 	vadd.f32	d1, d4, d1
> 	vadd.f32	d2, d5, d2
> 	vadd.f32	d3, d6, d3
> 	vadd.f32	d0, d7, d0
> 
> My tests using those two implementations show them yielding identical
> results.
> 
> Questions: Is there something I don't know about that makes the
> current way safer or more correct? And if not, is there interest in a
> patch fixing all of these?

Thanks for the tips, I went and fixed all of NEON PackatMath.h methods
with your suggestions! Will commit in a moment.

> 
> 
> (3)
> 
> Is there a good way to run the existing Eigen unit tests on an iOS or
> Android device? (E.g. is there an Xcode-friendly test wrapper? Is
> there interest in one?) I don't have -- or intend to acquire -- a
> beagle board, so testing NEON changes is currently a bit manual and ad
> hoc.

Unfortunately I do not own a Mac or an iPhone to test those, my work is
strictly on ARM devices with Linux or Android. But I don't think anyone
would mind iOS specific patches.

Regards

Konstantinos

Follow-Ups:
- Re: [eigen] NEON PacketMath.h questions
  - From: Josh Bleecher Snyder

References:
- [eigen] NEON PacketMath.h questions
  - From: Josh Bleecher Snyder

Messages sorted by: [ date | thread ]
Prev by Date: [eigen] nvorp安图吉盛建筑公司任辰东
Next by Date: [eigen] aqjmy崇明县化轻公司刘岚雨
Previous by thread: [eigen] NEON PacketMath.h questions
Next by thread: Re: [eigen] NEON PacketMath.h questions

Mail converted by MHonArc 2.6.19+

http://listengine.tuxfamily.org/