[eigen] NEON PacketMath.h questions

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]

To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: [eigen] NEON PacketMath.h questions
From: Josh Bleecher Snyder <josharian@xxxxxxxxx>
Date: Tue, 3 Jul 2012 15:05:38 -0700
Dkim-signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:from:date:message-id:subject:to:content-type; bh=t3hSz/XaR6nAsrSfWXqaMXrvX6fyGKOIJEhWHpR32bs=; b=XHFiDBnoTkO83t5TKg6QfHI8GDxprZl3Clq7A3bJmxk8byZOvcLdO0mHmkfc+y1eoI Stpnem0F24CrA+dWfHwxcPT18XM6zYH5IhBHCcSilu2LBtOcL2X0BtKSFy1MR8sqmCI8 5RHCZEPZsAq/IuY3+ZHHCnH5aWWji3A5ZnwontEcyqqqdUBl/xfOt63YAVqaJ8rAgg5I qPGWj2sG68IHifURTmbUeBxzwEGZ4EVhyamdOMK+qO62NWmFhYS49sjmmzFrwRyzHCqk udhM6fOvA0oHN2znZAU5DHD+Icj52EBeNUbbOlJTi9UtF2fo4dqI1uhE2siaH7TOASwv swyg==

Hi,

I've been poking through the NEON PacketMath.h. I have three questions
/ suggestions.


(1)

pmadd is currently implemented as:

template<> EIGEN_STRONG_INLINE Packet4i pmadd(const Packet4i& a, const
Packet4i& b, const Packet4i& c) { return padd(pmul(a,b), c); }

However, NEON has a fused multiply accumulate instruction, which is
both faster and more accurate. This replacement uses the corresponding
intrinsic for it:

template<> EIGEN_STRONG_INLINE Packet4f pmadd(const Packet4f& a, const
Packet4f& b, const Packet4f& c) { return vmlaq_f32(c, a, b); }
template<> EIGEN_STRONG_INLINE Packet4i pmadd(const Packet4i& a, const
Packet4i& b, const Packet4i& c) { return vmlaq_s32(c, a, b); }

Unfortunately, both gcc and clang generate vmul + vadd instructions
out of this intrinsic (even when real accumulation is possible),
rather than a single vmla instruction as requested. Still, it's the
Right Thing To Do, I think, and compiler support will hopefully
improve.

One could also force the use of the vmla instruction via inline
assembly. This would require distinguishing multiply+add (return a
value) from multiply+accumulate (modify one of the inputs).
Multiply+accumulate is extremely common in gemm/gemv and probably
worth optimizing for. I'm willing to take a crack at adding such a
method (pmacc? pmadd_inplace? pfma?), although I might need a few
pointers -- I'm new to both C++ templating and Eigen -- and would
probably need help with SSE/AltiVec. So before I start on it: Is there
interest in such an addition?


(2)

The various NEON predux implementations go by way of a local variable.
I think that's unnecessary and over-complicated, and it doesn't get
compiled away (at least by gcc at -O3). For example, it appears that:

template<> EIGEN_STRONG_INLINE float predux<Packet4f>(const Packet4f& a)
{
  float32x2_t a_lo, a_hi, sum;
  float s[2];

  a_lo = vget_low_f32(a);
  a_hi = vget_high_f32(a);
  sum = vpadd_f32(a_lo, a_hi);
  sum = vpadd_f32(sum, sum);
  vst1_f32(s, sum);

  return s[0];
}

could just be:

template<> EIGEN_STRONG_INLINE float predux<Packet4f>(const Packet4f& a)
{
  float32x2_t a_lo, a_hi, sum;
  a_lo = vget_low_f32(a);
  a_hi = vget_high_f32(a);
  sum = vpadd_f32(a_lo, a_hi);
  sum = vpadd_f32(sum, sum);
  return vget_lane_f32(sum, 0);
}


Assembly excerpt for the former:

	add	r1, sp, #216
	vpadd.f32	d20, d20, d21
	vpadd.f32	d18, d18, d19
	vpadd.f32	d16, d16, d17
	vpadd.f32	d21, d22, d22
	vpadd.f32	d19, d20, d20
	vpadd.f32	d17, d18, d18
	vpadd.f32	d16, d16, d16
	vst1.32	{d21}, [r1]
	vldr.32	s8, [sp, #216]
	vst1.32	{d19}, [r1]
	vadd.f32	d0, d4, d0
	vldr.32	s10, [sp, #216]
	vst1.32	{d17}, [r1]
	vadd.f32	d3, d5, d3
	vldr.32	s8, [sp, #216]
	vst1.32	{d16}, [r1]
	vadd.f32	d2, d4, d2
	vldr.32	s10, [sp, #216]
	vadd.f32	d1, d5, d1

And for the latter -- contains no gratuitous loads and stores:

	vpadd.f32	d17, d18, d19
	vpadd.f32	d18, d20, d21
	vpadd.f32	d20, d22, d23
	vpadd.f32	d4, d16, d16
	vpadd.f32	d5, d17, d17
	vpadd.f32	d6, d18, d18
	vpadd.f32	d7, d20, d20
	vadd.f32	d1, d4, d1
	vadd.f32	d2, d5, d2
	vadd.f32	d3, d6, d3
	vadd.f32	d0, d7, d0

My tests using those two implementations show them yielding identical results.

Questions: Is there something I don't know about that makes the
current way safer or more correct? And if not, is there interest in a
patch fixing all of these?


(3)

Is there a good way to run the existing Eigen unit tests on an iOS or
Android device? (E.g. is there an Xcode-friendly test wrapper? Is
there interest in one?) I don't have -- or intend to acquire -- a
beagle board, so testing NEON changes is currently a bit manual and ad
hoc.


Sorry for the long email.

-josh

Follow-Ups:
- Re: [eigen] NEON PacketMath.h questions
  - From: Konstantinos Margaritis
- Re: [eigen] NEON PacketMath.h questions
  - From: Gael Guennebaud

Messages sorted by: [ date | thread ]
Prev by Date: Re: [eigen] Evaluator design
Next by Date: [eigen] nvorp安图吉盛建筑公司任辰东
Previous by thread: Re: [eigen] Evaluator design
Next by thread: Re: [eigen] NEON PacketMath.h questions

Mail converted by MHonArc 2.6.19+

http://listengine.tuxfamily.org/