[eigen] NEON PacketMath.h questions

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]


Hi,

I've been poking through the NEON PacketMath.h. I have three questions
/ suggestions.


(1)

pmadd is currently implemented as:

template<> EIGEN_STRONG_INLINE Packet4i pmadd(const Packet4i& a, const
Packet4i& b, const Packet4i& c) { return padd(pmul(a,b), c); }

However, NEON has a fused multiply accumulate instruction, which is
both faster and more accurate. This replacement uses the corresponding
intrinsic for it:

template<> EIGEN_STRONG_INLINE Packet4f pmadd(const Packet4f& a, const
Packet4f& b, const Packet4f& c) { return vmlaq_f32(c, a, b); }
template<> EIGEN_STRONG_INLINE Packet4i pmadd(const Packet4i& a, const
Packet4i& b, const Packet4i& c) { return vmlaq_s32(c, a, b); }

Unfortunately, both gcc and clang generate vmul + vadd instructions
out of this intrinsic (even when real accumulation is possible),
rather than a single vmla instruction as requested. Still, it's the
Right Thing To Do, I think, and compiler support will hopefully
improve.

One could also force the use of the vmla instruction via inline
assembly. This would require distinguishing multiply+add (return a
value) from multiply+accumulate (modify one of the inputs).
Multiply+accumulate is extremely common in gemm/gemv and probably
worth optimizing for. I'm willing to take a crack at adding such a
method (pmacc? pmadd_inplace? pfma?), although I might need a few
pointers -- I'm new to both C++ templating and Eigen -- and would
probably need help with SSE/AltiVec. So before I start on it: Is there
interest in such an addition?


(2)

The various NEON predux implementations go by way of a local variable.
I think that's unnecessary and over-complicated, and it doesn't get
compiled away (at least by gcc at -O3). For example, it appears that:

template<> EIGEN_STRONG_INLINE float predux<Packet4f>(const Packet4f& a)
{
  float32x2_t a_lo, a_hi, sum;
  float s[2];

  a_lo = vget_low_f32(a);
  a_hi = vget_high_f32(a);
  sum = vpadd_f32(a_lo, a_hi);
  sum = vpadd_f32(sum, sum);
  vst1_f32(s, sum);

  return s[0];
}

could just be:

template<> EIGEN_STRONG_INLINE float predux<Packet4f>(const Packet4f& a)
{
  float32x2_t a_lo, a_hi, sum;
  a_lo = vget_low_f32(a);
  a_hi = vget_high_f32(a);
  sum = vpadd_f32(a_lo, a_hi);
  sum = vpadd_f32(sum, sum);
  return vget_lane_f32(sum, 0);
}


Assembly excerpt for the former:

	add	r1, sp, #216
	vpadd.f32	d20, d20, d21
	vpadd.f32	d18, d18, d19
	vpadd.f32	d16, d16, d17
	vpadd.f32	d21, d22, d22
	vpadd.f32	d19, d20, d20
	vpadd.f32	d17, d18, d18
	vpadd.f32	d16, d16, d16
	vst1.32	{d21}, [r1]
	vldr.32	s8, [sp, #216]
	vst1.32	{d19}, [r1]
	vadd.f32	d0, d4, d0
	vldr.32	s10, [sp, #216]
	vst1.32	{d17}, [r1]
	vadd.f32	d3, d5, d3
	vldr.32	s8, [sp, #216]
	vst1.32	{d16}, [r1]
	vadd.f32	d2, d4, d2
	vldr.32	s10, [sp, #216]
	vadd.f32	d1, d5, d1

And for the latter -- contains no gratuitous loads and stores:

	vpadd.f32	d17, d18, d19
	vpadd.f32	d18, d20, d21
	vpadd.f32	d20, d22, d23
	vpadd.f32	d4, d16, d16
	vpadd.f32	d5, d17, d17
	vpadd.f32	d6, d18, d18
	vpadd.f32	d7, d20, d20
	vadd.f32	d1, d4, d1
	vadd.f32	d2, d5, d2
	vadd.f32	d3, d6, d3
	vadd.f32	d0, d7, d0

My tests using those two implementations show them yielding identical results.

Questions: Is there something I don't know about that makes the
current way safer or more correct? And if not, is there interest in a
patch fixing all of these?


(3)

Is there a good way to run the existing Eigen unit tests on an iOS or
Android device? (E.g. is there an Xcode-friendly test wrapper? Is
there interest in one?) I don't have -- or intend to acquire -- a
beagle board, so testing NEON changes is currently a bit manual and ad
hoc.


Sorry for the long email.

-josh



Mail converted by MHonArc 2.6.19+ http://listengine.tuxfamily.org/