[eigen] NEON PacketMath.h questions |
[ Thread Index |
Date Index
| More lists.tuxfamily.org/eigen Archives
]
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: [eigen] NEON PacketMath.h questions
- From: Josh Bleecher Snyder <josharian@xxxxxxxxx>
- Date: Tue, 3 Jul 2012 15:05:38 -0700
- Dkim-signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:from:date:message-id:subject:to:content-type; bh=t3hSz/XaR6nAsrSfWXqaMXrvX6fyGKOIJEhWHpR32bs=; b=XHFiDBnoTkO83t5TKg6QfHI8GDxprZl3Clq7A3bJmxk8byZOvcLdO0mHmkfc+y1eoI Stpnem0F24CrA+dWfHwxcPT18XM6zYH5IhBHCcSilu2LBtOcL2X0BtKSFy1MR8sqmCI8 5RHCZEPZsAq/IuY3+ZHHCnH5aWWji3A5ZnwontEcyqqqdUBl/xfOt63YAVqaJ8rAgg5I qPGWj2sG68IHifURTmbUeBxzwEGZ4EVhyamdOMK+qO62NWmFhYS49sjmmzFrwRyzHCqk udhM6fOvA0oHN2znZAU5DHD+Icj52EBeNUbbOlJTi9UtF2fo4dqI1uhE2siaH7TOASwv swyg==
Hi,
I've been poking through the NEON PacketMath.h. I have three questions
/ suggestions.
(1)
pmadd is currently implemented as:
template<> EIGEN_STRONG_INLINE Packet4i pmadd(const Packet4i& a, const
Packet4i& b, const Packet4i& c) { return padd(pmul(a,b), c); }
However, NEON has a fused multiply accumulate instruction, which is
both faster and more accurate. This replacement uses the corresponding
intrinsic for it:
template<> EIGEN_STRONG_INLINE Packet4f pmadd(const Packet4f& a, const
Packet4f& b, const Packet4f& c) { return vmlaq_f32(c, a, b); }
template<> EIGEN_STRONG_INLINE Packet4i pmadd(const Packet4i& a, const
Packet4i& b, const Packet4i& c) { return vmlaq_s32(c, a, b); }
Unfortunately, both gcc and clang generate vmul + vadd instructions
out of this intrinsic (even when real accumulation is possible),
rather than a single vmla instruction as requested. Still, it's the
Right Thing To Do, I think, and compiler support will hopefully
improve.
One could also force the use of the vmla instruction via inline
assembly. This would require distinguishing multiply+add (return a
value) from multiply+accumulate (modify one of the inputs).
Multiply+accumulate is extremely common in gemm/gemv and probably
worth optimizing for. I'm willing to take a crack at adding such a
method (pmacc? pmadd_inplace? pfma?), although I might need a few
pointers -- I'm new to both C++ templating and Eigen -- and would
probably need help with SSE/AltiVec. So before I start on it: Is there
interest in such an addition?
(2)
The various NEON predux implementations go by way of a local variable.
I think that's unnecessary and over-complicated, and it doesn't get
compiled away (at least by gcc at -O3). For example, it appears that:
template<> EIGEN_STRONG_INLINE float predux<Packet4f>(const Packet4f& a)
{
float32x2_t a_lo, a_hi, sum;
float s[2];
a_lo = vget_low_f32(a);
a_hi = vget_high_f32(a);
sum = vpadd_f32(a_lo, a_hi);
sum = vpadd_f32(sum, sum);
vst1_f32(s, sum);
return s[0];
}
could just be:
template<> EIGEN_STRONG_INLINE float predux<Packet4f>(const Packet4f& a)
{
float32x2_t a_lo, a_hi, sum;
a_lo = vget_low_f32(a);
a_hi = vget_high_f32(a);
sum = vpadd_f32(a_lo, a_hi);
sum = vpadd_f32(sum, sum);
return vget_lane_f32(sum, 0);
}
Assembly excerpt for the former:
add r1, sp, #216
vpadd.f32 d20, d20, d21
vpadd.f32 d18, d18, d19
vpadd.f32 d16, d16, d17
vpadd.f32 d21, d22, d22
vpadd.f32 d19, d20, d20
vpadd.f32 d17, d18, d18
vpadd.f32 d16, d16, d16
vst1.32 {d21}, [r1]
vldr.32 s8, [sp, #216]
vst1.32 {d19}, [r1]
vadd.f32 d0, d4, d0
vldr.32 s10, [sp, #216]
vst1.32 {d17}, [r1]
vadd.f32 d3, d5, d3
vldr.32 s8, [sp, #216]
vst1.32 {d16}, [r1]
vadd.f32 d2, d4, d2
vldr.32 s10, [sp, #216]
vadd.f32 d1, d5, d1
And for the latter -- contains no gratuitous loads and stores:
vpadd.f32 d17, d18, d19
vpadd.f32 d18, d20, d21
vpadd.f32 d20, d22, d23
vpadd.f32 d4, d16, d16
vpadd.f32 d5, d17, d17
vpadd.f32 d6, d18, d18
vpadd.f32 d7, d20, d20
vadd.f32 d1, d4, d1
vadd.f32 d2, d5, d2
vadd.f32 d3, d6, d3
vadd.f32 d0, d7, d0
My tests using those two implementations show them yielding identical results.
Questions: Is there something I don't know about that makes the
current way safer or more correct? And if not, is there interest in a
patch fixing all of these?
(3)
Is there a good way to run the existing Eigen unit tests on an iOS or
Android device? (E.g. is there an Xcode-friendly test wrapper? Is
there interest in one?) I don't have -- or intend to acquire -- a
beagle board, so testing NEON changes is currently a bit manual and ad
hoc.
Sorry for the long email.
-josh