[eigen] Status of my Eigen work
I just wanted to give you a bit of info about the status of my work on
Eigen to improve support for ARM NEON.
As I mentioned previously, I am implementing a computer vision algorithm
using an Array-of-Structs-of-Arrays design where the inner Arrays are
designed to map to SIMD packets.
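For readers unfamiliar with the layout, here is a minimal sketch of an Array-of-Structs-of-Arrays container in plain C++. It does not use Eigen, and every name in it (PointBlock, PointCloud, dot_xy, kLanes) is hypothetical. The lane count of 4 assumes 32-bit floats in a 128-bit NEON register.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical names throughout. kLanes = 4 assumes four 32-bit floats
// per 128-bit NEON register.
constexpr std::size_t kLanes = 4;

// Inner "struct of arrays": each field holds one packet's worth of values.
struct PointBlock {
    std::array<float, kLanes> x;
    std::array<float, kLanes> y;
};

// Outer "array of structs": point i lives in block i / kLanes, lane i % kLanes.
using PointCloud = std::vector<PointBlock>;

// Sum of x * y over all points. The inner loop walks contiguous lanes,
// which is the shape that maps onto one packet multiply-add per block.
inline float dot_xy(const PointCloud& cloud) {
    std::array<float, kLanes> acc{};  // one accumulator per lane
    for (const PointBlock& b : cloud) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            acc[lane] += b.x[lane] * b.y[lane];
        }
    }
    float total = 0.0f;
    for (float v : acc) total += v;  // horizontal reduction at the end
    return total;
}
```

In the real code the inner std::array would presumably be a fixed-size Eigen Array, so that each field is loaded as a single packet.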
In order to do this, I have spent the last few weeks adding features to
1. Fixed GCC warnings when building tests (merged)
2. Fixes to IO printing of char and unsigned char Arrays (merged).
3. Improvements to GeneralBlockPanelKernel to support packets that have
more than 4 lanes (MR !25)
4. Various tidy-ups and improvements to existing NEON packetmath. (MR !4)
5. Added support for int8_t, uint8_t, int16_t, uint16_t, and uint32_t
packetmath for NEON.
6. Added support for pnot, pselect, pinsertfirst, and pinsertlast to
NEON packetmath, as well as pcast and preinterpret for all combinations
7. Added a new coeff-wise binary function, array_difference, and
implemented support for ARM NEON with a new pabsdiff packetmath function.
8. Added a new pcombine packetmath function which can join an array of
half/quarter packets into a single packet.
9. Added support in CoreEvaluators for packet math when type conversion
is required, e.g. cast. Previously a single packet type was used from the
assignment evaluator through to the call to ploadt.
10. Added a new pmask_cast packetmath function which is used to cast
mask-packets to different types.
11. Added support for packet math in the evaluation of Select, making
use of pcast_mask where necessary to implicitly cast the condition mask
packet type to match the value types.
12. Added support for packet access in scalar_boolean_and_op and
scalar_boolean_or_op with new packet math functions: plogical_and,
13. (Nearly complete) Added support for bool packetmath to NEON
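To make items 7, 8, 10 and 11 more concrete, here is a scalar model of what those packet operations compute, lane by lane. This is only a sketch: it uses std::array in place of NEON registers, and the names (Packet, abs_diff, combine, widen_mask, select_lanes) are hypothetical, not the Eigen function names. It also makes three assumptions: that array_difference means absolute difference (as the name pabsdiff suggests), that the combined packet stores the low half first, and that masks use the usual all-ones / all-zeros lane convention.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar stand-in for a SIMD packet: N lanes of type T (hypothetical type).
template <typename T, std::size_t N>
using Packet = std::array<T, N>;

// Item 7 (assumed semantics): lane-wise absolute difference. Intended for
// unsigned lanes; the comparison avoids unsigned wrap-around.
template <typename T, std::size_t N>
Packet<T, N> abs_diff(const Packet<T, N>& a, const Packet<T, N>& b) {
    Packet<T, N> r{};
    for (std::size_t i = 0; i < N; ++i)
        r[i] = a[i] > b[i] ? static_cast<T>(a[i] - b[i])
                           : static_cast<T>(b[i] - a[i]);
    return r;
}

// Item 8 (assumed lane order): join two half packets, low half first.
template <typename T, std::size_t N>
Packet<T, 2 * N> combine(const Packet<T, N>& lo, const Packet<T, N>& hi) {
    Packet<T, 2 * N> r{};
    for (std::size_t i = 0; i < N; ++i) {
        r[i] = lo[i];
        r[N + i] = hi[i];
    }
    return r;
}

// Item 10: a mask lane is all-ones (true) or all-zeros (false). Widening a
// mask must preserve that, so 0xFF becomes 0xFFFFFFFF rather than 255.
template <std::size_t N>
Packet<std::uint32_t, N> widen_mask(const Packet<std::uint8_t, N>& m) {
    Packet<std::uint32_t, N> r{};
    for (std::size_t i = 0; i < N; ++i) r[i] = m[i] ? 0xFFFFFFFFu : 0u;
    return r;
}

// Item 11: bitwise select. Takes bits from a where the mask is set and
// from b elsewhere, which only works if the mask lanes match the value width.
template <std::size_t N>
Packet<std::uint32_t, N> select_lanes(const Packet<std::uint32_t, N>& mask,
                                      const Packet<std::uint32_t, N>& a,
                                      const Packet<std::uint32_t, N>& b) {
    Packet<std::uint32_t, N> r{};
    for (std::size_t i = 0; i < N; ++i)
        r[i] = (mask[i] & a[i]) | (~mask[i] & b[i]);
    return r;
}
```

The last two functions show why the mask cast is needed: a comparison on 8-bit lanes yields an 8-bit mask, and it has to be converted as a mask (not as a number) before it can select between 32-bit values.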
The complete branch is available to view here:
In addition to these patches I have a hacky patch to force certain
functions inline. Hopefully I can find a way to make gcc do this
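For context, one common way to force inlining with GCC is the always_inline function attribute. The sketch below illustrates that general technique only; it is not the patch mentioned above, and the macro and function names are hypothetical.

```cpp
// GCC and Clang treat always_inline as a request to inline regardless of
// their usual inlining heuristics; other compilers fall back to plain inline.
#if defined(__GNUC__)
#define SKETCH_FORCE_INLINE inline __attribute__((always_inline))
#else
#define SKETCH_FORCE_INLINE inline
#endif

// Hypothetical small helper of the kind that should never cost a call.
SKETCH_FORCE_INLINE int mul_add(int a, int b, int c) { return a * b + c; }
```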
There is also the GCC bug I discussed previously with the compiler
failing to eliminate NEON loads/stores to the stack:
However, setting these issues aside, with these patches, the generated
machine code is now looking reasonably credible.
So I wanted to talk about a path forward to getting all these things
reviewed and merged upstream. I don't want to dump this branch in a
single merge request, since that would become unmanageable, so I think
it would be better to break it up into a series of merge requests.
However, there is an issue here with the merge requests getting
backlogged. It seems like it will take a very long time to get the 2+9
outstanding merge requests merged unless there is a concerted effort to
get them through the review process.
Items 6-13 can mostly be submitted as parallel merge requests, but items
1-5 must be merged first. At the moment the two existing MRs seem to be
I understand that it's a lot of work to review all this, and maintainers
of this project probably have other things to work on - so if there's
anything I can do to make it easier to get this stuff reviewed, please
let me know.