Re: [eigen] Eigen Arm SVE backend RFC

+Miguel directly.

On Mon, Jun 22, 2020 at 3:15 PM Rasmus Munk Larsen <rmlarsen@xxxxxxxxxx> wrote:

Miguel,

Thank you very much for the RFC. I think that support for Arm SVE would be a useful addition to Eigen. As you mention, doing it with fixed-sized vectors will probably be necessary to match the existing Eigen architecture. Could we make the vector length a build config macro without a lot of code duplication for different lengths?

Could I ask your team to submit this as a merge request against head on the main branch for easier review and testing?

Best regards,
Rasmus

On Wed, Jun 17, 2020 at 2:48 AM Miguel Tairum-Cruz <Miguel.Tairum-Cruz@xxxxxxx> wrote:

Hi all,

I would like to present to the Eigen community a Request for Comments (RFC) for a new proof-of-concept vector backend based on the Arm Scalable Vector Extension (SVE) architecture.

With Eigen being widely used across multiple projects such as TensorFlow, we believe that adding support for this new vector-length (VL) agnostic architecture will benefit performance on upcoming Arm micro-architectures and systems.

This proof-of-concept SVE backend keeps in line with the existing vector backends, using the Arm C Language Extensions (ACLE) for SVE to optimize Eigen’s functions.
Using the NEON backend as a starting point, we have ported most of the NEON functions to SVE. Please be aware that this work is built upon a version of Eigen from December 2019 / January 2020. Upstream commits made to the NEON backend since then are not yet reflected in this version.

The introduced changes are provided in the form of patch files, specifically for two SVE vector lengths: 128-bit and 512-bit. You can find more information on how to apply them in the provided README file.

One caveat of this initial version is the requirement for fixed SVE vector lengths. The Eigen codebase and its vector optimizations are not fully compatible with the vector-length agnostic data types that SVE introduces, which is a barrier to full support upstream. Optimizing the SVE backend for specific VLs (in this case 128-bit and 512-bit) is a necessary workaround for this initial proof-of-concept.

An additional goal of this work is to integrate the Eigen SVE backend with TensorFlow. So far, due to the caveats stated above, we have not been able to integrate TensorFlow with Eigen SVE. However, the recent release of GCC 10.1 brings a new feature to enable fixed vector sizes at compile time, which we believe will allow building TensorFlow with the proof-of-concept fixed-VL SVE implementation of Eigen.

Below is the formal RFC document, where we detail the design choices and discuss drawbacks and potential solutions to enable a complete implementation of an SVE backend for Eigen.

Regards,

Miguel

--------

Eigen Arm SVE backend RFC

- Authors: Miguel Tairum (miguel.tairum-cruz@xxxxxxx)

- Updated: 2020-05-15

Summary

The purpose of this RFC is to share an experimental proof-of-concept Arm Scalable Vector Extension (SVE) backend for Eigen and to gather feedback and ideas from the Eigen development community on how to properly support scalable vectors in the Eigen library codebase.

More information on how to apply the RFC patch can be found in the README file.

Motivation

SVE is the next-generation SIMD architectural extension to the Armv8 architecture, introducing scalable vector length, per-lane predication, gather-loads and scatter-stores, amongst other features.

Eigen is a mature linear algebra library that supports many vector architectures, including Arm NEON. Because Eigen is used in multiple projects, including TensorFlow, we believe that supporting SVE could not only improve compatibility with future micro-architectures, but also enable better performance.

Guide-level explanation

In this initial assessment, we present a proof-of-concept SVE port of the PacketMath backend in Eigen, using the Arm C Language Extensions (ACLE). As in the existing vector backends, SVE intrinsics are implemented in Eigen's PacketMath, MathFunctions and TypeCasting source files. In this initial release, complex math is not available (due to time constraints).

This proof-of-concept release provides a "fixed-sized" SVE backend, with vector lengths of 128 and 512 bits. This means that the implemented functions are validated only when executed at those specific SVE vector lengths, as optimizations were only made for them. To facilitate this, we provide a patch file for each VL. All currently implemented NEON functions except for the complex math (Complex.h) are included in the SVE backend. The backend is up to date with commit 312c8e77 from December 2019, plus the changes introduced to the NEON backend up until commit da5a7afe from 10 January 2020 (these are included in the patch files). Commit 312c8e77 was chosen to be compatible with TensorFlow 1.x, which uses a similar version of Eigen; the later NEON changes are those available at the time of this work. This initial release also contains an updated PacketMath test, with SVE validation.

Reference-level explanation

The changes presented in this RFC are based on commit 312c8e77 in the master branch.

The Eigen SVE backend can be found at Eigen/src/Core/arch/SVE.
SVE intrinsics are implemented for float, int and double elements. As in the NEON backend at this time, half packets are not implemented. Therefore, the available packet sizes for 512-bit VL are 16 elements for int/float and 8 elements for double; for 128-bit VL they are 4 elements for int/float and 2 elements for double.
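
The packet sizes above are simply the vector length divided by the element width. As a rough illustration of how a single build-time setting could drive them, here is a minimal sketch in portable C++. The macro EXAMPLE_SVE_VL_BITS and the helpers lanes and lanes_checked are hypothetical names invented for this example; they are not part of Eigen or of the patches.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical build-time knob (not an existing Eigen macro).
// It could be set on the command line, e.g. -DEXAMPLE_SVE_VL_BITS=128.
#ifndef EXAMPLE_SVE_VL_BITS
#define EXAMPLE_SVE_VL_BITS 512
#endif

// SVE vector lengths are multiples of 128 bits, from 128 up to 2048.
static_assert(EXAMPLE_SVE_VL_BITS >= 128 && EXAMPLE_SVE_VL_BITS <= 2048 &&
                  EXAMPLE_SVE_VL_BITS % 128 == 0,
              "unsupported SVE vector length");

// Number of elements of type T that fit in a vector of vl_bits bits.
template <typename T>
constexpr std::size_t lanes(std::size_t vl_bits) {
  return vl_bits / (8 * sizeof(T));
}

// Packet sizes for the configured vector length.
constexpr std::size_t kFloatLanes = lanes<float>(EXAMPLE_SVE_VL_BITS);
constexpr std::size_t kIntLanes = lanes<std::int32_t>(EXAMPLE_SVE_VL_BITS);
constexpr std::size_t kDoubleLanes = lanes<double>(EXAMPLE_SVE_VL_BITS);

// Runtime variant that rejects vector lengths SVE cannot have.
template <typename T>
std::size_t lanes_checked(std::size_t vl_bits) {
  assert(vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0);
  return lanes<T>(vl_bits);
}
```

With the default of 512 bits this gives 16 lanes for float and int and 8 for double; building with -DEXAMPLE_SVE_VL_BITS=128 gives 4 and 2, matching the two patch variants.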

For most functions, SVE intrinsics are analogous to the ones used in the NEON backend. More complex functions have comments that explain the logic behind their implementation.

Regarding the ptranspose function, the PacketBlock structure was duplicated and modified into PacketBlockSVE, a new structure of SVE vector pointers. This structure is in Eigen/src/Core/GenericPacketMath.h. It is required to support the vector-length agnostic data types introduced in SVE. Since these data types do not have a fixed size at compile time, they cannot be stored as array elements, and thus pointers are needed.
The included SVE PacketMath tests (available in /test/packetmath.cc and /test/packetmath_sve_resnet.c) make use of this new structure to validate the transpose function.
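
To make the difference between the two structures concrete, here is a minimal sketch in portable C++. It does not reproduce the code in the patches: Packet4 is a plain stand-in for an SVE vector, ValueBlock only mirrors the general shape of PacketBlock (an array of packets held by value), and PointerBlock is a hypothetical pointer-based layout in the spirit of PacketBlockSVE. The transpose is scalar and only shows how data is reached through the pointers.

```cpp
#include <cassert>
#include <utility>

// Stand-in for a 4-lane packet. A real SVE packet would be a sizeless
// ACLE type such as svfloat32_t, which cannot be an array element.
struct Packet4 {
  float lane[4];
};

// Same general shape as PacketBlock: packets stored by value in an array.
// This needs sizeof(Packet) to be known at compile time.
template <typename Packet, int N>
struct ValueBlock {
  Packet packet[N];
};

// Hypothetical pointer-based variant: the block only refers to packets
// that live elsewhere, so the packet type never appears as an array element.
template <typename Packet, int N>
struct PointerBlock {
  Packet* packet[N];
};

// Scalar 4x4 transpose through the pointer block (illustration only).
inline void transpose4(PointerBlock<Packet4, 4>& block) {
  for (int i = 0; i < 4; ++i) {
    assert(block.packet[i] != nullptr);
  }
  for (int i = 0; i < 4; ++i) {
    for (int j = i + 1; j < 4; ++j) {
      std::swap(block.packet[i]->lane[j], block.packet[j]->lane[i]);
    }
  }
}

// Fills four rows, transposes them in place, and compares with the
// expected result. Returns true when every element ended up transposed.
inline bool transpose4_matches_reference() {
  Packet4 rows[4];
  for (int i = 0; i < 4; ++i) {
    for (int j = 0; j < 4; ++j) {
      rows[i].lane[j] = static_cast<float>(4 * i + j);
    }
  }
  PointerBlock<Packet4, 4> block = {{&rows[0], &rows[1], &rows[2], &rows[3]}};
  transpose4(block);
  for (int i = 0; i < 4; ++i) {
    for (int j = 0; j < 4; ++j) {
      if (rows[i].lane[j] != static_cast<float>(4 * j + i)) return false;
    }
  }
  return true;
}
```

The cost of the pointer layout is that callers must keep the packets alive themselves, which is one reason code written against PacketBlock needs changes to use it.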

Outside of PacketMath and the previously mentioned locations, other small SVE modifications were done whenever a NEON implementation was present in the code. Additionally, the cmake files were also modified to accommodate the new backend.

Drawbacks and future possibilities

The initial release demonstrates a proof of concept for an SVE backend with 128 and 512-bit vector lengths. Although it can be compiled for SVE architectures with different vector lengths, some functions will not validate, as they were tuned for these specific VLs.

One of the main features of SVE, Vector Length Agnosticism (VLA), is not fully supported by Eigen, which relies on fixed vector sizes to better exploit vector performance. SVE vectors have sizeless types, identified by the size of their elements, independently of the maximum vector length set. As such, some structures in Eigen's backend are not compatible with these types, such as PacketBlock, a structure containing an array of Packets. This structure is used in other parts of the project (e.g. the transpose function), which require a workaround to support these data types.

Work still needs to be done to either abstract the vector length in function optimization, or to consider all possible SVE vector lengths and optimize accordingly. In order to fully integrate a vector-length agnostic SVE backend with Eigen, changes to Eigen's core are also required. The aforementioned PacketBlock is one of them, but the code needs to be revised in order to seamlessly support sizeless vectors without breaking support for the existing fixed-size vector architectures. Ultimately, this would ensure compatibility with other projects such as TensorFlow, which currently cannot be built with Eigen SVE. As it stands in the proof-of-concept, benchmarks need to be carefully written to use the SVE backend.

As of mid-May, the GCC 10.1 stable build has been released, bringing the ability to create fixed-length SVE types. This enables the substitution of sizeless data types with fixed-size ones, solving the above incompatibility with the PacketBlock structure. However, this is not a complete solution, as it does not bring support for the desired SVE VLA.
We are currently performing some tests and evaluating this GCC feature with a TensorFlow build. The goal is to be able to build TensorFlow and run some benchmarks using the proof-of-concept Eigen with the SVE backend and a fixed VL.
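
For reference, a minimal sketch of what the fixed-length feature looks like from the C++ side. It assumes a compiler that implements the ACLE arm_sve_vector_bits attribute and defines __ARM_FEATURE_SVE_BITS when built with an option such as -msve-vector-bits=512. The names PacketF32, Block and block_bytes are hypothetical and chosen for this example only. When SVE is not enabled, the sketch falls back to a plain struct of the same size so that it still compiles.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__ARM_FEATURE_SVE_BITS) && __ARM_FEATURE_SVE_BITS > 0
#include <arm_sve.h>
// Fixed-length alias of the sizeless ACLE type svfloat32_t.
typedef svfloat32_t PacketF32
    __attribute__((arm_sve_vector_bits(__ARM_FEATURE_SVE_BITS)));
constexpr std::size_t kVectorBits = __ARM_FEATURE_SVE_BITS;
#else
// Portable stand-in (assumption for this sketch): 512 bits of float.
constexpr std::size_t kVectorBits = 512;
struct PacketF32 {
  float lane[kVectorBits / 32];
};
#endif

// With a compile-time size, the packet can be an array element again,
// which is what an array-of-packets structure like PacketBlock needs.
template <typename Packet, int N>
struct Block {
  Packet packet[N];
};

static_assert(sizeof(PacketF32) == kVectorBits / 8,
              "a fixed-length packet occupies exactly one vector");

// Size in bytes of a block of N packets held by value.
template <typename Packet, int N>
std::size_t block_bytes() {
  assert(sizeof(Block<Packet, N>) == N * sizeof(Packet));
  return sizeof(Block<Packet, N>);
}
```

A binary built this way is tied to the chosen vector length, which is why this route addresses the PacketBlock issue but not vector-length agnosticism.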
