[eigen] (FYI - no action needed) benchmarking various GEMM kernels on ARM cores


Hi List,

This is a data-only email. In my work at Google on gemmlowp, a matrix library focused on 8-bit fixed-point matrix multiplication and primarily intended for mobile neural network inference, I have benchmarked a variety of GEMM kernels. Most of them are written in ARM 32-bit or 64-bit assembly; some are written in C++ with intrinsics, just to check how that compares. They are all collected in this fully self-contained program:


Further, this file has received contributions directly from ARM showing how to achieve the best performance on various ARM cores, and they annotated the assembly with very helpful comments, so I hope that this material will be useful to other people interested in ARM GEMM kernels.

Here are benchmark results on various Android ARM devices:


Notice that there are 3 kinds of kernels here, even though gemmlowp is only interested in the first kind:
1. 8-bit * 8-bit with internal 32-bit accumulators
2. 32-bit integer (like Eigen::Matrix<int32_t, ...>)
3. 32-bit floating point (like Eigen::MatrixXf).

The float results give a few data points that may inspire changes to Eigen's GEMM kernels and SIMD wrappers. Indeed, Eigen's GEMM kernels (last I checked) load one RHS scalar value at a time, duplicate it onto all lanes of a SIMD register, and multiply that against a LHS SIMD register; see the 'WithVectorDuplicatingScalar' rows in the spreadsheet. That approach, inspired by x86, cannot achieve optimal performance on ARM, where the multiplication (and mul-add) instructions can multiply one SIMD register by *one specific lane* of another SIMD register, allowing for significantly simpler GEMM kernels: see the 'WithScalar' rows in the spreadsheet. Addressing that in Eigen would require some changes to the PacketMath.h SIMD wrappers, and it may not be trivial to arrive at an abstraction that maps efficiently to both ARM and x86.

Another data point, perhaps implicit in this spreadsheet, is the case for writing GEMM kernels in assembly: it is still difficult to approach the same level of performance in C++ with intrinsics. That may be specific to ARM or to mobile, though; x86 desktop CPUs may be less sensitive to such details, and x86 toolchains are more mature than their ARM counterparts.

