[eigen] (FYI - no action needed) benchmarking various GEMM kernels on ARM cores
- To: eigen <eigen@xxxxxxxxxxxxxxxxxxx>
- Subject: [eigen] (FYI - no action needed) benchmarking various GEMM kernels on ARM cores
- From: Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
- Date: Mon, 15 May 2017 17:18:08 -0400
This is a data-only email. In my work at Google on gemmlowp, a matrix library focused on 8-bit fixed-point matrix multiplication, primarily intended for mobile neural network inference, I have benchmarked a variety of GEMM kernels. Most of them are written in ARM 32-bit or 64-bit assembly; some are written in C++ with intrinsics, just to check how that compares. The benchmarks are in this fully self-contained program: https://github.com/google/gemmlowp/blob/master/standalone/neon-gemm-kernel-benchmark.cc
Further, this file has received contributions directly from ARM showing how to achieve the best performance on various ARM cores, with very helpful comments annotating the assembly, so I hope that this material might be useful to other people interested in ARM GEMM kernels.
Here are benchmark results on various Android ARM devices: https://docs.google.com/spreadsheets/d/1UPbzbp9rdsD6RXxOr5q6AZ0n1omgEknLYO2ogiw6Kqk/edit#gid=0
Notice that there are 3 kinds of kernels here, even though gemmlowp itself is only interested in the first kind:
1. 8-bit * 8-bit with internal 32-bit accumulators
2. 32-bit integer (like Eigen::Matrix<int32_t, ...>)
3. 32-bit floating point (like Eigen::MatrixXf).
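To make the first kind concrete, here is a minimal scalar sketch (hypothetical, not gemmlowp's actual kernel code) of why internal 32-bit accumulators are needed: a sum of 8-bit * 8-bit products overflows an 8- or 16-bit accumulator almost immediately, so the products are widened and accumulated in 32 bits.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of kernel kind 1: 8-bit unsigned inputs, 32-bit internal
// accumulator. A single 8-bit * 8-bit product can be up to 255*255 = 65025,
// which already does not fit in 16 bits once two such products are summed.
int32_t DotProduct8bit(const uint8_t* a, const uint8_t* b, int depth) {
  int32_t acc = 0;  // widened accumulator
  for (int k = 0; k < depth; ++k) {
    acc += static_cast<int32_t>(a[k]) * static_cast<int32_t>(b[k]);
  }
  return acc;
}
```

Real kernels of this kind do the widening in SIMD (e.g. NEON's UMULL/UMLAL pairwise-widening multiplies), but the accumulator-width reasoning is the same.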
The float results give a few data points that may inspire changes to Eigen's GEMM kernels and SIMD wrappers. Indeed, Eigen's GEMM kernels (last I checked) load one RHS scalar value at a time, duplicate it onto all lanes of a SIMD register, and multiply that against an LHS SIMD register; see the 'WithVectorDuplicatingScalar' rows in the spreadsheet. That approach, inspired by x86, cannot achieve optimal performance on ARM, where the multiply (and multiply-add) instructions can multiply one SIMD register by *one specific lane* of another SIMD register, allowing for significantly simpler GEMM kernels; see the 'WithScalar' rows in the spreadsheet. Addressing that in Eigen would require some changes to the PacketMath.h SIMD wrappers, and it may not be trivial to arrive at an abstraction that maps efficiently to both ARM and x86.
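To illustrate the multiply-by-lane structure, here is a portable scalar model (a hypothetical sketch, not any of the benchmarked kernels) of a 4x4 float micro-kernel, where each 4-element array stands in for one 128-bit NEON register. The per-column multiply-add in the inner loop is exactly what AArch64 expresses as a single FMLA-by-lane instruction (e.g. `fmla v8.4s, v0.4s, v1.s[j]`); the x86-inspired alternative must first duplicate the RHS scalar across a whole register.

```cpp
#include <array>
#include <cassert>

// One Vec4 models one 128-bit SIMD register holding 4 floats.
using Vec4 = std::array<float, 4>;

// Computes a 4x4 block: result(i,j) = sum over k of lhs(i,k) * rhs(k,j).
// lhs is packed column-major (lhs[k*4 + i]), rhs is packed row-major
// (rhs[k*4 + j]), result is column-major (result[j*4 + i]).
void MicroKernel4x4(const float* lhs, const float* rhs, int depth,
                    float* result) {
  Vec4 acc[4] = {};  // four accumulator "registers", one per result column
  for (int k = 0; k < depth; ++k) {
    Vec4 lhs_vec;  // one SIMD load of a 4-entry LHS column
    for (int i = 0; i < 4; ++i) lhs_vec[i] = lhs[k * 4 + i];
    // One SIMD load would also fetch the 4-entry RHS row; then each column
    // update below is a single multiply-accumulate by lane j on ARM, with
    // no separate scalar-duplication step.
    for (int j = 0; j < 4; ++j) {
      float rhs_lane = rhs[k * 4 + j];
      for (int i = 0; i < 4; ++i) acc[j][i] += lhs_vec[i] * rhs_lane;
    }
  }
  for (int j = 0; j < 4; ++j)
    for (int i = 0; i < 4; ++i) result[j * 4 + i] = acc[j][i];
}
```

In the duplicating-scalar style, each `rhs_lane` would first be broadcast into its own Vec4 before a plain vector-by-vector multiply-add, costing an extra instruction (and register) per column per depth step.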
Another data point, perhaps implicit in this spreadsheet, is the case for writing GEMM kernels in assembly: it is still difficult to approach the same level of performance in C++ with intrinsics. That may be specific to ARM or to mobile, though; x86 desktop CPUs may be less sensitive to such details, and x86 toolchains are more mature than their ARM counterparts.