[eigen] Tensor Broadcast - performance improvements for special cases
Hi Eigen Maintainers,

Attached is a patch to the tensor broadcast operation. It improves performance for some special cases through better SIMD vectorization. Could you please review and integrate the patch?

Below is a high-level description of the changes in TensorBroadcasting.h:

1. Implemented two new "packet" functions (packetNByOne, packetOneByN) that optimize the following N-D tensor broadcast cases for both row-major and column-major ordering. Essentially, these generalize the broadcast of a row/column vector to a matrix.
   a. NByOne: Input (d0 x d1 x .. x 1) with Bcast (1 x 1 x .. x dn), resulting in Output (d0 x d1 x .. x dn)
      i. This function uses a SIMD broadcast instruction where possible, and in other cases gathers input elements without index recalculation.
   b. OneByN: Input (1 x d1 x d2 x ..) with Bcast (d0 x 1 x 1 x ..), resulting in Output (d0 x d1 x d2 x ..)
      i. This function uses a SIMD load instruction where possible, and in other cases gathers input elements without index recalculation.
2. Modified the existing packet functions (Packet{Row,Col}Major) to reduce index calculations when the input stride is non-SIMD.

Testing: Added the following 4 new test cases to cxx11_tensor_broadcasting.cpp:

1. NByOne with row-major and column-major ordering
2. OneByN with row-major and column-major ordering

Performance: Below are the results of a sweep test with (-O3 + -mavx2 + 4 threads) on an Intel Broadwell machine. I observed ~30x and ~4x speedups for the NByOne and OneByN cases respectively. The NByOne case shows a much higher speedup because the baseline uses scalar instructions, whereas the tuned version uses SIMD vbroadcasts{s,d}.
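
For readers who want to see the two access patterns concretely, here is a minimal scalar sketch. It is not code from the patch: the helper names broadcast_n_by_one and broadcast_one_by_n are hypothetical, and it assumes a 2-D, row-major layout with the tensors stored as flat arrays. Under those assumptions, NByOne repeats each input element dn times in a row (the pattern a SIMD broadcast instruction can fill), while OneByN repeats the whole input row d0 times (the pattern contiguous SIMD loads can fill).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative only: hypothetical helpers, 2-D row-major layout assumed.

// NByOne: input (d0 x 1) with Bcast (1 x dn) -> output (d0 x dn).
// Each input element fills dn consecutive output slots.
std::vector<float> broadcast_n_by_one(const std::vector<float>& in,
                                      std::size_t dn) {
  assert(dn > 0);
  std::vector<float> out;
  out.reserve(in.size() * dn);
  for (float v : in) {
    out.insert(out.end(), dn, v);  // dn copies of the same value
  }
  return out;
}

// OneByN: input (1 x d1) with Bcast (d0 x 1) -> output (d0 x d1).
// The input row is copied contiguously d0 times.
std::vector<float> broadcast_one_by_n(const std::vector<float>& in,
                                      std::size_t d0) {
  assert(d0 > 0);
  std::vector<float> out;
  out.reserve(in.size() * d0);
  for (std::size_t row = 0; row < d0; ++row) {
    out.insert(out.end(), in.begin(), in.end());  // one contiguous row copy
  }
  return out;
}
```

With input {1, 2} and dn = 3, the first helper yields 1 1 1 2 2 2; with input {1, 2, 3} and d0 = 2, the second yields 1 2 3 1 2 3. Neither loop recomputes a multi-dimensional index per output element, which is the property the packet functions described above rely on.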
-Vamsi
Attachment:
TensorBroadcasting.diff