Hi Eigen Maintainers,
Attached is a patch to the tensor broadcast operation. It improves performance for some special cases by using better SIMD vectorization. Could you please review and integrate the patch?
Below is a high-level description of the changes in TensorBroadcasting.h:
1. Implemented two new “packet” functions (packetNByOne, packetOneByN) that optimize the following N-D tensor broadcast cases for both row- and column-major ordering. Essentially, these are generalizations of broadcasting a row/column vector to a matrix (see the sketch after this list).
a. NByOne: Input (d0 x d1 x .. x 1) with Bcast (1 x 1 x .. dn) resulting in Output (d0 x d1 x .. dn)
i. This function uses a SIMD broadcast instruction where possible; in the remaining cases it gathers input elements without recalculating indices
b. OneByN: Input (1 x d1 x d2 x ..) with Bcast (d0 x 1 x 1 x ..) resulting in Output (d0 x d1 x d2 x ..)
i. This function uses a SIMD load instruction where possible; in the remaining cases it gathers input elements without recalculating indices
2. Modified the existing packet functions (packet{Row,Col}Major) to reduce index calculations when the input stride is not SIMD-friendly
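To make the two shapes concrete, here is a small user-level sketch of the broadcasts that hit the new paths. The sizes and the helper function name are illustrative only and are not part of the patch:

#include <unsupported/Eigen/CXX11/Tensor>

// Illustrative only: the two broadcast shapes targeted by packetNByOne / packetOneByN.
void broadcast_shape_examples() {
  using Eigen::Index;

  // NByOne: input d0 x d1 x 1 with broadcast factors 1 x 1 x dn.
  // Each input element is replicated dn times along the last dimension
  // (the case handled by packetNByOne).
  Eigen::Tensor<float, 3> nbyone_in(64, 64, 1);
  nbyone_in.setRandom();
  Eigen::array<Index, 3> nbyone_bcast = {{1, 1, 32}};
  Eigen::Tensor<float, 3> nbyone_out = nbyone_in.broadcast(nbyone_bcast);   // 64 x 64 x 32

  // OneByN: input 1 x d1 x d2 with broadcast factors d0 x 1 x 1.
  // Contiguous runs of the input are reused unchanged
  // (the case handled by packetOneByN).
  Eigen::Tensor<float, 3> onebyn_in(1, 64, 64);
  onebyn_in.setRandom();
  Eigen::array<Index, 3> onebyn_bcast = {{32, 1, 1}};
  Eigen::Tensor<float, 3> onebyn_out = onebyn_in.broadcast(onebyn_bcast);   // 32 x 64 x 64
}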
Testing:
Added the following 4 new test cases to cxx11_tensor_broadcasting.cpp:
1. NByOne with row- and column-major ordering
2. OneByN with row- and column-major ordering
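Each check follows the pattern of the existing tests in that file. A simplified sketch of the NByOne case (sizes and the test-function name here are placeholders, not the exact code in the patch; VERIFY_IS_EQUAL comes from the test harness):

template <int DataLayout>
static void test_nbyone_broadcasting_sketch()
{
  // Placeholder sizes; the layout is templated so the same check runs
  // for both ColMajor and RowMajor.
  Eigen::Tensor<float, 3, DataLayout> input(4, 5, 1);
  input.setRandom();

  Eigen::array<Eigen::Index, 3> bcast = {{1, 1, 3}};
  Eigen::Tensor<float, 3, DataLayout> output = input.broadcast(bcast);

  VERIFY_IS_EQUAL(output.dimension(0), 4);
  VERIFY_IS_EQUAL(output.dimension(1), 5);
  VERIFY_IS_EQUAL(output.dimension(2), 3);

  // Every replicated slice must match the single input slice element-wise.
  for (Eigen::Index i = 0; i < 4; ++i)
    for (Eigen::Index j = 0; j < 5; ++j)
      for (Eigen::Index k = 0; k < 3; ++k)
        VERIFY_IS_EQUAL(output(i, j, k), input(i, j, 0));
}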
Performance:
Below are the results of a sweep test (-O3, -mavx2, 4 threads) on an Intel Broadwell machine. We observed ~30x and ~4x speedups for the NByOne and OneByN cases, respectively. The NByOne case shows a much larger speedup because the baseline uses scalar instructions, whereas the tuned version uses SIMD vbroadcasts{s,d}.
NByOne (Inputs: NxNx1 with Bcast: 1x1x32)

| Output Tensor | Baseline (ms) | Tuned (ms) | Speedup  |
| 64x64x32      | 2182          | 92         | 23.71739 |
| 128x128x32    | 8599          | 325        | 26.45846 |
| 192x192x32    | 19459         | 722        | 26.95152 |
| 256x256x32    | 34056         | 1163       | 29.28289 |
| 320x320x32    | 53492         | 1780       | 30.05169 |
| 384x384x32    | 77069         | 2525       | 30.52238 |
| 448x448x32    | 105110        | 3400       | 30.91471 |
| 512x512x32    | 136317        | 4424       | 30.81307 |
| 576x576x32    | 172864        | 5572       | 31.02369 |
| 640x640x32    | 213730        | 6970       | 30.66428 |
| 704x704x32    | 258884        | 8518       | 30.39258 |
| 768x768x32    | 308352        | 10120      | 30.46957 |
| 832x832x32    | 362219        | 11947      | 30.31882 |
| 896x896x32    | 420417        | 13945      | 30.14823 |
| 960x960x32    | 482870        | 15912      | 30.34628 |
| 1024x1024x32  | 545371        | 18228      | 29.91941 |

OneByN (Inputs: 1xNxN with Bcast: 32x1x1)

| Output Tensor | Baseline (ms) | Tuned (ms) | Speedup  |
| 32x64x64      | 303           | 87         | 3.482759 |
| 32x128x128    | 1184          | 294        | 4.027211 |
| 32x192x192    | 2622          | 652        | 4.021472 |
| 32x256x256    | 4578          | 1127       | 4.062112 |
| 32x320x320    | 7112          | 1675       | 4.24597  |
| 32x384x384    | 10201         | 2385       | 4.277149 |
| 32x448x448    | 13885         | 3212       | 4.322852 |
| 32x512x512    | 18074         | 4186       | 4.317726 |
| 32x576x576    | 22853         | 5346       | 4.274785 |
| 32x640x640    | 28259         | 6582       | 4.293376 |
| 32x704x704    | 34248         | 7926       | 4.320969 |
| 32x768x768    | 40811         | 9499       | 4.296347 |
| 32x832x832    | 47932         | 11133      | 4.305398 |
| 32x896x896    | 55564         | 12929      | 4.297625 |
| 32x960x960    | 63772         | 14963      | 4.26198  |
| 32x1024x1024  | 72404         | 16998      | 4.25956  |
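For context on how a single data point was measured: the numbers above come from an internal sweep harness, but a minimal standalone timing loop of the same shape would look roughly like the following (hypothetical code for one NByOne row, assuming EIGEN_USE_THREADS and the flags above; it is not the harness that produced the table):

#define EIGEN_USE_THREADS
#include <chrono>
#include <cstdio>
#include <unsupported/Eigen/CXX11/Tensor>

int main() {
  const Eigen::Index N = 512;

  Eigen::Tensor<float, 3> input(N, N, 1);     // NByOne input for one row of the sweep
  input.setRandom();
  Eigen::Tensor<float, 3> output(N, N, 32);

  Eigen::ThreadPool pool(4);                  // 4 threads, matching the sweep setup
  Eigen::ThreadPoolDevice device(&pool, 4);

  Eigen::array<Eigen::Index, 3> bcast = {{1, 1, 32}};

  const int reps = 100;
  const auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < reps; ++r)
    output.device(device) = input.broadcast(bcast);
  const auto stop = std::chrono::steady_clock::now();

  const double ms = std::chrono::duration<double, std::milli>(stop - start).count();
  std::printf("NByOne %lldx%lldx32: %.1f ms total over %d iterations\n",
              static_cast<long long>(N), static_cast<long long>(N), ms, reps);
  return 0;
}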
-Vamsi