Re: [eigen] Tensor Broadcast - performance improvements for special cases
Thanks, Gael, for trying out the patch.
Regarding the PR you mentioned: the changes there target one specific case, namely when there is no broadcast on any of the input dimensions, which reduces the operation to a simple element-to-element copy from the input tensor to the output tensor. It would be beneficial to have that PR in addition to my patch, since it handles the plain copy case by avoiding the index calculations entirely and returning quickly with a vector load.
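As a rough illustration of that fast path (a sketch only; `broadcast_copy` and its signature are hypothetical, not the PR's actual code): when every broadcast factor is 1, the output has exactly the input's shape, so the whole operation collapses to a flat copy with no per-element index arithmetic.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical sketch, not the actual Eigen code: if every broadcast
// factor is 1, broadcasting degenerates into a straight element-to-element
// copy, so we can return quickly without any index calculations.
template <typename T>
void broadcast_copy(const T* in, T* out, std::size_t total,
                    const std::vector<int>& bcast) {
  bool trivial = true;
  for (int b : bcast) trivial = trivial && (b == 1);
  if (trivial) {
    std::memcpy(out, in, total * sizeof(T));  // quick return: plain copy
    return;
  }
  // ... the general broadcast path with index calculations would go here ...
  assert(false && "general path omitted in this sketch");
}
```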
-Vamsi
From: Gael Guennebaud [mailto:gael.guennebaud@gmail.com]
Sent: Tuesday, May 29, 2018 8:48 AM
To: eigen <eigen@xxxxxxxxxxxxxxxxxxx>
Subject: Re: [eigen] Tensor Broadcast - performance improvements for special cases
Hi,
This looks OK to me, and I observe a 10x speed-up on simple code such as this one: https://stackoverflow.com/a/50512528/1641621 . It probably also cancels the need for this PR:
gael
On Wed, May 23, 2018 at 11:30 PM, Sripathi, Vamsi <vamsi.sripathi@xxxxxxxxx> wrote:
Hi Eigen Maintainers,
Attached is a patch to the tensor broadcast operation. It improves performance in some special cases through better SIMD vectorization. Could you please review and integrate the patch?
Below is a high-level description of the changes in TensorBroadcasting.h:
1. Implemented two new "packet" functions (packetNByOne, packetOneByN) that optimize the following N-D tensor broadcast cases for both row- and column-major ordering. Essentially, these are generalizations of broadcasting a row/column vector to a matrix.
   a. NByOne: Input (d0 x d1 x .. x 1) with Bcast (1 x 1 x .. dn) resulting in Output (d0 x d1 x .. dn)
      i. This function uses a SIMD broadcast instruction where possible and otherwise gathers input elements without index recalculation.
   b. OneByN: Input (1 x d1 x d2 x ..) with Bcast (d0 x 1 x 1 x ..) resulting in Output (d0 x d1 x d2 x ..)
      i. This function uses a SIMD load instruction where possible and otherwise gathers input elements without index recalculation.
2. Modified the existing packet functions (Packet{Row,Col}Major) to reduce index calculations when the input stride is not SIMD-friendly.
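As a scalar reference for the two patterns (plain C++, illustrative function names, not the patch itself; a row-major layout is assumed): NByOne repeats each input element n times consecutively in the output, which maps naturally onto a single SIMD broadcast per input element, while OneByN repeats the whole contiguous input block n times, which maps onto plain vector loads with no per-element index recalculation.

```cpp
#include <cstddef>
#include <vector>

// Illustrative scalar reference, not the Eigen patch itself.
// Row-major layout assumed; "in" holds the m input elements.

// NByOne: input (m x 1) broadcast to (m x n).  Each input element is
// repeated n times consecutively, so a SIMD implementation can serve
// each input element with one broadcast instruction (e.g. vbroadcastss).
std::vector<float> n_by_one(const std::vector<float>& in, std::size_t n) {
  std::vector<float> out;
  out.reserve(in.size() * n);
  for (float v : in)
    for (std::size_t j = 0; j < n; ++j) out.push_back(v);
  return out;
}

// OneByN: input (1 x m) broadcast to (n x m).  The whole input row is
// repeated n times, so a SIMD implementation can use plain vector loads
// from the contiguous input.
std::vector<float> one_by_n(const std::vector<float>& in, std::size_t n) {
  std::vector<float> out;
  out.reserve(in.size() * n);
  for (std::size_t i = 0; i < n; ++i)
    out.insert(out.end(), in.begin(), in.end());
  return out;
}
```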
Testing:
Added the following 4 new test cases to cxx11_tensor_broadcasting.cpp
1. NByOne with row and column major ordering
2. OneByN with row and column major ordering
Performance:
Below are the results of a sweep test (-O3, -mavx2, 4 threads) on an Intel Broadwell machine. We observed ~30x and ~4x speedups for the NByOne and OneByN cases, respectively. The NByOne case shows a much higher speed-up because the baseline uses scalar instructions whereas the tuned version uses SIMD vbroadcasts{s,d}.
NByOne (Input: NxNx1, Bcast: 1x1x32)

  Output Tensor    Baseline (ms)   Tuned (ms)   Speedup
  64x64x32                  2182           92   23.71739
  128x128x32                8599          325   26.45846
  192x192x32               19459          722   26.95152
  256x256x32               34056         1163   29.28289
  320x320x32               53492         1780   30.05169
  384x384x32               77069         2525   30.52238
  448x448x32              105110         3400   30.91471
  512x512x32              136317         4424   30.81307
  576x576x32              172864         5572   31.02369
  640x640x32              213730         6970   30.66428
  704x704x32              258884         8518   30.39258
  768x768x32              308352        10120   30.46957
  832x832x32              362219        11947   30.31882
  896x896x32              420417        13945   30.14823
  960x960x32              482870        15912   30.34628
  1024x1024x32            545371        18228   29.91941

OneByN (Input: 1xNxN, Bcast: 32x1x1)

  Output Tensor    Baseline (ms)   Tuned (ms)   Speedup
  32x64x64                   303           87   3.482759
  32x128x128                1184          294   4.027211
  32x192x192                2622          652   4.021472
  32x256x256                4578         1127   4.062112
  32x320x320                7112         1675   4.24597
  32x384x384               10201         2385   4.277149
  32x448x448               13885         3212   4.322852
  32x512x512               18074         4186   4.317726
  32x576x576               22853         5346   4.274785
  32x640x640               28259         6582   4.293376
  32x704x704               34248         7926   4.320969
  32x768x768               40811         9499   4.296347
  32x832x832               47932        11133   4.305398
  32x896x896               55564        12929   4.297625
  32x960x960               63772        14963   4.26198
  32x1024x1024             72404        16998   4.25956
-Vamsi