Re: [eigen] Tensor Broadcast - performance improvements for special cases
Thanks, Gael, for trying out the patch.
Regarding the PR you mentioned: the changes there target one specific case, namely when there is no broadcast on any of the input dimensions, which reduces the operation to a simple element-to-element copy from the input tensor to the output tensor. It would be beneficial to have that PR in addition to my patch, since it handles the plain copy case by avoiding the index calculations entirely and returning quickly with a vector load.
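As a rough illustration of that fast path (a sketch only; `broadcast_copy` and its signature are hypothetical, not the PR's actual code): when every broadcast factor is 1, the output has exactly the input's shape, so the whole operation collapses to a flat copy with no per-element index arithmetic.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical sketch, not the actual Eigen code: if every broadcast
// factor is 1, broadcasting degenerates into a straight element-to-element
// copy, so we can return quickly without any index calculations.
template <typename T>
void broadcast_copy(const T* in, T* out, std::size_t total,
                    const std::vector<int>& bcast) {
  bool trivial = true;
  for (int b : bcast) trivial = trivial && (b == 1);
  if (trivial) {
    std::memcpy(out, in, total * sizeof(T));  // quick return: plain copy
    return;
  }
  // ... the general broadcast path with index calculations would go here ...
  assert(false && "general path omitted in this sketch");
}
```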
-Vamsi
From: Gael Guennebaud [mailto:gael.guennebaud@gmail.com]
Sent: Tuesday, May 29, 2018 8:48 AM
To: eigen <eigen@xxxxxxxxxxxxxxxxxxx>
Subject: Re: [eigen] Tensor Broadcast - performance improvements for special cases
Hi,
This looks OK to me, and I observe a 10x speed-up on simple code such as this one: https://stackoverflow.com/a/50512528/1641621 . It probably also cancels the need for this PR:
gael
On Wed, May 23, 2018 at 11:30 PM, Sripathi, Vamsi <vamsi.sripathi@xxxxxxxxx> wrote:
Hi Eigen Maintainers,
Attached is a patch to the tensor broadcast operation. It improves performance in some special cases through better SIMD vectorization. Could you please review and integrate the patch?
Below is a high-level description of the changes in TensorBroadcasting.h:
1. Implemented two new "packet" functions (packetNByOne, packetOneByN) that optimize the following N-D tensor broadcast cases for both row- and column-major ordering. Essentially, these are generalizations of broadcasting a row/column vector to a matrix.
   a. NByOne: Input (d0 x d1 x .. x 1) with Bcast (1 x 1 x .. dn) resulting in Output (d0 x d1 x .. dn)
      i. This function uses a SIMD broadcast instruction where possible and otherwise gathers input elements without index recalculation.
   b. OneByN: Input (1 x d1 x d2 x ..) with Bcast (d0 x 1 x 1 x ..) resulting in Output (d0 x d1 x d2 x ..)
      i. This function uses a SIMD load instruction where possible and otherwise gathers input elements without index recalculation.
2. Modified the existing packet functions (Packet{Row,Col}Major) to reduce index calculations when the input stride is not SIMD-friendly.
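As a scalar reference for the two patterns (plain C++, illustrative function names, not the patch itself; a row-major layout is assumed): NByOne repeats each input element n times consecutively in the output, which maps naturally onto a single SIMD broadcast per input element, while OneByN repeats the whole contiguous input block n times, which maps onto plain vector loads with no per-element index recalculation.

```cpp
#include <cstddef>
#include <vector>

// Illustrative scalar reference, not the Eigen patch itself.
// Row-major layout assumed; "in" holds the m input elements.

// NByOne: input (m x 1) broadcast to (m x n).  Each input element is
// repeated n times consecutively, so a SIMD implementation can serve
// each input element with one broadcast instruction (e.g. vbroadcastss).
std::vector<float> n_by_one(const std::vector<float>& in, std::size_t n) {
  std::vector<float> out;
  out.reserve(in.size() * n);
  for (float v : in)
    for (std::size_t j = 0; j < n; ++j) out.push_back(v);
  return out;
}

// OneByN: input (1 x m) broadcast to (n x m).  The whole input row is
// repeated n times, so a SIMD implementation can use plain vector loads
// from the contiguous input.
std::vector<float> one_by_n(const std::vector<float>& in, std::size_t n) {
  std::vector<float> out;
  out.reserve(in.size() * n);
  for (std::size_t i = 0; i < n; ++i)
    out.insert(out.end(), in.begin(), in.end());
  return out;
}
```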
Testing:
Added the following 4 new test cases to cxx11_tensor_broadcasting.cpp
1. NByOne with row and column major ordering
2. OneByN with row and column major ordering
Performance:
Below are the results of a sweep test (-O3, -mavx2, 4 threads) on an Intel Broadwell machine. We observed ~30x and ~4x speedups for the NByOne and OneByN cases, respectively. The NByOne case shows a much higher speed-up because the baseline uses scalar instructions whereas the tuned version uses SIMD vbroadcasts{s,d}.
NByOne (Input: NxNx1, Bcast: 1x1x32)

  Output Tensor    Baseline (ms)   Tuned (ms)   Speedup
  64x64x32                  2182           92   23.71739
  128x128x32                8599          325   26.45846
  192x192x32               19459          722   26.95152
  256x256x32               34056         1163   29.28289
  320x320x32               53492         1780   30.05169
  384x384x32               77069         2525   30.52238
  448x448x32              105110         3400   30.91471
  512x512x32              136317         4424   30.81307
  576x576x32              172864         5572   31.02369
  640x640x32              213730         6970   30.66428
  704x704x32              258884         8518   30.39258
  768x768x32              308352        10120   30.46957
  832x832x32              362219        11947   30.31882
  896x896x32              420417        13945   30.14823
  960x960x32              482870        15912   30.34628
  1024x1024x32            545371        18228   29.91941

OneByN (Input: 1xNxN, Bcast: 32x1x1)

  Output Tensor    Baseline (ms)   Tuned (ms)   Speedup
  32x64x64                   303           87   3.482759
  32x128x128                1184          294   4.027211
  32x192x192                2622          652   4.021472
  32x256x256                4578         1127   4.062112
  32x320x320                7112         1675   4.24597
  32x384x384               10201         2385   4.277149
  32x448x448               13885         3212   4.322852
  32x512x512               18074         4186   4.317726
  32x576x576               22853         5346   4.274785
  32x640x640               28259         6582   4.293376
  32x704x704               34248         7926   4.320969
  32x768x768               40811         9499   4.296347
  32x832x832               47932        11133   4.305398
  32x896x896               55564        12929   4.297625
  32x960x960               63772        14963   4.26198
  32x1024x1024             72404        16998   4.25956
-Vamsi