RE: [eigen] Tensor Broadcast - performance improvements for special cases



Thanks, Gael, for trying out the patch.

 

Regarding the PR you mentioned, the changes there target one specific case: when none of the input dimensions is actually broadcast, which reduces the operation to a simple element-to-element copy from the input tensor to the output tensor. It is still beneficial to have that PR in addition to my patch, since it handles the plain-copy case by avoiding the index calculations entirely and returning early with a vectorized load.
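To illustrate the degenerate case that PR handles, here is a plain-C++ sketch (the names `is_pure_copy` and `broadcast_or_copy` are made up for illustration, not the actual Eigen internals): when every broadcast factor is 1, no dimension is replicated, so the op is a verbatim block copy and all per-element index arithmetic can be skipped.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// If every broadcast factor is 1, the output is a verbatim copy of
// the input, so the general per-index broadcast path is unnecessary.
bool is_pure_copy(const std::vector<std::size_t>& bcast_factors) {
  return std::all_of(bcast_factors.begin(), bcast_factors.end(),
                     [](std::size_t b) { return b == 1; });
}

void broadcast_or_copy(const float* in, float* out, std::size_t n,
                       const std::vector<std::size_t>& bcast_factors) {
  if (is_pure_copy(bcast_factors)) {
    std::copy(in, in + n, out);  // plain block copy; compilers vectorize this
    return;                      // early return, no index calculations
  }
  // ... the general per-index broadcast path would go here ...
}
```

The point of the early return is exactly what the PR does: detect the no-broadcast case once, up front, instead of paying index arithmetic per output element.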

 

-Vamsi

 

From: Gael Guennebaud [mailto:gael.guennebaud@xxxxxxxxx]
Sent: Tuesday, May 29, 2018 8:48 AM
To: eigen <eigen@xxxxxxxxxxxxxxxxxxx>
Subject: Re: [eigen] Tensor Broadcast - performance improvements for special cases

 

Hi,

 

This looks OK to me, and I observe a 10x speed-up on simple code such as this one: https://stackoverflow.com/a/50512528/1641621. It probably also cancels the need for this PR:

 

 

gael

 

 

 

On Wed, May 23, 2018 at 11:30 PM, Sripathi, Vamsi <vamsi.sripathi@xxxxxxxxx> wrote:

Hi Eigen Maintainers,

 

Attached is a patch to the tensor broadcast operation. It improves performance for some special cases by using better SIMD vectorization. Could you please review and integrate it?

 

Below is a high-level description of the changes in TensorBroadcasting.h:

1. Implemented two new "packet" functions (packetNByOne, packetOneByN) that optimize the following N-D tensor broadcast cases for both row- and column-major ordering. Essentially, these are generalizations of broadcasting a row/column vector to a matrix.

   a. NByOne: Input (d0 x d1 x .. x 1) with Bcast (1 x 1 x .. dn), resulting in Output (d0 x d1 x .. dn)

      - Uses the SIMD broadcast instruction where possible; otherwise gathers input elements without recomputing indices

   b. OneByN: Input (1 x d1 x d2 x ..) with Bcast (d0 x 1 x 1 x ..), resulting in Output (d0 x d1 x d2 x ..)

      - Uses the SIMD load instruction where possible; otherwise gathers input elements without recomputing indices

2. Modified the existing packet functions (Packet{Row,Col}Major) to reduce index calculations when the input stride is non-SIMD
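For intuition, the two shapes reduce to a 2-D row-major special case, sketched here as scalar reference code (illustrative function names, not the patch itself; the patch generalizes this to N-D tensors and both storage orders):

```cpp
#include <cstddef>
#include <vector>

// NByOne: column vector (n x 1) broadcast to (n x m), row-major.
// Each output row is m copies of a single input element, so one SIMD
// register broadcast (e.g. vbroadcastss) can fill a whole packet.
std::vector<float> bcast_n_by_one(const std::vector<float>& in, std::size_t m) {
  std::vector<float> out(in.size() * m);
  for (std::size_t i = 0; i < in.size(); ++i)
    for (std::size_t j = 0; j < m; ++j)
      out[i * m + j] = in[i];          // same input element across the row
  return out;
}

// OneByN: row vector (1 x n) broadcast to (m x n), row-major.
// Each output row is a verbatim copy of the input row, so contiguous
// SIMD loads/stores suffice, with no per-element index recomputation.
std::vector<float> bcast_one_by_n(const std::vector<float>& in, std::size_t m) {
  std::vector<float> out(in.size() * m);
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t j = 0; j < in.size(); ++j)
      out[i * in.size() + j] = in[j];  // whole input row repeats
  return out;
}
```

In the NByOne inner loop the same in[i] fills a contiguous run of outputs, which is exactly what a SIMD broadcast produces; in the OneByN loop each output row is a contiguous copy of the input, which maps to plain vector loads and stores.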

 

Testing: 

Added the following four new test cases to cxx11_tensor_broadcasting.cpp:

1. NByOne with row- and column-major ordering

2. OneByN with row- and column-major ordering

 

Performance:

Below are the results of a sweep test (-O3, -mavx2, 4 threads) on an Intel Broadwell machine. I observed ~30x and ~4x speedups for the NByOne and OneByN cases, respectively. The NByOne case shows a much higher speed-up because the baseline uses scalar instructions, whereas the tuned version uses the SIMD vbroadcasts{s,d} instructions.

 

NByOne (Inputs: NxNx1 with Bcast: 1x1x32)

Output Tensor   Baseline (ms)   Tuned (ms)   Speedup
64x64x32                 2182           92   23.71739
128x128x32               8599          325   26.45846
192x192x32              19459          722   26.95152
256x256x32              34056         1163   29.28289
320x320x32              53492         1780   30.05169
384x384x32              77069         2525   30.52238
448x448x32             105110         3400   30.91471
512x512x32             136317         4424   30.81307
576x576x32             172864         5572   31.02369
640x640x32             213730         6970   30.66428
704x704x32             258884         8518   30.39258
768x768x32             308352        10120   30.46957
832x832x32             362219        11947   30.31882
896x896x32             420417        13945   30.14823
960x960x32             482870        15912   30.34628
1024x1024x32           545371        18228   29.91941

OneByN (Inputs: 1xNxN with Bcast: 32x1x1)

Output Tensor   Baseline (ms)   Tuned (ms)   Speedup
32x64x64                  303           87   3.482759
32x128x128               1184          294   4.027211
32x192x192               2622          652   4.021472
32x256x256               4578         1127   4.062112
32x320x320               7112         1675   4.24597
32x384x384              10201         2385   4.277149
32x448x448              13885         3212   4.322852
32x512x512              18074         4186   4.317726
32x576x576              22853         5346   4.274785
32x640x640              28259         6582   4.293376
32x704x704              34248         7926   4.320969
32x768x768              40811         9499   4.296347
32x832x832              47932        11133   4.305398
32x896x896              55564        12929   4.297625
32x960x960              63772        14963   4.26198
32x1024x1024            72404        16998   4.25956

 

 

-Vamsi