[eigen] Tensor Broadcast - performance improvements for special cases

Hi Eigen Maintainers,

Attached is a patch to the tensor broadcast operation. It improves performance for some special cases through better SIMD vectorization. Could you please review and integrate the patch?

Below is a high-level description of the changes in TensorBroadcasting.h:

1. Implemented two new “packet” functions (packetNByOne, packetOneByN) that optimize the following N-D tensor broadcast cases for both row-major and column-major ordering. Essentially, these cases generalize broadcasting a row/column vector to a matrix.

a. NByOne: Input (d0 x d1 x .. x 1) with Bcast (1 x 1 x .. x dn) resulting in Output (d0 x d1 x .. x dn)

i. This function uses a SIMD broadcast instruction where possible; in other cases it gathers input elements without recalculating the index

b. OneByN: Input (1 x d1 x d2 x ..) with Bcast (d0 x 1 x 1 x ..) resulting in Output (d0 x d1 x d2 x ..)

i. This function uses a SIMD load instruction where possible; in other cases it gathers input elements without recalculating the index
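For reviewers who want the two access patterns spelled out, here is a minimal scalar sketch of the semantics for row-major ordering. It is not the packet code from the patch: the helper names broadcast_n_by_one and broadcast_one_by_n are hypothetical, the tensors are flattened to std::vector<float>, and n stands for the single broadcast factor (dn for NByOne, d0 for OneByN).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helpers for illustration only; not part of Eigen.

// NByOne, row-major: input (d0 x d1 x .. x 1) flattened to `in`,
// output (d0 x d1 x .. x n). Each input element is repeated n times
// in a row, which is the pattern a SIMD broadcast instruction produces.
std::vector<float> broadcast_n_by_one(const std::vector<float>& in, std::size_t n) {
  assert(n > 0);
  std::vector<float> out(in.size() * n);
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i] = in[i / n];  // one division, no per-dimension index math
  }
  return out;
}

// OneByN, row-major: input (1 x d1 x d2 x ..) flattened to `in`,
// output (n x d1 x d2 x ..). The whole input is repeated n times, so
// output runs are contiguous input runs (a plain SIMD load), except
// where a run wraps past the end of the input.
std::vector<float> broadcast_one_by_n(const std::vector<float>& in, std::size_t n) {
  assert(n > 0);
  std::vector<float> out(in.size() * n);
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i] = in[i % in.size()];  // wraps at the input size
  }
  return out;
}
```

With in = {1, 2, 3, 4} and n = 3, the first helper yields 1 1 1 2 2 2 3 3 3 4 4 4 and the second yields 1 2 3 4 1 2 3 4 1 2 3 4. For column-major ordering the two patterns swap roles, since the innermost dimension is then the first one.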

2. Modified the existing packet functions (Packet{Row,Col}Major) to reduce index calculations when input stride is non-SIMD

Testing:

Added the following 4 new test cases to cxx11_tensor_broadcasting.cpp

1. NByOne with row and column major ordering

2. OneByN with row and column major ordering

Performance:

Below are the results of a sweep test with (-O3 + -mavx2 + 4 threads) on an Intel Broadwell machine. We observed speedups of ~30x and ~4x for the NByOne and OneByN cases respectively. The NByOne case shows a much higher speedup because the baseline uses scalar instructions whereas the tuned version uses SIMD vbroadcasts{s,d}.

NByOne (Input: NxNx1 with Bcast: 1x1x32)

Output Tensor   Baseline (Time ms)   Tuned (Time ms)   Speedup
64x64x32        2182                 92                23.71739
128x128x32      8599                 325               26.45846
192x192x32      19459                722               26.95152
256x256x32      34056                1163              29.28289
320x320x32      53492                1780              30.05169
384x384x32      77069                2525              30.52238
448x448x32      105110               3400              30.91471
512x512x32      136317               4424              30.81307
576x576x32      172864               5572              31.02369
640x640x32      213730               6970              30.66428
704x704x32      258884               8518              30.39258
768x768x32      308352               10120             30.46957
832x832x32      362219               11947             30.31882
896x896x32      420417               13945             30.14823
960x960x32      482870               15912             30.34628
1024x1024x32    545371               18228             29.91941

OneByN (Input: 1xNxN with Bcast: 32x1x1)

Output Tensor   Baseline (Time ms)   Tuned (Time ms)   Speedup
32x64x64        303                  87                3.482759
32x128x128      1184                 294               4.027211
32x192x192      2622                 652               4.021472
32x256x256      4578                 1127              4.062112
32x320x320      7112                 1675              4.24597
32x384x384      10201                2385              4.277149
32x448x448      13885                3212              4.322852
32x512x512      18074                4186              4.317726
32x576x576      22853                5346              4.274785
32x640x640      28259                6582              4.293376
32x704x704      34248                7926              4.320969
32x768x768      40811                9499              4.296347
32x832x832      47932                11133             4.305398
32x896x896      55564                12929             4.297625
32x960x960      63772                14963             4.26198
32x1024x1024    72404                16998             4.25956

-Vamsi
