Discussion:
[eigen] Tensor Broadcast - performance improvements for special cases
Sripathi, Vamsi
2018-05-23 21:30:13 UTC
Hi Eigen Maintainers,

Attached is a patch to the tensor broadcast operation. It improves performance for some special cases by using better SIMD vectorization. Could you please review and integrate the patch?

Below is a high-level description of the changes in TensorBroadcasting.h:

1. Implemented two new "packet" functions (packetNByOne, packetOneByN) that optimize the following N-D tensor broadcast cases for both row- and column-major ordering. Essentially, these generalize broadcasting a row/column vector into a matrix (a minimal usage sketch follows this list).

a. NByOne: Input (d0 x d1 x .. x 1) with Bcast (1 x 1 x .. x dn) resulting in Output (d0 x d1 x .. x dn)

i. This function uses a SIMD broadcast instruction where possible and otherwise gathers input elements without recomputing indices

b. OneByN: Input (1 x d1 x d2 x ..) with Bcast (d0 x 1 x 1 x ..) resulting in Output (d0 x d1 x d2 x ..)

i. This function uses a SIMD load instruction where possible and otherwise gathers input elements without recomputing indices

2. Modified the existing packet functions (packet{Row,Col}Major) to reduce index calculations when the input stride is not SIMD-friendly (i.e., a contiguous vector load is not possible)
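
For reference, here is a minimal sketch of the two patterns in Eigen's Tensor API (the 64/32 sizes are illustrative; broadcast() takes the per-dimension replication factors):

    #include <unsupported/Eigen/CXX11/Tensor>

    int main() {
      // NByOne: the trailing dimension of size 1 is replicated 32 times.
      Eigen::Tensor<float, 3> a(64, 64, 1);
      a.setRandom();
      Eigen::array<Eigen::Index, 3> bcast_a = {{1, 1, 32}};
      Eigen::Tensor<float, 3> a_out = a.broadcast(bcast_a);   // 64x64x32

      // OneByN: the leading dimension of size 1 is replicated 32 times.
      Eigen::Tensor<float, 3> b(1, 64, 64);
      b.setRandom();
      Eigen::array<Eigen::Index, 3> bcast_b = {{32, 1, 1}};
      Eigen::Tensor<float, 3> b_out = b.broadcast(bcast_b);   // 32x64x64
      return 0;
    }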

Testing:
Added the following four new test cases to cxx11_tensor_broadcasting.cpp (a standalone sketch follows this list):

1. NByOne with row and column major ordering

2. OneByN with row and column major ordering
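
A standalone check in the spirit of these tests might look as follows. This is a sketch using assert and made-up sizes; the actual tests rely on the VERIFY_IS_EQUAL harness of Eigen's test suite:

    #include <cassert>
    #include <unsupported/Eigen/CXX11/Tensor>

    template <int Layout>
    static void check_nbyone() {
      Eigen::Tensor<float, 3, Layout> in(7, 5, 1);
      in.setRandom();
      Eigen::array<Eigen::Index, 3> bcast = {{1, 1, 4}};
      Eigen::Tensor<float, 3, Layout> out = in.broadcast(bcast);
      for (Eigen::Index i = 0; i < 7; ++i)
        for (Eigen::Index j = 0; j < 5; ++j)
          for (Eigen::Index k = 0; k < 4; ++k)
            assert(out(i, j, k) == in(i, j, 0));  // every replica matches its source
    }

    int main() {
      check_nbyone<Eigen::ColMajor>();  // column-major ordering
      check_nbyone<Eigen::RowMajor>();  // row-major ordering
      return 0;
    }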

Performance:
Below are the results of a sweep test compiled with -O3 and -mavx2 and run with 4 threads on an Intel Broadwell machine. We observed ~30x and ~4x speedups for the NByOne and OneByN cases, respectively. The NByOne case shows a much higher speedup because the baseline uses scalar instructions, whereas the tuned version uses the SIMD vbroadcasts{s,d} instructions.

NByOne (Input: NxNx1 with Bcast: 1x1x32)

    Output Tensor    Baseline (ms)    Tuned (ms)    Speedup
    64x64x32                  2182            92      23.72
    128x128x32                8599           325      26.46
    192x192x32               19459           722      26.95
    256x256x32               34056          1163      29.28
    320x320x32               53492          1780      30.05
    384x384x32               77069          2525      30.52
    448x448x32              105110          3400      30.91
    512x512x32              136317          4424      30.81
    576x576x32              172864          5572      31.02
    640x640x32              213730          6970      30.66
    704x704x32              258884          8518      30.39
    768x768x32              308352         10120      30.47
    832x832x32              362219         11947      30.32
    896x896x32              420417         13945      30.15
    960x960x32              482870         15912      30.35
    1024x1024x32            545371         18228      29.92

OneByN (Input: 1xNxN with Bcast: 32x1x1)

    Output Tensor    Baseline (ms)    Tuned (ms)    Speedup
    32x64x64                   303            87       3.48
    32x128x128                1184           294       4.03
    32x192x192                2622           652       4.02
    32x256x256                4578          1127       4.06
    32x320x320                7112          1675       4.25
    32x384x384               10201          2385       4.28
    32x448x448               13885          3212       4.32
    32x512x512               18074          4186       4.32
    32x576x576               22853          5346       4.27
    32x640x640               28259          6582       4.29
    32x704x704               34248          7926       4.32
    32x768x768               40811          9499       4.30
    32x832x832               47932         11133       4.31
    32x896x896               55564         12929       4.30
    32x960x960               63772         14963       4.26
    32x1024x1024             72404         16998       4.26
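
For anyone wanting to reproduce the sweep, a timing harness along these lines can be written with the Tensor thread-pool device. This is an illustrative sketch, not the exact harness used for the numbers above; the repetition count is an assumption:

    #define EIGEN_USE_THREADS
    #include <chrono>
    #include <iostream>
    #include <unsupported/Eigen/CXX11/Tensor>

    int main() {
      Eigen::ThreadPool pool(4);                 // 4 threads, as in the sweep
      Eigen::ThreadPoolDevice device(&pool, 4);

      for (Eigen::Index n = 64; n <= 1024; n += 64) {  // the N sweep
        Eigen::Tensor<float, 3> in(n, n, 1);
        in.setRandom();
        Eigen::Tensor<float, 3> out(n, n, 32);
        Eigen::array<Eigen::Index, 3> bcast = {{1, 1, 32}};

        const int reps = 100;                    // repeat for stable timings
        auto t0 = std::chrono::steady_clock::now();
        for (int r = 0; r < reps; ++r)
          out.device(device) = in.broadcast(bcast);   // NByOne broadcast
        auto t1 = std::chrono::steady_clock::now();

        std::cout << n << "x" << n << "x32: "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
                  << " ms\n";
      }
      return 0;
    }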



-Vamsi
Gael Guennebaud
2018-05-29 15:47:41 UTC
Hi,

This looks ok to me, and I observe a ~10x speedup on simple code such as this one: https://stackoverflow.com/a/50512528/1641621. It probably also cancels the need for this PR:

https://bitbucket.org/rmlarsen/eigen3/commits/4ed934036eb225af0e3f4e2bc3fa7288a45f5927

gael
Sripathi, Vamsi
2018-05-30 18:56:44 UTC
Thanks, Gael, for trying out the patch.

Regarding the PR you mentioned: the changes there target one specific case, namely when none of the input dimensions is actually broadcast, which reduces the operation to a plain element-by-element copy from the input tensor to the output tensor. It would still be beneficial to have that PR in addition to my patch, since it handles the plain-copy case by skipping the index calculations entirely and returning early with a vector load op.
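
For concreteness, the case that PR targets looks like this at the user level (illustrative sizes): all broadcast factors are 1, so the operation is a pure copy, and the fast path can issue plain vector loads instead of per-element index computation.

    #include <cassert>
    #include <unsupported/Eigen/CXX11/Tensor>

    int main() {
      // Every broadcast factor is 1, so the "broadcast" is a plain
      // element-by-element copy of the input tensor.
      Eigen::Tensor<float, 3> in(8, 8, 8);
      in.setRandom();
      Eigen::array<Eigen::Index, 3> bcast = {{1, 1, 1}};
      Eigen::Tensor<float, 3> out = in.broadcast(bcast);
      for (Eigen::Index i = 0; i < in.size(); ++i)
        assert(out.data()[i] == in.data()[i]);   // same values, same linear layout
      return 0;
    }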

-Vamsi
