[eigen] Two performance regressions from Eigen2 to Eigen3 with bisected changes

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]


I bisected a specific case to find the performance regressions between 2.0-rc1 and 3.0.4. I used the attached benchmark with the attached script for 2x3 sized matrices and 2x1 vectors and 3x1 vectors. I found 3 regressions. Below is the pasted text from the script and benchmark I used to bisect.

What's especially interesting is the first regression. The change is simply changing a == to a <=, but it triples the benchmark time.

# This is doing y += A * x; for y 2x1, A 2x3, x 3x1, but dynamically sized; see
# the attached benchmark.
#
# Eigen 2.0-rc1:   0.59s
# Eigen 3.1.x:     1.44s
#
# I want to find the slowdown! I ran bisect.
#
# At rev 2818:46717fcc1338 I got 0.277 seconds, which is nearly 2x
# faster than Eigen 2.0rc-1!
#
# 2818:46717fcc1338 -> 0.277
# 2867:b1513faeb8c8 -> 0.268
# 2889:10de0fab4b7c -> 1.190    <- Not 1.4s, but way slower
# 2869:cd2eedbb4b9a -> 0.268
# 2870:57403e3ebd77 -> 0.268
# 2871:fa29f384b58e -> 1.120

# FIRST REGRESSION: 0.277s to 1.19s
#
# changeset:   2871:fa29f384b58e
# user:        Gael Guennebaud <g.gael@xxxxxxx>
# date:        Tue Jun 01 22:51:47 2010 +0200
# summary:     fix issue #128 : inner stride can also be 0 in which case it means 1...
#
# 2871:fa29f384b58e -> 1..120s
# 2870:57403e3ebd77 -> 0.268s

# SECOND REGRESSION: 1.19s to 1.350s
#
# changeset:   3257:16fbf706b00b
# user:        Gael Guennebaud <g.gael@xxxxxxx>
# date:        Sun Jul 11 15:48:30 2010 +0200
# summary:     mixing types in  product step 2:
#
# 3256:4eb3074535b1 -> 1.190s
# 3257:16fbf706b00b -> 1.380s

# THIRD REGRESSION: 1.350 to 1.420
#
# changeset:   3259:1835e135c596
# user:        Gael Guennebaud <g.gael@xxxxxxx>
# date:        Sun Jul 11 23:57:23 2010 +0200
# summary:     mixing types step 3:
#
# 3259:1835e135c596 -> 1.420s
# 3258:39900c3d91f0 -> 1.380s

Let me know if this provides more insight into the performance regressions.

Keir
// Copyright 2012 Google Inc. All Rights Reserved.
// Author: keir@xxxxxxxxxx (Keir Mierle)
//
// Compile with GCC 4.4.3-4ubuntu5:
//
//   g++ -DV2 -O2 small_matrix_products.cc -o small_matrix_products_v2 -I src/eigen-2.0.17
//   g++ -DV3 -O2 small_matrix_products.cc -o small_matrix_products_v3 -I src/eigen-3.0.4
//   g++ -DV3 -O2 small_matrix_products.cc -o small_matrix_products_v3 -I wrk/eigen
//
// Then try a few combinations:
//
//   for x in 2 3; do; time "./small_matrix_products_v$x" 2 9 10000000; done
//   for x in 2 3; do; time "./small_matrix_products_v$x" 2 3 10000000; done
//   for x in 2 3; do; time "./small_matrix_products_v$x" 2 3 10000000; done
//
// On my desktop, which is a 4-core Xeon X5550, Eigen 3 gets beaten all over the
// playground by Eigen 2 in this benchmark, for all tried values of N and M:
//
//   N    M   Iterations   V2 (sec)   V3 (sec)
//    2   3   10000000     0.306      1.293
//    2   6   10000000     0.363      1.384
//    2   9   10000000     0.424      1.489
//    2  16   10000000     0.564      1.808
//    2  32   10000000     0.901      2.492
//    3   2   10000000     0.289      1.341
//    6   2   10000000     0.424      1.366
//    9   2   10000000     0.490      1.490
//   16   2   10000000     0.751      1.734
//   32   2   10000000     1.283      2.316
//   16  16   10000000     3.322      5.051
//   32  32   10000000    12.480     16.167
//  256 256       5000     0.656      0.638  First time eigen3's faster! Without
//  512 512       5000     2.672      2.706  Similar times again.
//
// In our case we can't use fixed size matrices because the size aren't known at
// compile time. The ones of particular interest to use are 2x3, 2x9, 9x9, and
// their transposes.
//
// I also accidentally made the table above without the ".lazy()" and
// ".noalias()". In that case, what I found interesting was that for the 256x256
// case, eigen2 was 2x faster. With noalias and lazy, the times are comparable.

#include <cstdlib>
#include <iostream>
#include "Eigen/Core"

using namespace Eigen;
using namespace std;

const int num_packed_coeffs = 1024*1024;
double packed_matrices[num_packed_coeffs];

// y += A * x;
// y is N x 1, A is N x M, x is M x 1
void BenchSmallProduct(int n, int m, int num_iterations) {
  // Our matrices are all packed inside other arrays, so alignment is random.
  Map<Matrix<double, Dynamic, Dynamic> > A(packed_matrices + 105, n, m);
  Map<Matrix<double, Dynamic, 1> > y(packed_matrices + 105 + n * m, n, 1);
  Map<Matrix<double, Dynamic, 1> > x(packed_matrices + 105 + n * m + n, m, 1);
  for (int i = 0; i < num_iterations; i++) {
#if EIGEN_WORLD_VERSION == 2
    //y += A * x;
    //y += (A * x).lazy();
    y.noalias() += A * x;
#elif EIGEN_WORLD_VERSION == 3
    y.noalias() += A * x;
#else
#error "Something weird happened."
#endif
  }
  // Final print to prevent this from getting optimized out.
  cout << "Result: " << y.transpose() << endl;
  cout << "World version: " << EIGEN_WORLD_VERSION << endl;
}

int main(int argc, char **argv) {
  int n = atoi(argv[1]);
  int m = atoi(argv[2]);
  int num_iterations = atoi(argv[3]);

  for (int i = 0; i < num_packed_coeffs; i++) {
    packed_matrices[i] = 2.0 * i / num_packed_coeffs;
  }

  BenchSmallProduct(n, m, num_iterations);
  return 0;
}

Attachment: small_matrix_products.sh
Description: Bourne shell script



Mail converted by MHonArc 2.6.19+ http://listengine.tuxfamily.org/