[eigen] Eigen 2 to Eigen 3 performance regressions with mapped matrices

I've attached a microbenchmark, similar in spirit to what we are doing with Eigen, that illustrates a slowdown from Eigen 2 to Eigen 3. In particular, the benchmark does y += A*x, where A, x, and y are mapped, unaligned, dynamically sized but small matrices. It could be that I have not chosen appropriate compiler flags, but I am seeing performance that is 2x to 3x worse. See the header comments in the attached benchmark for more numbers.
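For the impatient, the operation being timed boils down to roughly the following (a condensed sketch of the attached code; the function name and the bare buffer pointer are just for illustration):

  #include "Eigen/Core"
  using namespace Eigen;

  // Sketch: y += A * x, where A (n x m), x (m x 1), and y (n x 1) are all
  // mapped, unaligned, out of one packed double buffer.
  void SmallMappedProduct(double *buffer, int n, int m) {
    Map<Matrix<double, Dynamic, Dynamic> > A(buffer, n, m);
    Map<Matrix<double, Dynamic, Dynamic> > y(buffer + n * m, n, 1);
    Map<Matrix<double, Dynamic, Dynamic> > x(buffer + n * m + n, m, 1);
    y.noalias() += A * x;  // Eigen 3 spelling; Eigen 2 uses y += (A * x).lazy();
  }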

Keir
// Copyright 2012 Google Inc. All Rights Reserved.
// Author: keir@xxxxxxxxxx (Keir Mierle)
//
// I compiled this benchmark with the stock GCC 4.4.3-4ubuntu5 on Ubuntu 10.04,
// against Eigen 2 (2.0.17), Eigen 3 head (currently 3.1.x), and Eigen 3.0.4.
// The results for Eigen 3.0.4 and head were the same. Compile flags:
//
//   g++ -DV2 -O2 small_matrix_products.cc -o small_matrix_products_v2 -I src/eigen-2.0.17
//   g++ -DV3 -O2 small_matrix_products.cc -o small_matrix_products_v3 -I src/eigen-3.0.4
//   g++ -DV3 -O2 small_matrix_products.cc -o small_matrix_products_v3 -I wrk/eigen
//
// Then I timed several configurations; the results below are for Eigen 3 head:
//
//   for x in 2 3; do time "./small_matrix_products_v$x" 2 3 10000000; done
//   for x in 2 3; do time "./small_matrix_products_v$x" 2 6 10000000; done
//   for x in 2 3; do time "./small_matrix_products_v$x" 2 9 10000000; done
//   ...
//
// On my desktop, which is a 4-core Xeon X5550 (non-Sandybridge Core i7), Eigen
// 3 gets beaten all over the playground by Eigen 2 in this benchmark.
//
//   N    M   Iterations   V2 (sec)   V3 (sec)
//    2   3   10000000     0.306      1.293
//    2   6   10000000     0.363      1.384
//    2   9   10000000     0.424      1.489
//    2  16   10000000     0.564      1.808
//    2  32   10000000     0.901      2.492
//    3   2   10000000     0.289      1.341
//    6   2   10000000     0.424      1.366
//    9   2   10000000     0.490      1.490
//   16   2   10000000     0.751      1.734
//   32   2   10000000     1.283      2.316
//   16  16   10000000     3.322      5.051
//   32  32   10000000    12.480     16.167
//  256 256       5000     0.656      0.638  First time eigen3's faster!
//  512 512       5000     2.672      2.706
//
// In our case we can't use fixed-size matrices because the sizes aren't known
// at compile time. The sizes of particular interest to us are 2x3, 2x9, 9x9,
// and their transposes, though we also need to handle potentially larger
// matrices.
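//
// For contrast only: a hypothetical sketch of the fixed-size alternative we
// cannot use (N and M are only known at runtime), shown for the 2x3 case with
// illustrative names:
//
//   Map<Matrix<double, 2, 3> > A_fixed(packed_matrices + 105);      // 2x3 A
//   Map<Matrix<double, 2, 1> > y_fixed(packed_matrices + 105 + 6);  // 2x1 y
//   Map<Matrix<double, 3, 1> > x_fixed(packed_matrices + 105 + 8);  // 3x1 x
//   y_fixed.noalias() += A_fixed * x_fixed;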
//
// I also, at first, accidentally generated the table above without the
// ".lazy()" and ".noalias()" modifiers on the product expression. In that
// case, Eigen 2 was about 2x faster for the 256x256 case. With .noalias() and
// .lazy() (as in the code below), the times are comparable.
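//
// For reference, the three spellings of the product differ roughly as below;
// this is illustrative only and is not part of the timed loop:
//
//   y += A * x;             // default: Eigen may evaluate the product into a temporary
//   y.noalias() += A * x;   // Eigen 3: caller promises y aliases neither A nor x
//   y += (A * x).lazy();    // Eigen 2: the equivalent way to skip the temporary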

#include <cstdlib>
#include <iostream>
#include "Eigen/Core"

using namespace Eigen;
using namespace std;

// For our use case, we have many matrices of different sizes packed into one
// big array. The packed matrices are often, though not always, small (2x3,
// etc). The alignment of the submatrices is arbitrary.
const int num_packed_coeffs = 1024*1024;
double packed_matrices[num_packed_coeffs];

// The microbenchmark, which measures the performance of y += A * x on mapped
// vectors and matrices. y is N x 1, A is N x M, x is M x 1.
void BenchSmallProduct(int n, int m, int num_iterations) {
  // Our matrices are all packed inside other arrays, so alignment is random.
  // Here the matrices and vectors are back-to-back, but this is not always the
  // case in our production code.
  Map<Matrix<double, Dynamic, Dynamic> > A(packed_matrices + 105, n, m);
  Map<Matrix<double, Dynamic, Dynamic> > y(packed_matrices + 105 + n * m, n, 1);
  Map<Matrix<double, Dynamic, Dynamic> > x(packed_matrices + 105 + n * m + n, m, 1);
  for (int i = 0; i < num_iterations; i++) {
#if defined(V2)
    y += (A * x).lazy();
#elif defined(V3)
    y.noalias() += A * x;
#endif
  }
  // Print the result to prevent the benchmark from getting optimized out.
  cout << "Result: " << y.transpose() << endl;
}

int main(int argc, char **argv) {
  if (argc != 4) {
    cerr << "Usage: " << argv[0] << " N M num_iterations" << endl;
    return 1;
  }
  int n = atoi(argv[1]);
  int m = atoi(argv[2]);
  int num_iterations = atoi(argv[3]);

  // Put some initial non-random values into the "packed matrices."
  for (int i = 0; i < num_packed_coeffs; i++) {
    packed_matrices[i] = 2.0 * i / num_packed_coeffs;
  }

  BenchSmallProduct(n, m, num_iterations);
  return 0;
}

