[eigen] Eigen 2 to Eigen 3 performance regressions with mapped matrices
- To: eigen <eigen@xxxxxxxxxxxxxxxxxxx>
- Subject: [eigen] Eigen 2 to Eigen 3 performance regressions with mapped matrices
- From: Keir Mierle <mierle@xxxxxxxxx>
- Date: Tue, 10 Jan 2012 19:58:46 -0800
- Cc: Sameer Agarwal <sameeragarwal@xxxxxxxxxx>, tucker@xxxxxxxxxx
I've attached a microbenchmark, similar in spirit to what we are doing with Eigen, that illustrates a slowdown from Eigen 2 to Eigen 3. In particular, the benchmark does y += A*x, where A, x, and y are mapped, unaligned, dynamically-sized but small matrices. It could be that I have not chosen appropriate compiler flags, but I am seeing performance that is 2x to 3x worse with Eigen 3. Take a look at the header comments in the attached benchmark for more numbers.
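In case it helps to see the shape of the hot loop without opening the attachment, it boils down to roughly the sketch below (condensed; the attached file has the exact buffer offsets, initialization, and timing harness, and "buffer" here just stands in for the packed_matrices array):

    #include "Eigen/Core"
    using namespace Eigen;

    // Condensed version of the attached benchmark's inner loop. The offsets
    // are illustrative; see the attachment for the real setup.
    void RunMappedProduct(double* buffer, int n, int m, int num_iterations) {
      Map<MatrixXd> A(buffer, n, m);              // A is N x M
      Map<MatrixXd> y(buffer + n * m, n, 1);      // y is N x 1, accumulated into
      Map<MatrixXd> x(buffer + n * m + n, m, 1);  // x is M x 1
      for (int i = 0; i < num_iterations; ++i) {
    #if defined(V2)
        y += (A * x).lazy();      // Eigen 2 spelling of a no-alias product
    #else
        y.noalias() += A * x;     // Eigen 3 spelling
    #endif
      }
    }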
Keir
// Copyright 2012 Google Inc. All Rights Reserved.
// Author: keir@xxxxxxxxxx (Keir Mierle)
//
// I compiled this benchmark with the stock GCC 4.4.3-4ubuntu5 on Ubuntu 10.04,
// using Eigen 2 (2.0.17), Eigen 3 head (currently 3.1.x), and Eigen 3.0.4.
// The results for Eigen 3.0.4 and head were the same. Compile flags:
//
// g++ -DV2 -O2 small_matrix_products.cc -o small_matrix_products_v2 -I src/eigen-2.0.17
// g++ -DV3 -O2 small_matrix_products.cc -o small_matrix_products_v3 -I src/eigen-3.0.4
// g++ -DV3 -O2 small_matrix_products.cc -o small_matrix_products_v3 -I wrk/eigen
//
// Then I timed several configurations; the results below are for Eigen 3 head:
//
// for x in 2 3; do time "./small_matrix_products_v$x" 2 3 10000000; done
// for x in 2 3; do time "./small_matrix_products_v$x" 2 6 10000000; done
// for x in 2 3; do time "./small_matrix_products_v$x" 2 9 10000000; done
// ...
//
// On my desktop, which is a 4-core Xeon X5550 (non-Sandybridge Core i7), Eigen
// 3 gets beaten all over the playground by Eigen 2 in this benchmark.
//
//     N     M  Iterations   V2 (sec)   V3 (sec)
//     2     3    10000000      0.306      1.293
//     2     6    10000000      0.363      1.384
//     2     9    10000000      0.424      1.489
//     2    16    10000000      0.564      1.808
//     2    32    10000000      0.901      2.492
//     3     2    10000000      0.289      1.341
//     6     2    10000000      0.424      1.366
//     9     2    10000000      0.490      1.490
//    16     2    10000000      0.751      1.734
//    32     2    10000000      1.283      2.316
//    16    16    10000000      3.322      5.051
//    32    32    10000000     12.480     16.167
//   256   256        5000      0.656      0.638   First time eigen3's faster!
//   512   512        5000      2.672      2.706
//
// In our case we can't use fixed matrices because the sizes aren't known at
// compile time. The ones of particular interest to us are 2x3, 2x9, 9x9, and
// their transposes, though we need to handle potentially larger matrices. (A
// sketch of the fixed-size alternative we are ruling out follows this listing.)
//
// An earlier pass at the table above was accidentally made without the
// ".lazy()" and ".noalias()" modifiers on the product expression; in that case
// Eigen 2 was about 2x faster for the 256x256 entry. With ".lazy()" and
// ".noalias()", as above, the times are comparable.
#include <cstdlib>
#include <iostream>
#include "Eigen/Core"
using namespace Eigen;
using namespace std;
// For our use case, we have many matrices of different sizes packed into one
// big array. The packed matrices are often, though not always, small (2x3,
// etc). The alignment of the submatrices is arbitrary.
const int num_packed_coeffs = 1024*1024;
double packed_matrices[num_packed_coeffs];
// The microbenchmark, which measures the performance of y += A * x on mapped
// vectors and matrices. y is N x 1, A is N x M, x is M x 1.
void BenchSmallProduct(int n, int m, int num_iterations) {
  // Our matrices are all packed inside other arrays, so alignment is random.
  // Here the matrices and vectors are back-to-back, but this is not always the
  // case in our production code.
  Map<Matrix<double, Dynamic, Dynamic> > A(packed_matrices + 105, n, m);
  Map<Matrix<double, Dynamic, Dynamic> > y(packed_matrices + 105 + n * m, n, 1);
  Map<Matrix<double, Dynamic, Dynamic> > x(packed_matrices + 105 + n * m + n, m, 1);
  for (int i = 0; i < num_iterations; i++) {
#if defined(V2)
    y += (A * x).lazy();
#elif defined(V3)
    y.noalias() += A * x;
#endif
  }
  // Print the result to prevent the benchmark from getting optimized out.
  cout << "Result: " << y.transpose() << endl;
}
int main(int argc, char **argv) {
  if (argc != 4) {
    cerr << "Usage: " << argv[0] << " n m num_iterations" << endl;
    return 1;
  }
  int n = atoi(argv[1]);
  int m = atoi(argv[2]);
  int num_iterations = atoi(argv[3]);
  // Put some initial non-random values into the "packed matrices."
  for (int i = 0; i < num_packed_coeffs; i++) {
    packed_matrices[i] = 2.0 * i / num_packed_coeffs;
  }
  BenchSmallProduct(n, m, num_iterations);
  return 0;
}
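For contrast with the remark in the header comment about fixed-size matrices: if the dimensions were known at compile time (they aren't, in our case), the same step could be written with fixed-size maps, e.g. for the 2x3 case. This is only a sketch of the alternative we are ruling out, not something we can actually use; the function name and offsets are illustrative:

    #include "Eigen/Core"
    using namespace Eigen;

    // Hypothetical fixed-size variant of the same y += A * x step, for the
    // 2x3 case only. Not usable in our code, since n and m are runtime values;
    // shown only to make the fixed-vs-dynamic distinction concrete.
    void FixedSizeProduct(double* buffer) {
      Map<Matrix<double, 2, 3> > A(buffer);              // 2 x 3
      Map<Matrix<double, 2, 1> > y(buffer + 2 * 3);      // 2 x 1
      Map<Matrix<double, 3, 1> > x(buffer + 2 * 3 + 2);  // 3 x 1
    #if defined(V2)
      y += (A * x).lazy();
    #else
      y.noalias() += A * x;
    #endif
    }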