[eigen] Two performance regressions from Eigen2 to Eigen3 with bisected

What's especially interesting is the first regression. The change is simply changing a == to a <=, but it triples the benchmark time.

# This is doing y += A * x; for y 2x1, A 2x3, x 3x1, but dynamically sized; see

# the attached benchmark.

# Eigen 2.0-rc1: 0.59s

# Eigen 3.1.x: 1.44s

# I want to find the slowdown! I ran bisect.

# At rev 2818:46717fcc1338 I got 0.277 seconds, which is nearly 2x

# faster than Eigen 2.0rc-1!

# 2818:46717fcc1338 -> 0.277

# 2867:b1513faeb8c8 -> 0.268

# 2889:10de0fab4b7c -> 1.190 <- Not 1.4s, but way slower

# 2869:cd2eedbb4b9a -> 0.268

# 2870:57403e3ebd77 -> 0.268

# 2871:fa29f384b58e -> 1.120

# FIRST REGRESSION: 0.277s to 1.19s

# changeset: 2871:fa29f384b58e

# user: Gael Guennebaud <g.gael@xxxxxxx>

# date: Tue Jun 01 22:51:47 2010 +0200

# summary: fix issue #128 : inner stride can also be 0 in which case it means 1...

# 2871:fa29f384b58e -> 1..120s

# 2870:57403e3ebd77 -> 0.268s

# SECOND REGRESSION: 1.19s to 1.350s

# changeset: 3257:16fbf706b00b

# user: Gael Guennebaud <g.gael@xxxxxxx>

# date: Sun Jul 11 15:48:30 2010 +0200

# summary: mixing types in product step 2:

# 3256:4eb3074535b1 -> 1.190s

# 3257:16fbf706b00b -> 1.380s

# THIRD REGRESSION: 1.350 to 1.420

# changeset: 3259:1835e135c596

# user: Gael Guennebaud <g.gael@xxxxxxx>

# date: Sun Jul 11 23:57:23 2010 +0200

# summary: mixing types step 3:

# 3259:1835e135c596 -> 1.420s

# 3258:39900c3d91f0 -> 1.380s

// Copyright 2012 Google Inc. All Rights Reserved. // Author: keir@xxxxxxxxxx (Keir Mierle) // // Compile with GCC 4.4.3-4ubuntu5: // // g++ -DV2 -O2 small_matrix_products.cc -o small_matrix_products_v2 -I src/eigen-2.0.17 // g++ -DV3 -O2 small_matrix_products.cc -o small_matrix_products_v3 -I src/eigen-3.0.4 // g++ -DV3 -O2 small_matrix_products.cc -o small_matrix_products_v3 -I wrk/eigen // // Then try a few combinations: // // for x in 2 3; do; time "./small_matrix_products_v$x" 2 9 10000000; done // for x in 2 3; do; time "./small_matrix_products_v$x" 2 3 10000000; done // for x in 2 3; do; time "./small_matrix_products_v$x" 2 3 10000000; done // // On my desktop, which is a 4-core Xeon X5550, Eigen 3 gets beaten all over the // playground by Eigen 2 in this benchmark, for all tried values of N and M: // // N M Iterations V2 (sec) V3 (sec) // 2 3 10000000 0.306 1.293 // 2 6 10000000 0.363 1.384 // 2 9 10000000 0.424 1.489 // 2 16 10000000 0.564 1.808 // 2 32 10000000 0.901 2.492 // 3 2 10000000 0.289 1.341 // 6 2 10000000 0.424 1.366 // 9 2 10000000 0.490 1.490 // 16 2 10000000 0.751 1.734 // 32 2 10000000 1.283 2.316 // 16 16 10000000 3.322 5.051 // 32 32 10000000 12.480 16.167 // 256 256 5000 0.656 0.638 First time eigen3's faster! Without // 512 512 5000 2.672 2.706 Similar times again. // // In our case we can't use fixed size matrices because the size aren't known at // compile time. The ones of particular interest to use are 2x3, 2x9, 9x9, and // their transposes. // // I also accidentally made the table above without the ".lazy()" and // ".noalias()". In that case, what I found interesting was that for the 256x256 // case, eigen2 was 2x faster. With noalias and lazy, the times are comparable. #include <cstdlib> #include <iostream> #include "Eigen/Core" using namespace Eigen; using namespace std; const int num_packed_coeffs = 1024*1024; double packed_matrices[num_packed_coeffs]; // y += A * x; // y is N x 1, A is N x M, x is M x 1 void BenchSmallProduct(int n, int m, int num_iterations) { // Our matrices are all packed inside other arrays, so alignment is random. Map<Matrix<double, Dynamic, Dynamic> > A(packed_matrices + 105, n, m); Map<Matrix<double, Dynamic, 1> > y(packed_matrices + 105 + n * m, n, 1); Map<Matrix<double, Dynamic, 1> > x(packed_matrices + 105 + n * m + n, m, 1); for (int i = 0; i < num_iterations; i++) { #if EIGEN_WORLD_VERSION == 2 //y += A * x; //y += (A * x).lazy(); y.noalias() += A * x; #elif EIGEN_WORLD_VERSION == 3 y.noalias() += A * x; #else #error "Something weird happened." #endif } // Final print to prevent this from getting optimized out. cout << "Result: " << y.transpose() << endl; cout << "World version: " << EIGEN_WORLD_VERSION << endl; } int main(int argc, char **argv) { int n = atoi(argv[1]); int m = atoi(argv[2]); int num_iterations = atoi(argv[3]); for (int i = 0; i < num_packed_coeffs; i++) { packed_matrices[i] = 2.0 * i / num_packed_coeffs; } BenchSmallProduct(n, m, num_iterations); return 0; }