Re: [eigen] cache-friendly matrix inverse


On Thu, 14 May 2009, Benoit Jacob wrote:

So, we agree that multiplying by the inverse will be faster for small
enough sizes, and that LU solving will be faster for large enough
sizes, and the only question is what the crossover point is.

The number of floating point operations is exactly the same for both strategies. But after thinking it over, I have to admit that this is not the whole story. The index arithmetic is more involved for an LU solve, and you also have to undo the row permutation. And then you indeed have hardware issues, as you explain:

I don't have any idea about that to be honest, and since we're talking
about small sizes here it'll even depend on e.g. whether the size of a
column (assuming column-major storage) is a multiple of the SIMD
vector size, etc... so one can't give a very general answer. About
testing with Eigen, well since we're talking about small sizes
unrolling will be very important, currently the matrix-vector product
is unrolled but not the triangular solver so i don't think this will
produce useful numbers.

I tested anyway. Indeed the inverse is massively faster for small fixed-size matrices. But it is also massively faster for small dynamic-size matrices (there is no loop unrolling in this case, is there?). And it is still clearly faster for 100 x 100 matrices.

One important caveat: I did not test for accuracy. Solving via the LU decomposition is supposed to be more stable than multiplying by the inverse.

I can see two possibilities: either the received wisdom in numerical analysis is wrong, or there is a lot of scope to optimize the LU solve.

The timings are below. I attach the code. I compiled with GCC 4.3.3 with -O2 -DNDEBUG -msse2 -msse3 -mssse3 on an Intel Core2 computer.

Matrix3f   , 50000000 solves. inverse: 0.020 s   LU: 7.010 s   PLU: 3.950 s
Matrix3d   , 50000000 solves. inverse: 0.060 s   LU: 7.500 s   PLU: 5.650 s
MatrixXf(3), 50000000 solves. inverse: 1.370 s   LU: 15.580 s   PLU: 6.320 s
MatrixXd(3), 50000000 solves. inverse: 1.500 s   LU: 16.130 s   PLU: 6.570 s
Matrix4f   , 30000000 solves. inverse: 0.070 s   LU: 5.910 s   PLU: 4.720 s
Matrix4d   , 30000000 solves. inverse: 0.140 s   LU: 6.240 s   PLU: 4.310 s
MatrixXf(4), 30000000 solves. inverse: 1.090 s   LU: 10.900 s   PLU: 5.450 s
MatrixXd(4), 30000000 solves. inverse: 0.980 s   LU: 10.980 s   PLU: 5.420 s
MatrixXf(100), 200000 solves. inverse: 0.810 s   LU: 2.360 s   PLU: 2.070 s
MatrixXd(100), 200000 solves. inverse: 1.750 s   LU: 3.370 s   PLU: 2.970 s

One final note. PartialLU<Matrix3f>::solve() produces the following warning (twice):

..../Eigen/src/Core/SolveTriangular.h: In member function void Eigen::MatrixBase<Derived>::solveTriangularInPlace(const Eigen::MatrixBase<OtherDerived>&) const [with OtherDerived = Eigen::Matrix<float, 3, 1, 0, 3, 1>, Derived = Eigen::Flagged<Eigen::Matrix<float, 3, 3, 0, 3, 3>, 1024u, 0u>]:
..../Eigen/src/Core/SolveTriangular.h:189: warning: array subscript is above array bounds

I did not investigate further.

#include <boost/timer.hpp>
#include <Eigen/Array>
#include <Eigen/Core>
#include <Eigen/LU>
#include <cstdio>
#include <string>

// import most common Eigen types
USING_PART_OF_NAMESPACE_EIGEN

template <typename TM, typename TV>
void benchmark(int repeats, const char* name, int size = 0)
{
  TM A;
  TV b, x;
  if (size == 0) {
    // fixed-size matrices
    A = TM::Random();
    b = TV::Random();
  } else {
    // dynamic-size matrices
    A = TM::Random(size,size);
    b = TV::Random(size);
  }

  TM Ainv = A.inverse();
  Eigen::LU<TM> lu(A);
  Eigen::PartialLU<TM> plu(A);

  double result;
  boost::timer tm;

  for (int i=0; i<repeats; i++)
    x = (Ainv*b).lazy();
  result = tm.elapsed();
  printf("%s, %d solves. inverse: %.3f s", name, repeats, result);

  tm.restart();
  for (int i=0; i<repeats; i++)
    lu.solve(b, &x);
  result = tm.elapsed();
  printf("   LU: %.3f s", result);

  tm.restart();
  for (int i=0; i<repeats; i++)
    plu.solve(b, &x);
  result = tm.elapsed();
  printf("   PLU: %.3f s\n", result);
}

int main(int, char *[])
{
  benchmark<Matrix3f, Vector3f>(50000000, "Matrix3f   ");
  benchmark<Matrix3d, Vector3d>(50000000, "Matrix3d   ");
  benchmark<MatrixXf, VectorXf>(50000000, "MatrixXf(3)", 3);
  benchmark<MatrixXd, VectorXd>(50000000, "MatrixXd(3)", 3);
  benchmark<Matrix4f, Vector4f>(30000000, "Matrix4f   ");
  benchmark<Matrix4d, Vector4d>(30000000, "Matrix4d   ");
  benchmark<MatrixXf, VectorXf>(30000000, "MatrixXf(4)", 4);
  benchmark<MatrixXd, VectorXd>(30000000, "MatrixXd(4)", 4);
  benchmark<MatrixXf, VectorXf>(200000, "MatrixXf(100)", 100);
  benchmark<MatrixXd, VectorXd>(200000, "MatrixXd(100)", 100);
  return 0;
}
