[eigen] Re: recent improvements in the products




In case some people are interested in high performance on large matrices, I've almost finished implementing fast routines covering the complete BLAS API. These include:

* fast triangular * general matrix/vector (TRMV / TRMM), e.g.:
  m3 += s1 * m1.adjoint().triangularView<UnitUpperTriangular>() * m2;
  m3 -= s1 * m2 * m1.triangularView<LowerTriangular>();

* fast triangular solver for multiple right hand sides (TRSV / TRSM):
  m1.adjoint().triangularView<UnitUpperTriangular>().solveInPlace(m2);

* fast selfadjoint * general matrix/vector (SYMV / SYMM / HEMM), e.g.:
  m3 += s1 * m1.selfadjointView<UpperTriangular>() * m2;
  m3 -= s1 * m2 * m1.selfadjointView<LowerTriangular>();

* fast rank 1 / rank K update of a selfadjoint matrix (SYR / SYRK), e.g.:
  m1.selfadjointView<UpperTriangular>().rankUpdate(u,s); // m1 += s * u * u^T
  m1.selfadjointView<UpperTriangular>().rankUpdate(m2.adjoint(),s); // m1 += s * m2.adjoint() * m2

* fast rank 2 update of a selfadjoint matrix (SYR2), e.g.:
  m1.selfadjointView<UpperTriangular>().rankUpdate(u,v,s); // m1 += s * u * v^T + s * v * u^T
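As a side note, the math that rankUpdate(u,s) performs (m1 += s * u * u^T, touching only the stored triangle) can be sketched in a few lines of plain C++. This is only a naive illustration of the semantics for a column-major matrix with the upper triangle stored; it is not Eigen's blocked/vectorized implementation, and the function name is made up:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive rank-1 update of the upper triangle of a column-major n x n matrix,
// i.e. m += s * u * u^T restricted to entries with i <= j.
// Illustration only; the real rankUpdate() is blocked and vectorized.
void rank1_update_upper(std::vector<double>& m, const std::vector<double>& u,
                        double s, std::size_t n)
{
    for (std::size_t j = 0; j < n; ++j)        // column
        for (std::size_t i = 0; i <= j; ++i)   // rows in the upper triangle
            m[i + j * n] += s * u[i] * u[j];
}
```

Note that the strictly lower triangle is never read or written, which is exactly what lets the self-adjoint routines run at half the memory traffic of a general update.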


Some remarks/questions:

* all these new routines are just high level blocking algorithms built on top of a single highly optimized product kernel.

* they all work at full speed in all configurations (storage orders).

* are you ok with the new API to deal with triangular/selfadjoint matrices?

* do you prefer the shorter triangular<> / selfadjoint<> instead of triangularView<> / selfadjointView<>?

* are you ok with the rankUpdate() API? do you have any better idea to expose this?

* currently we can only solve for X = M^-1 * Y, but we also need X = Y * M^-1. Implementation-wise we just have to transpose everything, so the user can probably already do:
   m1.transpose().triangularView<UpperTriangular>().solveInPlace(m2.transpose());
 but that's not very nice API-wise. The only solution I see is to add an option to TriangularView::solveInPlace(), e.g., solveInPlace(m2, Right);

* these new special products are lazy by default.
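For what it's worth, the transpose trick behind the right-hand-side solve (X = Y * M^-1 becomes the lower-triangular system M^T X^T = Y^T) can be checked on a tiny hand-written example. This is plain C++ purely to illustrate the identity; the function name is mine, not part of any API:

```cpp
#include <cassert>
#include <cmath>

// Solve x * U = y for a 2x2 upper-triangular U by transposing:
// U^T x^T = y^T is a *lower*-triangular system, solved by forward
// substitution. This is the identity that lets a "Right" solve reuse
// the existing left-side triangular solver.
void solve_right_upper2(const double U[2][2], const double y[2], double x[2])
{
    // Forward substitution on U^T (lower triangular):
    x[0] = y[0] / U[0][0];
    x[1] = (y[1] - x[0] * U[0][1]) / U[1][1];
}
```

Checking the result: x * U = [x0*U00, x0*U01 + x1*U11], which is exactly the system the two substitution steps invert.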

good night,
Gael.

On Wed, Jul 8, 2009 at 7:14 PM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:
Hi,

just an email to keep the list up to date about the recent progress I made on the matrix-matrix and matrix-vector products. So far only a limited set of expressions (only block, transpose, and operator+=) was efficiently handled by our fast product kernels, for instance:

C = A*B;
C = A.transpose() * B.block(....);
C.col(i) += A * B.row(j).transpose();

This was not bad, but it does not cover all the possibilities offered by the xGEMM BLAS routine. So I recently extended the matrix product logic to catch a large variety of expressions which can be mapped to a single xGEMM call (or something equivalent to xGEMM). All these changes have been done in this branch: http://bitbucket.org/bjacob/eigen2_special_matrices/ to simplify future merging. Here are some examples of expressions which each compile to a single call to our own xGEMM-like routine without any temporary:

m3 -= (s1 * m1 * m2).lazy();
m3  = (m1.adjoint() * (s1*m2).conjugate()).lazy();
m3 -= ((-m1*s2) * s1*m2.adjoint()).lazy();
v3 += ((-m1.adjoint() * s2) * (s1 * v1.adjoint())).lazy();

legend:
  m = matrix
  v = vector
  s = scalar

In all these expressions we automatically extract a single scalar multiple of the complete expression, and we tag the rhs and lhs to know whether we have to take the complex conjugate of their coefficients, bypassing all the nested expressions.

There are still a few limitations though:


The first one is:

m3 = (s * (m1 * m2)).lazy();

which is evaluated as "m3 = s * (m1 * m2).eval()", requiring two evaluations and a temporary.
This is because the rhs matrix product is automatically evaluated by the scalar product expression. Therefore it is not possible to catch such an expression this way:

template<typename Lhs, typename Rhs>
MatrixBase::operator=(const CwiseUnaryOp<ei_scalar_multiple<Scalar>, Product<Lhs,Rhs> >& xpr);

it is too late!
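To illustrate why it is too late, here is a toy caricature (all names are made up and the real expression trees are far richer than this): once the product is stored inside the scalar-multiple expression through its Nested typedef, it has already been replaced by an evaluated temporary, so operator= never sees a Product type anymore:

```cpp
#include <cassert>

// Toy caricature of the nesting problem (hypothetical types, not Eigen's).
static int eval_count = 0;

struct Matrix1x1 { double v; };  // a 1x1 "matrix" keeps the sketch tiny

struct Product {
    Matrix1x1 a, b;
    // Products are nested *by value, evaluated*: storing one inside
    // another expression triggers eval() right away.
    typedef Matrix1x1 Nested;
    Matrix1x1 eval() const { ++eval_count; return Matrix1x1{a.v * b.v}; }
    operator Matrix1x1() const { return eval(); }
};

struct ScalarMultiple {
    double s;
    Matrix1x1 nested;  // by now the product is a plain evaluated matrix
    ScalarMultiple(double s_, Product::Nested n) : s(s_), nested(n) {}
    Matrix1x1 eval() const { return Matrix1x1{s * nested.v}; }
};

// Building "s * (m1 * m2)" as an expression already costs one evaluation,
// before any assignment operator gets a chance to pattern-match on Product.
double scaled_product(double s, double m1, double m2)
{
    ScalarMultiple xpr(s, Product{Matrix1x1{m1}, Matrix1x1{m2}});
    return xpr.eval().v;
}
```

The counter shows the product is evaluated while the expression is being *constructed*, i.e. before any operator= overload could inspect its type.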

So basically the user will have to make sure he/she omits the parentheses:

m3 = (s * m1 * m2).lazy();

Likewise:

m3 = m3 - m1 * m2;

has to be written using -=:

m3 -= (m1 * m2).lazy();

(moreover, we cannot detect at compile time that the two "m3" are indeed the same matrix)



The second limitation is in expressions such as:

m3 = ((m1*s1).transpose() * m2).lazy();

which is evaluated as:

m3 = (((m1*s1).transpose()).eval() * m2).lazy();

This is because the analysis of the product factors stops at the first Transpose(), and therefore it only sees an expression which has to be evaluated before we can perform the product.



Another thing is that the product order is not respected. For instance:

m3 -= ((s1 * m1).conjugate() * (-m2*s2)).lazy();

is transformed into:

m3 += (((-1) * ei_conj(s1) * (-1) * s2) * (m1.conjugate() * m2)).lazy();

However I don't think that's a big deal, because the floating point operations in the matrix product are already performed in an unspecified order anyway.
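This folding is easy to check numerically with 1x1 complex matrices (plain std::complex arithmetic, function names made up): the extracted scalar is (-1)*conj(s1)*(-1)*s2, and only m1 carries a conjugate tag:

```cpp
#include <cassert>
#include <complex>

typedef std::complex<double> C;

// Original form:  m3 -= (s1*m1).conjugate() * (-m2*s2)
// (with 1x1 matrices, products are plain complex multiplications)
C update_original(C s1, C m1, C m2, C s2)
{
    return -(std::conj(s1 * m1) * (-m2 * s2));
}

// Folded form:  m3 += ((-1)*conj(s1)*(-1)*s2) * (conj(m1) * m2)
// i.e. one extracted scalar times a product of (possibly conjugated) factors.
C update_folded(C s1, C m1, C m2, C s2)
{
    C scalar = C(-1) * std::conj(s1) * C(-1) * s2;
    return scalar * (std::conj(m1) * m2);
}
```

The two forms agree up to floating-point reordering, which is the point: the reassociation only moves scalar factors around, it never changes the math.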

That's it.

gael.



--
Gaël Guennebaud
Iparla - INRIA Bordeaux
(+33)5 40 00 37 95

