RE: [eigen] Intel (R) MKL IE SpBLAS support in Eigen



Hi all,
Here is the patch v2 for Intel® MKL IE SpBLAS support in the Eigen library. Modifications include the implementation of the MKLSparseHandle class, a wrapper around the IE SpBLAS sparse_handle, and support for the SparseMatrix*DenseVector = DenseVector operation based on IE SpBLAS.


Details and example:
MKLSparseHandle will become a new member of Eigen’s SparseMatrix if the MKLSpBLASSupport module is included at build time. It handles the memory-management calls to mkl_sparse_?_create_csr/csc and mkl_sparse_destroy, and it also allows us to track whether a call to mkl_sparse_optimize is required.


Note that Intel® MKL IE SpBLAS supports many different hints related to operations on the matrix (non-transpose, transpose, conjugate-transpose), the operations to be performed (mv, mm, trsv, etc.), the matrix structure (general, symmetric, triangular, etc.), the diagonal values of the matrix (unit or non-unit), and how many times we expect to call these operations once optimized. This information should be provided before the call to the mkl_sparse_optimize routine. To handle this, we implemented the functions sparseAnalyze(…) and sparseOptimize(), which can be called prior to matrix-vector multiplication in order to perform all necessary optimizations; otherwise the default IE SpBLAS kernels will be used.

 

Let me show you the updates on a simple matrix-vector multiplication example:

 

/* Initialize matrix and convert it to compatible CSR or CSC format */
SparseMatrix<DATA_TYPE, SPARSE_LAYOUT> A;
A.makeCompressed();

/* Set hint to optimize for 10 iterations of A.triangularView<Lower>().transpose()*y=x */

A.sparseAnalyze(A.triangularView<Lower>().transpose(), 10);


/* Another way to set all necessary hints at once.
   Following code is used to give hints to optimize for
   100 iterations of A.triangularView<UnitUpper>().transpose()*y=x and
    20 iterations of A.transpose()*y=x */
MKLSparseHint Hint;

Hint.A_exp_calls = 20;

Hint.A_TriangularView_Upper_UnitDiag_exp_calls = 100;

Hint.IsTransposeRequired = 1;


A.sparseAnalyze(Hint);

 

/* Additionally, one can call sparseOptimize to perform all optimizations before calling matrix-vector
   multiplication; otherwise they will be performed within the first call */

A.sparseOptimize();

/* Some computations */
x = beta*x + alpha*A*y;

Remarks:

1. If there were no calls to the sparseAnalyze(…) or sparseOptimize() routines, then the sparse_handle will be created on the fly with the first call to matrix-vector multiplication.

2. If for some reason IE SpBLAS fails to create the sparse_handle or to perform the operation, then the built-in Eigen matrix-vector kernels will be used.
Here I have a question for you regarding error handling for our IE SpBLAS functionality. It is possible to catch memory allocation failures and call the proper Eigen error exception, throw_std_bad_alloc().
But other kinds of failures can also happen (we return a sparse_status from our operations that can be checked upon completion). Would it be a good idea, for a debug Eigen build, to print a message saying that SpBLAS encountered an error and then exit without falling back to the default kernels? And for release mode, we could catch the IE SpBLAS error, reroute to the default Eigen kernels, and perhaps emit a message saying what happened. Is this something you support, or could support? What do you recommend here?

 

Please let me know what you think about the current solution for bringing IE SpBLAS support to the Eigen library.
Any comments and suggestions are greatly appreciated!

Best regards,
Maria

From: Zhukova, Maria [mailto:maria.zhukova@xxxxxxxxx]
Sent: Wednesday, April 11, 2018 10:41 AM
To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: RE: [eigen] Intel (R) MKL IE SpBLAS support in Eigen

 

Hello Gael,

Thanks for your comments!
I confirm that for the mentioned matrices the call to mkl_sparse_optimize() can be a noticeable performance overhead.
So, we’ve decided to revisit our implementation considering all your suggestions and remarks. I will get back to you as soon as I finish the new patch.

Best regards,
Maria

 

From: Gael Guennebaud [mailto:gael.guennebaud@xxxxxxxxx]
Sent: Friday, April 6, 2018 4:03 PM
To: eigen <eigen@xxxxxxxxxxxxxxxxxxx>
Subject: Re: [eigen] Intel (R) MKL IE SpBLAS support in Eigen

 

 

Hi,

 

to answer your questions, let me clarify that I'm using the 2017.5.220 version of MKL because the 2018 version is not compatible with my OS (OSX 10.11). The CPU is a quad-core Haswell (i7-4960HQ).

 

I tried various symmetric matrices of about 1M x 1M with 5M to 15M nonzeros, but also some from the SuiteSparse collection such as 'bmw3_2'. To be fair, some are also significantly faster (like x2) when using MKL's IE SparseBlas, for instance 'pkustk14', but again only when doing a dozen products or more; otherwise calling mkl_sparse_optimize is counterproductive.

 

 

On Fri, Apr 6, 2018 at 8:49 PM, Kalinkin, Alexander A <alexander.a.kalinkin@xxxxxxxxx> wrote:

 

[akalinki] This can be an issue for an older version of MKL, because initially we spent additional time on doing hints/optimize when calling this every time. Starting from MKL 2018u3 we’re planning to remove this overhead; if you want, we can share an engineering build to check performance on the same binaries.

Also, about the poor performance of the SparseBlas matvec functionality in comparison with Eigen's built-in one: all such cases we interpret as issues that need to be investigated and fixed on our side. So I don’t think that we need to add any dispatch on the Eigen level between built-in kernels and ours; all this stuff should be hidden in the MKL libraries.

 

I have no doubt that SparseBlas can always beat or be as fast as Eigen without calling mkl_sparse_optimize, but I have strong doubts that calling mkl_sparse_optimize with expected_calls=10, even though the matrix will be used only once for the given operation, can come with negligible overhead.

 

And back to the main question of this topic: we have customers who use Eigen on Intel processors and are interested in optimizing their code. Introducing a new SparseMatrix type would force them to rewrite their code, which can sometimes be impossible for various reasons. That’s why we are asking to find a way to simply extend the current SparseMatrix type to support MKL functionality.

 

I agree that ideally it would be nice to have the same level of transparency as we have for dense BLAS, but since here you have to attach/store additional information for the matrices and (IMO) have to control for which operations/matrices mkl_sparse_optimize has to be called, this seems hardly doable.

 

If you want to stick with your approach of calling mkl_sparse_optimize unconditionally and storing the required information within SparseMatrix, then you can still ship a header-only "module" that does so without touching Eigen. Your current proof-of-concept is actually very close to that. You need to:

 

- inject data-members and some accessors to SparseCompressedBase through the currently available plugin mechanism,

- access them through free-functions with overloads for Transpose<> (better than injecting code to Transpose<>),

- sparse handles can then be created on demand (the first time you need one),

- and they can be automatically deleted at the destruction time of the matrix (to this end you have to wrap the handle within a tiny class that will become a member of SparseCompressedBase and call mkl_sparse_destroy in the destructor of this wrapper).

 

The main problem with this approach is that if someone modifies a SparseMatrix after having done some operations on it, then the sparse handle becomes invalid, and you would need to ask users to update their code to explicitly invalidate the sparse handle in the appropriate places. This issue is also present in your proof-of-concept.

 

gael

 

 

Thanks,

Alexander Kalinkin,

PhD, Team leader, Architect of Sparse components in Intel® MKL

 

 

From: Gael Guennebaud [mailto:gael.guennebaud@xxxxxxxxx]
Sent: Friday, April 6, 2018 5:06 AM


To: eigen <eigen@xxxxxxxxxxxxxxxxxxx>
Subject: Re: [eigen] Intel (R) MKL IE SpBLAS support in Eigen

 

 

I also tried:

 

 vec1_dense.noalias() += column_major_sparse * vec2_dense

 

Since this operation is not supported by mkl_sparse_optimize, there is no overhead here, but for some very sparse matrices (5 nnz/col) mkl_sparse_d_mv is about x2 slower than Eigen, while for denser matrices both implementations achieve the same speed.

 

gael.

 

 

On Fri, Apr 6, 2018 at 1:42 PM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:

 

 

I just tried your patch with:

 

 vec1_dense.noalias() += row_major_sparse * vec2_dense

 

and I observe a very significant drop of performance (almost x10) compared to pure Eigen (with -fopenmp) if this operation is performed only once, or very few times... Changing the 'expected_calls' parameter from 10 to 1 does not reduce the performance drop due to mkl_sparse_optimize. If I repeat this operation hundreds of times, then both Eigen and MKL achieve the same level of performance for this particular operation and the matrices I tried (1M^2 Laplacian, bi-Laplacian, and tri-Laplacian matrices).

 

So I still believe that silently and unconditionally falling back to IE SpBLAS is not the right approach. It could be that calling MKL's SpBLAS is never a loss (we have to bench), but that's clearly not the case for the optimize step. So, clearly, the optimize step must be explicitly activated on user-selected expressions only.

 

gael

 

 

Best regards,

Maria

 

From: Gael Guennebaud [mailto:gael.guennebaud@xxxxxxxxx]
Sent: Thursday, April 5, 2018 2:54 PM
To: eigen <eigen@xxxxxxxxxxxxxxxxxxx>
Subject: Re: [eigen] Intel (R) MKL IE SpBLAS support in Eigen

 

 

 

On Thu, Apr 5, 2018 at 11:00 PM, Christoph Hertzberg <chtz@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:

On 2018-04-05 15:31, Edward Lam wrote:

Would it be useful to incorporate lambdas into the interface to avoid begin/end pairs? So from the user side, we would write code like this instead:

1) Analyze and run

     SimplicialLDLT<MklSparseMatrix<double>> llt(A);
     for (int it = 0; ...; ++it) {
         if (it == 0)
             llt.matrixL().sparseAnalyzeAndRun(100, [&] { llt.solve(b); });
         else
             llt.solve(b);
         // ...
     }

2) Analyze only

     SimplicialLDLT<MklSparseMatrix<double>> llt(A);
     llt.matrixL().sparseAnalyze(100, [&] { llt.solve(b); });
     // and solve as usual

For more complicated algorithms, one can always outline the lambda and pass it into the analysis.


That would certainly clean up things a lot. Having to call
    A.beginXY();
    doStuff();
    A.endXY();
is a very C-stylish API, which is error-prone (e.g., one of the calls might not happen because it is masked by a wrong if-condition, or due to an exception that is caught only outside).
This can usually be avoided with proper C++ constructs.

 

The advantages of the begin/end pair are that 1) it keeps the API minimal, and 2) the user can easily wrap it to accommodate their own taste and usage. To avoid all sorts of redundancies, one could even write a free function on top of the begin/end pair to do:

 

while(...) {

   optimize_if(iter==0, 100, [&]{x=llt.solve(b);}, llt.matrixL());

}

 

and with variadic templates:

 

while(...) {

   optimize_if(iter==0, 100, [&]{x=LU.solve(b);}, LU.matrixL(), LU.matrixU());

}

 

Actually, this last example is not conceivable at the moment because the factors in SparseLU are not stored in a standard compressed format; we would need to add an option to tell SparseLU to turn them into regular SparseMatrix objects and to use them for solving.

 

Anyways, all these variants are essentially syntactic sugar/implementation details, and at this stage I'm more concerned about possible shortcomings in the general approach. Is it flexible enough? I'm also not a fan of introducing an MklSparseMatrix inheriting SparseMatrix, but I don't have a better proposal for now that is generic and easily applicable to decompositions.

 

 

gael

 


However, if this is required, I would suggest adding this method not only to matrices but also to the decompositions (and passing it through to the internal matrices). Otherwise, this does not scale for decompositions which use more than one matrix (like SparseLU). And we could even let the sparseAnalyze() functions return a proxy to the decomposition, which would allow writing something like:

    x = llt.sparseAnalyzeAndRun(100)->solve(b);
    // equivalent to
    llt.sparseAnalyzeAndRun(100, [&]{x=llt.solve(b);});

or
    {
        auto llt_ = llt.sparseAnalyzeAndRun(100);
        x = llt_->solve(b);
        y = llt_->matrixL() * c;
    } // destructor of llt_ calls `endAnalyze`

The naming of this method is still debatable, of course.
And I have no idea what actually happens inside MKL when it 'analyzes' an operation (after how many iterations do you actually benefit from the overhead of analyzing the operation?)


Christoph

 


Cheers,
-Edward

On 4/5/2018 8:01 AM, Gael Guennebaud wrote:

Thank you for opening this discussion on the public mailing list.

So let's discuss the public API, which currently is not very convenient, as already noticed by others. The issues are:

(i1) - Storing MKL's handle in SparseMatrix breaks ABI and does not sound very generic.
      - We need a way to control:
(i2)   - which operations are going to be analyzed/optimized,
(i3)   - and specify the 'expected_calls' parameter.


In order to discuss these issues, let's consider the following typical pattern (e.g., non-linear optimization, eigenvalues, ...):

SimplicialLDLT<SparseMatrix<double> > llt(A);

while(...) {
     ...
     x = llt.solve(b);
     ...
}

Here the triangular factor L is going to be used for triangular and transposed-triangular solves dozens to hundreds of times, but only the user of SimplicialLDLT knows that, not SimplicialLDLT, nor SparseMatrix. Moreover, the user does not own the SparseMatrix that we want to analyze/optimize. Other patterns are likely easier to handle, so let's focus on this one for now.

Regarding (i1), I would suggest introducing a new type, say MklSparseMatrix<>, that would enhance SparseMatrix<> through inheritance. Then for (i2) and (i3) we could imagine something like:

MklSparseMatrix::beginAnalysis(Index expected_calls) const {
  // turn *this to compressed mode
  // create handle
  // store expected_calls
  // enable recording mode
}

MklSparseMatrix::endAnalysis() const {
  // disable recording mode
  // [optional] call mkl_sparse_optimize
}

All states in MklSparseMatrix would be mutable.

Between a pair of beginAnalysis/endAnalysis each call to a supported operation would trigger calls to mkl_sparse_set_*_hint()/mkl_sparse_optimize.
Optionally, we could even add a "dryrun" mode for which no operation would be performed, only calls to mkl_sparse_set_*_hint() and then mkl_sparse_optimize would be called in endAnalysis(). This way mkl_sparse_optimize() would be called only once.

And that's it. Our example would look like:


SimplicialLDLT<MklSparseMatrix<double> > llt(A);
int it = 0;
while(...) {
     ...
     if(it==0) llt.matrixL().beginAnalysis(100);
     x = llt.solve(b);
     if(it==0) llt.matrixL().endAnalysis();
     ...
     ++it;
}

or using a "dry-run" mode:

SimplicialLDLT<MklSparseMatrix<double> > llt(A);

llt.matrixL().beginAnalysis(100, DryRun);
x = llt.solve(b); // permutation and division by the diagonal matrix D would still be performed, but calls to actual triangular solves would be by-passed
llt.matrixL().endAnalysis();

while(...) {
     ...
     x = llt.solve(b);
     ...
}


If someone deals directly with the factor L, then we could follow the same pattern, or copy the SparseMatrix factor L to an MklSparseMatrix:


SimplicialLLT<SparseMatrix<double> > llt(A);

MklSparseMatrix<double> L(llt.matrixL());
L.beginAnalysis(100, DryRun);
y = L.triangularView<Lower>() * x;
L.endAnalysis();
while(...) {
     ...
     y = L.triangularView<Lower>() * x;
     ...
}


This design is quite general and extendable to any sparse optimizer, even built-in ones in the future.

In contrast to the current proposal, only selected operations would be passed to MKL (one needs to use an MklSparseMatrix plus a begin/end recording phase).

What do you think?


gael


On Tue, Apr 3, 2018 at 11:39 PM, Zhukova, Maria <maria.zhukova@xxxxxxxxx <mailto:maria.zhukova@xxxxxxxxx>> wrote:

    Hello Eigen community,

    My name is Maria Zhukova and I’m a software development engineer in the
    Intel® MKL Sparse team.

    My team is interested in contributing to Eigen, so I’ve investigated our
    possibilities and so far this is what I have:
    Eigen supports different operations for sparse matrices stored in CSR and CSC
    format which can be implemented on the basis of IE SpBLAS kernels (please
    refer to
    https://software.intel.com/en-us/mkl-developer-reference-c-inspector-executor-sparse-blas-routines
    <https://software.intel.com/en-us/mkl-developer-reference-c-inspector-executor-sparse-blas-routines>
    for the general idea of the interfaces).
    Basically, we want to implement calls to our IE SpBLAS in the following
    operations:

                     SparseMatrix + SparseMatrix (mkl_sparse_?_add),
                     SparseMatrix * DenseVector  (mkl_sparse_?_mv),
                     SparseMatrix * DenseMatrix  (mkl_sparse_?_mm),
                     SparseMatrix * SparseMatrix (mkl_sparse_spmm),
    and Triangular solve (mkl_sparse_?_trsv).

    I’ve already started with the implementation of the sparse_time_dense_impl_mkl
    kernel, which is based on mkl_sparse_?_mv (included in the patch).

    This is how it will look for the user:
    *#include <Eigen/SpBLASSupport>* <-- *NEW:* IE SpBLAS include module

    int main () {
      SparseMatrix<double, RowMajor> A;
      Matrix<double, Dynamic, 1> x, y;

      A.makeCompressed(); /* Convert matrix A into CSR/CSC format */
      *A.createSparseHandle();* /* *NEW*: used to create the handle required for
      all IE SpBLAS routines */

      // support of IE SpBLAS is here
      y = beta*y + alpha*A*x; /* call to mkl_sparse_?_mv with operation =
      SPARSE_OPERATION_NON_TRANSPOSE */
      y = beta*y + alpha*A.transpose()*x; /* call to mkl_sparse_?_mv with
      operation = SPARSE_OPERATION_TRANSPOSE */
      y = beta*y + alpha*A.adjoint()*x; /* call to mkl_sparse_?_mv with operation
      = SPARSE_OPERATION_CONJUGATE_TRANSPOSE */

      *A.destroySparseHandle();* /* *NEW*: used to delete the created handle */
    }


    I’ve attached a draft patch including all necessary changes and would like
    to hear your feedback.
    Please let me know if you have any questions and comments.


    Best regards,
    Maria

 

 

--
 Dr.-Ing. Christoph Hertzberg

 Besuchsadresse der Nebengeschäftsstelle:
 DFKI GmbH
 Robotics Innovation Center
 Robert-Hooke-Straße 5
 28359 Bremen, Germany

 Postadresse der Hauptgeschäftsstelle Standort Bremen:
 DFKI GmbH
 Robotics Innovation Center
 Robert-Hooke-Straße 1
 28359 Bremen, Germany

 Tel.:     +49 421 178 45-4021
 Zentrale: +49 421 178 45-0
 E-Mail:   christoph.hertzberg@xxxxxxx

 Weitere Informationen: http://www.dfki.de/robotik
 -----------------------------------------------------------------------
 Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
 Firmensitz: Trippstadter Straße 122, D-67663 Kaiserslautern
 Geschaeftsfuehrung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
 (Vorsitzender) Dr. Walter Olthoff
 Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
 Amtsgericht Kaiserslautern, HRB 2313
 Sitz der Gesellschaft: Kaiserslautern (HRB 2313)
 USt-Id.Nr.:    DE 148646973
 Steuernummer:  19/672/50006
 -----------------------------------------------------------------------

 

 

 

 

Attachment: EigenWithIESpBLAS.diff
Description: EigenWithIESpBLAS.diff


