Yes, I'm solving systems involving SparseMatrices.
By scaling I mean halving the CPU time when doubling the number of threads.
I know that this optimal efficiency is achievable with MPI-based distributed-memory parallelism if the preconditioner is scalable, e.g. diagonal or block-diagonal preconditioners. This is what I usually get with PETSc.
Clearly, matrices and vectors are distributed in memory according to some partitioning of the problem that must be carried out in a preliminary phase.
Thanks to the increasing core counts of modern CPUs, it would be great to avoid distributed memory altogether for some problems and to combine distributed and shared memory for others. It is easy enough to obtain good scaling for matrix assembly using std::async and std::future, not a big deal (a sketch of what I mean follows below). Based on your expertise, do you think that parallelizing the matrix-vector product requires specific knowledge of the matrix sparsity pattern?
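To make this concrete, here is a minimal, self-contained sketch of the kind of task-based assembly I have in mind; the toy 1D Laplacian stencil only stands in for the real per-element contributions:

#include <Eigen/Sparse>
#include <algorithm>
#include <future>
#include <vector>

using Triplets = std::vector<Eigen::Triplet<double>>;

// Task-based assembly with std::async/std::future: each task builds its own
// triplet list for a chunk of rows, the lists are concatenated, and
// setFromTriplets builds the matrix (summing duplicate entries).
Eigen::SparseMatrix<double> assemble1DLaplacian(int n, int numTasks)
{
  auto chunkTriplets = [n](int begin, int end) {
    Triplets t;
    for (int i = begin; i < end; ++i) {
      t.emplace_back(i, i, 2.0);
      if (i > 0)     t.emplace_back(i, i - 1, -1.0);
      if (i < n - 1) t.emplace_back(i, i + 1, -1.0);
    }
    return t;
  };

  std::vector<std::future<Triplets>> tasks;
  const int chunk = (n + numTasks - 1) / numTasks;
  for (int t = 0; t < numTasks; ++t) {
    const int begin = t * chunk;
    const int end   = std::min(n, begin + chunk);
    if (begin >= end) break;
    tasks.push_back(std::async(std::launch::async, chunkTriplets, begin, end));
  }

  Triplets all;
  for (auto& f : tasks) {
    Triplets part = f.get();
    all.insert(all.end(), part.begin(), part.end());
  }

  Eigen::SparseMatrix<double> A(n, n);
  A.setFromTriplets(all.begin(), all.end());
  return A;
}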
Best regards
Lorenzo
Hi Lorenzo,
Thanks for the input. For Krylov subspace solvers most of the work is usually in the matrix-vector products (and possibly the preconditioner if you provide your own), so I'm not sure what you mean by scaling in this case. Are you running, e.g., conjugate gradients on a dense Eigen::Matrix (I assume not) or on a sparse Eigen::SparseMatrix? In the latter case, most of the scaling would come from the sparse matrix-vector product. The vector operations (dot products, norms, etc.) in the Krylov solvers are not multi-threaded in Eigen AFAICT.
However, adding this type of general multi-threaded evaluation to matrix expressions in Eigen Core by unifying it with Tensor is a much longer-term project that nobody is working on AFAICT, although it has been on the wishlist for a long time.
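In the meantime, my understanding is that the sparse matrix-vector product inside ConjugateGradient can already be multi-threaded via OpenMP if the matrix is row-major and the solver is instantiated with the full matrix (Lower|Upper). A rough sketch; the thread count and the -fopenmp build are assumptions:

#include <Eigen/Sparse>
#include <Eigen/IterativeLinearSolvers>

using SpMatRM = Eigen::SparseMatrix<double, Eigen::RowMajor>;

// Sketch: with OpenMP enabled at build time, the A*x products inside the
// Krylov loop can run on several threads; the dot products and norms stay
// single-threaded.
Eigen::VectorXd solveCG(const SpMatRM& A, const Eigen::VectorXd& b)
{
  Eigen::setNbThreads(8);  // assumed thread count, requires an OpenMP build
  Eigen::ConjugateGradient<SpMatRM, Eigen::Lower | Eigen::Upper> cg;
  cg.compute(A);
  return cg.solve(b);
}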
Best regards,
Rasmus
Dear all,
I hope that this new implementation will also improve the scaling of the Eigen iterative solvers, e.g. conjugate gradient and biconjugate gradient. What I usually get is a factor-2 speed-up, independent of the number of threads.
Thanks for the effort.
Best regards.
Lorenzo
Hi Peter,
I would recommend that you hold off on this change for a bit. We are planning to move the non-blocking threadpool from unsupported into core in the near future and to use it as the basis for multi-threading without depending on OpenMP.
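For reference, a rough sketch of the thread pool as it exists today under unsupported/; the exact interface may still change when it moves into core, and the use of Eigen::Barrier for synchronization is an assumption about what that module provides:

#include <unsupported/Eigen/CXX11/ThreadPool>
#include <Eigen/Dense>

// Sketch: schedule one task per column on the non-blocking thread pool and
// wait for completion. Each task writes a distinct column, so no locking
// is needed.
void scaleColumns(Eigen::MatrixXd& M, double alpha, int numThreads)
{
  Eigen::ThreadPool pool(numThreads);
  Eigen::Barrier done(static_cast<unsigned int>(M.cols()));
  for (Eigen::Index k = 0; k < M.cols(); ++k) {
    pool.Schedule([&M, alpha, k, &done] {
      M.col(k) *= alpha;
      done.Notify();
    });
  }
  done.Wait();  // block until every scheduled task has finished
}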
Rasmus
Dear all,
I'm currently playing with the sparse matrix implementation within Eigen,
namely Eigen::SparseMatrix<double, Eigen::ColMajor> combined with a self-adjoint view.
In my application I need to multiply a symmetric sparse matrix by a dense matrix,
where the dimensions of the matrices are on the order of a few tens of thousands.
I tried to parallelize the dot product in SparseSelfAdjointView.h by copying an OpenMP directive
from other parts of Eigen, and I tried to parallelize the outer loop directly
with std::execution::par, see below.
In summary, I do not see any effect of the parallelization.
Before digging deeper into it and building a minimal working example,
has anyone already achieved this?
Could one call the corresponding MKL routine directly, or are the internal storage schemes different?
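As a point of comparison, here is an untested sketch that parallelizes over the columns of the dense factor from user code, without touching SparseSelfAdjointView.h; the function name and the OpenMP scheduling clause are placeholders:

#include <Eigen/Sparse>
#include <Eigen/Dense>

// Sketch: each OpenMP thread runs the existing (sequential) symmetric
// product on its own right-hand-side column, writing a disjoint column of
// the result, so no synchronization is needed. Requires -fopenmp.
void symSparseTimesDense(const Eigen::SparseMatrix<double, Eigen::ColMajor>& A,
                         const Eigen::MatrixXd& B, Eigen::MatrixXd& C)
{
  C.resize(A.rows(), B.cols());
  const Eigen::Index cols = B.cols();
  #pragma omp parallel for schedule(dynamic)
  for (Eigen::Index k = 0; k < cols; ++k)
    C.col(k) = A.selfadjointView<Eigen::Lower>() * B.col(k);
}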
Best regards
Peter
P.S.: This is what I tried, as a diff against the current master branch:
:~/Eigen/eigen/Eigen/src/SparseCore$ git diff SparseSelfAdjointView.h
diff --git a/Eigen/src/SparseCore/SparseSelfAdjointView.h b/Eigen/src/SparseCore/SparseSelfAdjointView.h
index 0302ef3a4..91e7d495b 100644
--- a/Eigen/src/SparseCore/SparseSelfAdjointView.h
+++ b/Eigen/src/SparseCore/SparseSelfAdjointView.h
@@ -10,6 +10,11 @@
#ifndef EIGEN_SPARSE_SELFADJOINTVIEW_H
#define EIGEN_SPARSE_SELFADJOINTVIEW_H
+#include <iostream> /// only for debugging
+#include <boost/range/irange.hpp>
+
+#include <execution>
+
#include "./InternalHeaderCheck.h"
namespace Eigen {
@@ -295,8 +300,19 @@ inline void sparse_selfadjoint_time_dense_product(const SparseLhsType& lhs, cons
SparseLhsTypeNested lhs_nested(lhs);
LhsEval lhsEval(lhs_nested);
+ Eigen::initParallel();
+ Index threads = Eigen::nbThreads();
+
+ std::cout << "\ndot threads: "<< threads << " rhs-cols: " << rhs.cols() << std::endl;
+
+ // #pragma omp parallel for
+ // #pragma omp parallel for schedule(dynamic,(rhs.cols()+threads*4-1)/(threads*4)) num_threads(threads)
+ // for (Index k=0; k<rhs.cols(); ++k)
+
+ auto r = boost::irange(rhs.cols());
+
+ std::for_each(std::execution::par, r.begin(), r.end(), [&](Index k)
// work on one column at once
- for (Index k=0; k<rhs.cols(); ++k)
{
for (Index j=0; j<lhs.outerSize(); ++j)
{
@@ -330,6 +346,7 @@ inline void sparse_selfadjoint_time_dense_product(const SparseLhsType& lhs, cons
res.coeffRef(j,k) += alpha * i.value() * rhs.coeff(j,k);
}
}
+ );
}