Re: [eigen] BLAS backend


Actually, things are a bit more complicated. First of all, cublas is not 100% BLAS compatible, so we would have to either explicitly write a backend for cublas, or write a pure BLAS wrapper on top of cublas. The latter approach is the easiest but not the optimal one, because for every BLAS routine we would have to allocate GPU memory, copy the data from the CPU to the GPU, get the result back from the GPU to the CPU, and free the GPU memory. Nevertheless, for large enough matrices this naive strategy is still quite interesting. Here is a simple example doing c += a * b:

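// assumes #include <cublas.h> (legacy CUBLAS API) and <Eigen/Core>, with
// SCALAR == float and M a column-major Eigen matrix of SCALAR

// copy an Eigen matrix to cuda
void ei2cu(const M& a, void* b, int ldb = 0)
{
  cublasSetMatrix(a.rows(), a.cols(), sizeof(SCALAR), a.data(), a.stride(), b, ldb ? ldb : a.rows());
}

// copy a matrix from cuda to an Eigen matrix
void cu2ei(const void* a, M& b, int lda = 0)
{
  cublasGetMatrix(b.rows(), b.cols(), sizeof(SCALAR), a, lda ? lda : b.rows(), b.data(), b.stride());
}

// c += a * b; assuming column major matrices
void cugemm(const M& a, const M& b, M& c)
{
  // allocate GPU memory and copy the three matrices to the GPU
  void* da;
  cublasAlloc(a.size(), sizeof(SCALAR), &da);
  ei2cu(a, da);
  void* db;
  cublasAlloc(b.size(), sizeof(SCALAR), &db);
  ei2cu(b, db);
  void* dc;
  cublasAlloc(c.size(), sizeof(SCALAR), &dc);
  ei2cu(c, dc);

  // dc = 1 * da * db + 1 * dc
  cublasSgemm('N', 'N', a.rows(), b.cols(), a.cols(), 1,
              (SCALAR*)da, a.rows(),
              (SCALAR*)db, b.rows(), 1,
              (SCALAR*)dc, c.rows());

  // get the result back to the CPU and free the GPU memory
  cu2ei(dc, c);
  cublasFree(da);
  cublasFree(db);
  cublasFree(dc);
}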
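
To make the arguments of the gemm call above explicit, here is a minimal CPU reference sketch of what it computes in the 'N', 'N' case on column-major data. The name ref_sgemm is hypothetical (it is not part of BLAS, cublas or Eigen); it only mirrors the usual gemm argument order, and it is meant for checking results on small matrices, not for speed.

```cpp
#include <cassert>
#include <cstddef>

// Reference (unoptimized) column-major GEMM: C = alpha * A * B + beta * C.
// A is m x k, B is k x n, C is m x n. lda/ldb/ldc are the leading
// dimensions, i.e. the distance in elements between consecutive columns.
// Hypothetical helper: it mirrors only the 'N', 'N' (no transpose) case.
void ref_sgemm(int m, int n, int k, float alpha,
               const float* A, int lda,
               const float* B, int ldb,
               float beta, float* C, int ldc)
{
  assert(lda >= m && ldb >= k && ldc >= m);
  for (int j = 0; j < n; ++j) {
    for (int i = 0; i < m; ++i) {
      float acc = 0.0f;
      for (int p = 0; p < k; ++p)
        acc += A[i + static_cast<std::size_t>(p) * lda] *
               B[p + static_cast<std::size_t>(j) * ldb];
      float& cij = C[i + static_cast<std::size_t>(j) * ldc];
      cij = alpha * acc + beta * cij;
    }
  }
}
```

With alpha = 1 and beta = 1 this is exactly c += a * b; with beta = 0 the previous content of C is ignored and the upload of c could be skipped.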


And here are the results of the benchmark (times in seconds, using floats, GPU = GeForce GTX 280):

matrix size | Eigen (1 core) | cublas      | cublas, data transfers only
10          | 1.12414e-06    | 6.91631e-05 | 5.0287e-05
50          | 2.6665e-05     | 0.00056498  | 0.000544696
100         | 0.000152872    | 0.000714788 | 0.000657568
200         | 0.00143242     | 0.00131176  | 0.00107884
300         | 0.00370566     | 0.00226267  | 0.00184504
500         | 0.0192149      | 0.00613618  | 0.00468993
1000        | 0.178364       | 0.0246601   | 0.0135479
2000        | 1.08252        | 0.133528    | 0.049156
3000        | 3.5224         | 0.385172    | 0.106104
4000        | 8.14059        | 0.666627    | 0.188644

Note that I used a single core for Eigen's CPU version.
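
To read the table more easily, here is a small sketch that derives, from the numbers above, the share of the cublas time spent in transfers, the speedup over Eigen, and the first benchmarked size where cublas wins. The struct and function names are hypothetical helpers; only the values come from the benchmark.

```cpp
#include <cassert>
#include <cstddef>

// One row of the benchmark table (times in seconds).
struct Row {
  int size;
  double eigen;          // Eigen, 1 core
  double cublas;         // cublas, transfers included
  double transfersOnly;  // cublas, data transfers only
};

// Values copied from the table above.
static const Row kRows[] = {
  {10,   1.12414e-06, 6.91631e-05, 5.0287e-05},
  {50,   2.6665e-05,  0.00056498,  0.000544696},
  {100,  0.000152872, 0.000714788, 0.000657568},
  {200,  0.00143242,  0.00131176,  0.00107884},
  {300,  0.00370566,  0.00226267,  0.00184504},
  {500,  0.0192149,   0.00613618,  0.00468993},
  {1000, 0.178364,    0.0246601,   0.0135479},
  {2000, 1.08252,     0.133528,    0.049156},
  {3000, 3.5224,      0.385172,    0.106104},
  {4000, 8.14059,     0.666627,    0.188644},
};
static const std::size_t kRowCount = sizeof(kRows) / sizeof(kRows[0]);

// Fraction of the total cublas time spent copying data.
double transferShare(const Row& r) {
  assert(r.cublas > 0.0);
  return r.transfersOnly / r.cublas;
}

// How many times faster cublas (transfers included) is than Eigen.
double speedup(const Row& r) {
  assert(r.cublas > 0.0);
  return r.eigen / r.cublas;
}

// Smallest benchmarked size at which cublas beats Eigen, or -1 if none.
int crossoverSize(const Row* rows, std::size_t count) {
  for (std::size_t i = 0; i < count; ++i)
    if (rows[i].cublas < rows[i].eigen) return rows[i].size;
  return -1;
}
```

On these numbers, cublas starts to win at size 200, and the transfers account for about 92% of the cublas time at size 100 but only about 28% at size 4000, where cublas is roughly 12 times faster than single-core Eigen.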

Also note how costly the matrix copies between the CPU and GPU are. So, if we agree that a BLAS backend is only worth it to get GPU support, we can think about a smarter approach that keeps the data on the GPU as much as possible, and does so transparently for the user. I have a couple of ideas about that (and other related stuff) that I'm going to present next week during an internal GPU workshop to people at my research center. My hope is that people working on parallel computing will be interested in building a small collaboration around that. We'll see.
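
To illustrate what "keeping the data on the GPU" could mean, here is one possible sketch, under strong assumptions and not an actual design: each matrix keeps a host copy and a device copy plus a flag saying which one is up to date, and copies happen only when the stale side is accessed. Everything here is hypothetical (MirroredMatrix, TransferStats, deviceGemmAccumulate are made-up names), and the device is simulated with a std::vector so the example runs without a GPU; a real version would call cublasSetMatrix, cublasGetMatrix and cublasSgemm at the marked places.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts simulated host <-> device copies.
struct TransferStats {
  int uploads = 0;
  int downloads = 0;
};

// Hypothetical column-major float matrix with a host copy and a simulated
// device copy. Copies are done lazily, only when the stale side is accessed.
class MirroredMatrix {
public:
  MirroredMatrix(int rows, int cols, TransferStats& stats)
    : rows_(rows), cols_(cols),
      host_(static_cast<std::size_t>(rows) * cols, 0.0f),
      device_(host_.size(), 0.0f), stats_(&stats) {}

  int rows() const { return rows_; }
  int cols() const { return cols_; }

  // Host-side element access: download first if the device copy is newer,
  // then mark the device copy stale because the caller may write.
  float& operator()(int i, int j) {
    if (state_ == State::DeviceNewer) {
      host_ = device_;            // stand-in for cublasGetMatrix
      ++stats_->downloads;
    }
    state_ = State::HostNewer;
    return host_[i + static_cast<std::size_t>(j) * rows_];
  }

  // Device-side read access: upload only if the host copy is newer.
  const std::vector<float>& deviceRead() {
    if (state_ == State::HostNewer) {
      device_ = host_;            // stand-in for cublasSetMatrix
      ++stats_->uploads;
      state_ = State::InSync;
    }
    return device_;
  }

  // Device-side read/write access: same upload rule, then the device
  // copy becomes the newer one.
  std::vector<float>& deviceReadWrite() {
    deviceRead();
    state_ = State::DeviceNewer;
    return device_;
  }

private:
  enum class State { HostNewer, DeviceNewer, InSync };
  int rows_, cols_;
  std::vector<float> host_, device_;
  TransferStats* stats_;
  State state_ = State::HostNewer;
};

// c += a * b on the simulated device copies (stand-in for cublasSgemm).
// Assumes c is a different object from a and b.
void deviceGemmAccumulate(MirroredMatrix& a, MirroredMatrix& b, MirroredMatrix& c) {
  assert(a.cols() == b.rows() && c.rows() == a.rows() && c.cols() == b.cols());
  const std::vector<float>& da = a.deviceRead();
  const std::vector<float>& db = b.deviceRead();
  std::vector<float>& dc = c.deviceReadWrite();
  const std::size_t m = a.rows(), k = a.cols(), n = b.cols();
  for (std::size_t j = 0; j < n; ++j)
    for (std::size_t i = 0; i < m; ++i)
      for (std::size_t p = 0; p < k; ++p)
        dc[i + j * m] += da[i + p * m] * db[p + j * k];
}
```

With this scheme, three consecutive products on the same matrices cost three uploads and a single download when the result is finally read on the host, whereas the naive wrapper would do nine uploads and three downloads.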


On Thu, Oct 15, 2009 at 3:22 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:

This is very interesting and is in our TODO, but work on this hasn't
started yet. It shouldn't be hard. For
example, if you want to use BLAS GEMM,
you could just plug that in the file
Eigen/src/Core/products/GeneralMatrixMatrix.h, in the function
GeneralProduct<Lhs, Rhs, GemmProduct>::scaleAndAddTo(), at line 144 in
the development branch. Note the little cooking there to get the
"actual" lhs, rhs, alpha.


2009/10/15 Jean Sreng <jean.sreng@xxxxxx>:
> Hello, we are currently investigating the use of Eigen2 in our software
> (VR simulations) and we were wondering about BLAS (and LAPACK) backends.
> Does the newer (development) version of Eigen provide such backend ?
> Otherwise, how difficult would it be to implement such backend
> (partially in a first step, for instance to use BLAS GEMM or
> tridiagonalisation ?) ? Do you have any thoughts about using this
> backend with GPU-optimized versions of BLAS (CUBLAS ?) ?
> Thanks !
> --
> Jean
