Re: [eigen] Blas performance on mapped matrices

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]

To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: Re: [eigen] Blas performance on mapped matrices
From: Gael Guennebaud <gael.guennebaud@xxxxxxxxx>
Date: Mon, 9 Jan 2012 21:35:58 +0100
Cc: Keir Mierle <keir@xxxxxxxxxx>
Dkim-signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type:content-transfer-encoding; bh=GGzHETYu2poi5RPIhzmRDpZsD1f8F9fz4Qq+6Nz49SU=; b=Po2sdfPPz/rRJybZE24ky5r76CYYMygOBs25Vt3EajyId2fwE23dlckwYiueQcKvo3 YGG8VQRuXhEY9VRUomTioWCfk8aYjeaD1iT/u5V7yNklmsbbkoXbA4X7hCE4c5n2ho4L SfvR1rZg3KzcBlSP383gOCZuLvts9JvlxOq1g=

On Mon, Jan 9, 2012 at 8:16 PM, Sameer Agarwal <sameeragarwal@xxxxxxxxxx> wrote:
> @jitse, @gael and @benoit
>
> setting EIGEN_CACHEFRIENDLY_PRODUCT_THRESHOLD to 16 increases
> performance significantly and we are now within hitting range of the
> eigen2 implementation. We are still about ~5% slower than the eigen2
> implementation. But this is something we can live with.

I'm really surprised because I observed the opposite. Do you use
double or float? gcc version?

>
> I am already using noalias and it does help. As for static sizing a I
> mentioned in my follow up email, the constructor syntax that I am
> looking for is
>
> block<rowsize, colsize>(r, c, rowsize1, colsize1)
>
> very much along the lines of Matrix<rowsize, colsize>(rowsize1, colsize1)

You can mimic it like this:

Block<MatrixType,rowsize,colsize>(mat, r, c, rowsize1, colsize1)

though I don't see any reason not to add what you proposed.


> While I have your attention on the subject,  Am I correct in assuming that
>
> A.rankupdate(B.transpose(), 1.0)
>
> assumes that there is no aliasing between A and B?

you are right.

> and for small
> matrices is there something along the lines of
> EIGEN_CACHEFRIENDLY_PRODUCT_THRESHOLD that I need to worry about?

Instead of playing with this parameter you can also on a per product
basis enforce the use of an expression-based product implementation
using lazyProduct:

A.lazyProduct(B)

For small matrices, alignment is important to get advantage of
vectorization. If you can guaranty alignment, then you can tell it to
Eigen when you use Map, e.g., Map<Type,Eigen::Aligned>

> Along the same lines if I have a symmetric matrix A and matrices B and
> C whose product I know will be symmetric.

A.block(r,c, rsize, csize).triangularView<Eigen::Upper>() += B * C;

will do the job, but as you noticed for small matrices this is not
necessarily faster than doing the whole product because of the higher
logic complexity and lack of vectorization nearby the diagonal.

gael.

> Then what is the best way to do
>
> A.block(r,c, rsize, csize).selfadjointView<Eigen::Upper>().noalias() += B * C;
>
> because that does not work.
>
> I could just take the upper triangular part of the product
>
> A.block(r,c, rsize, csize).triangularView<Eigen::Upper>().noalias() += B * C;
>
> but that complains about there being no noalias() defined for triangular views.
>
> so what I end up doing is either
>
> A.block(r,c, rsize, csize).noalias += B*C;
>
> or
>
> A.block(r,c, rsize, csize).triangularView<Eigen::Upper>() += B * C;
>
> The slightly better performing solution at this stage seems to be
>
> A.block(r,c, rsize, csize).noalias += B*C;
>
> Sameer
>
> On Mon, Jan 9, 2012 at 8:09 AM, Gael Guennebaud
> <gael.guennebaud@xxxxxxxxx> wrote:
>> I really cannot reproduce, on my system all the variants using Eigen3
>> are faster than the best I can get out of Eigen2 (I used double).
>>
>> The result of my quick experiments also shown that;
>>
>> A.block<size1, size2>(r, c).noalias() -= B * C;
>>
>> is indeed the best you can do, and both the "noalias" and static sized
>> block are useful. It is in particular faster than:
>>
>> A.block<size1, size2>(r, c) -= B.lazyProduct(C);
>>
>> which uses an expression based product algorithm (tailored for very
>> small products).
>>
>> gael.
>>
>> -----------------------------------------
>>
>> #include <iostream>
>> #include <Eigen/Dense>
>> #include <bench/BenchTimer.h>
>> using namespace Eigen;
>>
>> typedef double Scalar;
>> typedef Matrix<Scalar,Dynamic,Dynamic, RowMajor> Mat;
>>
>> EIGEN_DONT_INLINE void foo1(Scalar* dat1, Scalar* dat2, Mat& A, int i, int j)
>> {
>>  Block<Mat,9,9>(A,i,j).noalias() -= (Map< Matrix<Scalar,9,3,RowMajor>
>>>(dat1) * Map< Matrix<Scalar,3,9,RowMajor> >(dat2));
>> }
>>
>> int main (int argc, char** argv)
>> {
>>  Matrix<Scalar,27,1> data1, data2;
>>  data1.setRandom();
>>  data2.setRandom();
>>
>>  Mat A(100,100);
>>
>>  BenchTimer t1;
>>  int tries = 10;
>>  int rep = 10000;
>>
>>
>>  BENCH(t1, tries, rep, foo1(data1.data(), data2.data(), A, 2,3););
>>  std::cerr << t1.best() << "s\n";
>>
>>
>>  return (0);
>> }
>>
>>
>> On Mon, Jan 9, 2012 at 2:40 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>>> 2012/1/9 Sameer Agarwal <sameeragarwal@xxxxxxxxxx>:
>>>> Hi Guys,
>>>> We are in the process of a significant code migration from eigen2 to
>>>> eigen3. The code uses Eigen::Map to map chunks of memory into RowMajor
>>>> matrices and operates on them. The primary operation is of the form
>>>>
>>>> A.block(r, c, size1, size2) -= B * C;
>>>>
>>>> A is a mapped matrix.
>>>> C is a mapped matrix.
>>>> B is an actual Eigen matrix.
>>>>
>>>> All matrices are RowMajor. For the example being considered, size1 =
>>>> size2 = 9. B is 9x3, and C is 3x9.
>>>> C and B are statically sized.
>>>>
>>>> Moving from eigen2 to eigen3 has resulting in a 30% performance
>>>> regression. Has something changed significantly in the way Eigen3
>>>> handles mapped matrices, or about the structure of matrix-matrix
>>>> multiplication in Eigen3 that would cause this?
>>>>
>>>> The compiler flags are all the same between our use of eigen2 and
>>>> eigen3. Profiling indicates that much of the time is being spent
>>>> inside Eigen::internal::gebp_kernel::operator.
>>>>
>>>> I understand that this is not sufficient information to reproduce this
>>>> problem, so I am going to try and create a minimal case which can
>>>> reproduce this performance regression. In the meanwhile any insight
>>>> into this would be useful.  Also is it possible to statically size
>>>> blocks like matrices?
>>>
>>> Yes, as explained on http://eigen.tuxfamily.org/dox/TutorialBlockOperations.html
>>> (also see Jitse's email, using that syntax).
>>>
>>> I agree with Jitse's suggestion of playing with .noalias() and with
>>> EIGEN_CACHEFRIENDLY_PRODUCT_THRESHOLD, especially given your very
>>> special size where one of the two dimensions only is greater than the
>>> default threshold, it's very tempting to suspect that's the cause of
>>> your regression.
>>> Regarding noalias(),  see this page:
>>> http://eigen.tuxfamily.org/dox/TopicWritingEfficientProductExpression.html
>>> Cheers,
>>> Benoit
>>>
>>>>
>>>> Thank you,
>>>> Sameer
>>>>
>>>>
>>>
>>>
>>
>>
>
>

Follow-Ups:
- Re: [eigen] Blas performance on mapped matrices
  - From: Gael Guennebaud
- Re: [eigen] Blas performance on mapped matrices
  - From: Sameer Agarwal

References:
- [eigen] Blas performance on mapped matrices
  - From: Sameer Agarwal
- Re: [eigen] Blas performance on mapped matrices
  - From: Benoit Jacob
- Re: [eigen] Blas performance on mapped matrices
  - From: Gael Guennebaud
- Re: [eigen] Blas performance on mapped matrices
  - From: Sameer Agarwal

Messages sorted by: [ date | thread ]
Prev by Date: Re: [eigen] Blas performance on mapped matrices
Next by Date: Re: [eigen] Blas performance on mapped matrices
Previous by thread: Re: [eigen] Blas performance on mapped matrices
Next by thread: Re: [eigen] Blas performance on mapped matrices

Mail converted by MHonArc 2.6.19+

http://listengine.tuxfamily.org/