Re: [eigen] Blas performance on mapped matrices



Gentlemen,
Thank you for your quick and helpful replies. I will try to address
all of them in one email rather than individual replies.

@tim: I checked and there are no temporaries being created.
@jitse, @gael and @benoit

Setting EIGEN_CACHEFRIENDLY_PRODUCT_THRESHOLD to 16 increases performance
significantly, and we are now within striking distance of the eigen2
implementation. We are still about ~5% slower than eigen2, but that is
something we can live with.
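
For reference, the way we override the threshold is simply to define the
macro before including any Eigen header, along these lines (if I read the
sources correctly, Eigen's own default is 20):

// must come before the first Eigen include
#define EIGEN_CACHEFRIENDLY_PRODUCT_THRESHOLD 16
#include <Eigen/Dense>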

I am already using noalias and it does help. As for static sizing, as I
mentioned in my follow-up email, the constructor syntax that I am looking
for is

block<rowsize, colsize>(r, c, rowsize1, colsize1)

very much along the lines of Matrix<rowsize, colsize>(rowsize1, colsize1)

where the template parameters override the dynamic parameters whenever they
are not set to Eigen::Dynamic. This allows me to do some template
specialization optimizations.
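
To make that concrete, here is an untested sketch of the kind of helper I
have in mind; the name fixed_or_dynamic_block is purely hypothetical and not
existing Eigen API, it just forwards to Eigen::Block with a mix of
compile-time and run-time sizes:

// Hypothetical helper, not part of Eigen: the template parameters take
// precedence when they are not Eigen::Dynamic, otherwise the run-time
// sizes are used.
template <int Rows, int Cols, typename Derived>
Eigen::Block<Derived, Rows, Cols>
fixed_or_dynamic_block(Eigen::MatrixBase<Derived>& m,
                       int r, int c, int rows, int cols)
{
  return Eigen::Block<Derived, Rows, Cols>(m.derived(), r, c, rows, cols);
}

With something like that I could call, say,
fixed_or_dynamic_block<9, Eigen::Dynamic>(A, r, c, 9, csize) and specialize
on the row count while leaving the column count dynamic.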

While I have your attention on the subject, am I correct in assuming that

A.rankUpdate(B.transpose(), 1.0)

assumes there is no aliasing between A and B? And for small matrices, is
there something along the lines of
EIGEN_CACHEFRIENDLY_PRODUCT_THRESHOLD that I need to worry about?
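
For reference, the call I have in mind is of this form (assuming the update
goes through the selfadjoint view of A, and real scalars, so it accumulates
B.transpose() * B into the upper triangle):

// rank update of the upper triangle of A; my assumption is that A and B
// must not alias each other
A.selfadjointView<Eigen::Upper>().rankUpdate(B.transpose(), 1.0);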

Along the same lines: I have a symmetric matrix A and matrices B and C
whose product I know will be symmetric.

What, then, is the best way to write

A.block(r,c, rsize, csize).selfadjointView<Eigen::Upper>().noalias() += B * C;

because that does not work.

I could just take the upper triangular part of the product

A.block(r,c, rsize, csize).triangularView<Eigen::Upper>().noalias() += B * C;

but the compiler complains that there is no noalias() defined for
triangular views.

So what I end up doing is either

A.block(r,c, rsize, csize).noalias() += B*C;

or

A.block(r,c, rsize, csize).triangularView<Eigen::Upper>() += B * C;

The slightly better-performing solution at this stage seems to be

A.block(r,c, rsize, csize).noalias() += B*C;
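
For completeness, here is a small self-contained version of the two
compiling variants, with MatrixXd as a stand-in for our mapped row-major
matrices:

#include <Eigen/Dense>
using namespace Eigen;

// variant 1: plain dense update of the whole block (what I currently use)
void update_full(MatrixXd& A, const MatrixXd& B, const MatrixXd& C,
                 int r, int c, int rsize, int csize)
{
  A.block(r, c, rsize, csize).noalias() += B * C;
}

// variant 2: update only the upper triangle; noalias() is not defined on
// triangularView, so it is omitted here
void update_upper(MatrixXd& A, const MatrixXd& B, const MatrixXd& C,
                  int r, int c, int rsize, int csize)
{
  A.block(r, c, rsize, csize).triangularView<Upper>() += B * C;
}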

Sameer

On Mon, Jan 9, 2012 at 8:09 AM, Gael Guennebaud
<gael.guennebaud@xxxxxxxxx> wrote:
> I really cannot reproduce this; on my system all the variants using Eigen3
> are faster than the best I can get out of Eigen2 (I used double).
>
> The results of my quick experiments also show that:
>
> A.block<size1, size2>(r, c).noalias() -= B * C;
>
> is indeed the best you can do; both the "noalias" and the statically sized
> block are useful. In particular, it is faster than:
>
> A.block<size1, size2>(r, c) -= B.lazyProduct(C);
>
> which uses an expression based product algorithm (tailored for very
> small products).
>
> gael.
>
> -----------------------------------------
>
> #include <iostream>
> #include <Eigen/Dense>
> #include <bench/BenchTimer.h>
> using namespace Eigen;
>
> typedef double Scalar;
> typedef Matrix<Scalar,Dynamic,Dynamic, RowMajor> Mat;
>
> EIGEN_DONT_INLINE void foo1(Scalar* dat1, Scalar* dat2, Mat& A, int i, int j)
> {
>  Block<Mat,9,9>(A,i,j).noalias() -= (Map< Matrix<Scalar,9,3,RowMajor> >(dat1)
>                                      * Map< Matrix<Scalar,3,9,RowMajor> >(dat2));
> }
>
> int main (int argc, char** argv)
> {
>  Matrix<Scalar,27,1> data1, data2;
>  data1.setRandom();
>  data2.setRandom();
>
>  Mat A(100,100);
>
>  BenchTimer t1;
>  int tries = 10;
>  int rep = 10000;
>
>
>  BENCH(t1, tries, rep, foo1(data1.data(), data2.data(), A, 2,3););
>  std::cerr << t1.best() << "s\n";
>
>
>  return (0);
> }
>
>
> On Mon, Jan 9, 2012 at 2:40 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>> 2012/1/9 Sameer Agarwal <sameeragarwal@xxxxxxxxxx>:
>>> Hi Guys,
>>> We are in the process of a significant code migration from eigen2 to
>>> eigen3. The code uses Eigen::Map to map chunks of memory into RowMajor
>>> matrices and operates on them. The primary operation is of the form
>>>
>>> A.block(r, c, size1, size2) -= B * C;
>>>
>>> A is a mapped matrix.
>>> C is a mapped matrix.
>>> B is an actual Eigen matrix.
>>>
>>> All matrices are RowMajor. For the example being considered, size1 =
>>> size2 = 9. B is 9x3, and C is 3x9.
>>> C and B are statically sized.
>>>
>>> Moving from eigen2 to eigen3 has resulted in a 30% performance
>>> regression. Has something changed significantly in the way Eigen3
>>> handles mapped matrices, or about the structure of matrix-matrix
>>> multiplication in Eigen3 that would cause this?
>>>
>>> The compiler flags are all the same between our use of eigen2 and
>>> eigen3. Profiling indicates that much of the time is being spent
>>> inside Eigen::internal::gebp_kernel::operator().
>>>
>>> I understand that this is not sufficient information to reproduce this
>>> problem, so I am going to try and create a minimal case which can
>>> reproduce this performance regression. In the meanwhile any insight
>>> into this would be useful. Also, is it possible to statically size
>>> blocks in the same way as matrices?
>>
>> Yes, as explained on http://eigen.tuxfamily.org/dox/TutorialBlockOperations.html
>> (also see Jitse's email, using that syntax).
>>
>> I agree with Jitse's suggestion of playing with .noalias() and with
>> EIGEN_CACHEFRIENDLY_PRODUCT_THRESHOLD, especially given your very special
>> sizes, where only one of the two dimensions is greater than the default
>> threshold; it is very tempting to suspect that this is the cause of your
>> regression.
>> Regarding noalias(),  see this page:
>> http://eigen.tuxfamily.org/dox/TopicWritingEfficientProductExpression.html
>> Cheers,
>> Benoit
>>
>>>
>>> Thank you,
>>> Sameer
>>>
>>>
>>
>>
>
>


