Re: [eigen] Blas performance on mapped matrices

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]


Hi Gael,
Thanks for the suggestions. I was able to squeeze out a couple more
percent of performance out of the code.

Unfortunately our matrices are not aligned. We are working with block
sparse matrices where each block is dense, and we allocate the whole
matrix as one large array and then use Map to manipulate individual
blocks. Getting aligned blocks would require changing the way we
allocate and manage these block sparse matrices with padding between
blocks to account for alignment.

We are using double everywhere and gcc 4.6.x

Sameer




On Mon, Jan 9, 2012 at 12:35 PM, Gael Guennebaud
<gael.guennebaud@xxxxxxxxx> wrote:
> On Mon, Jan 9, 2012 at 8:16 PM, Sameer Agarwal <sameeragarwal@xxxxxxxxxx> wrote:
>> @jitse, @gael and @benoit
>>
>> setting EIGEN_CACHEFRIENDLY_PRODUCT_THRESHOLD to 16 increases
>> performance significantly and we are now within hitting range of the
>> eigen2 implementation. We are still about ~5% slower than the eigen2
>> implementation. But this is something we can live with.
>
> I'm really surprised because I observed the opposite. Do you use
> double or float? gcc version?
>
>>
>> I am already using noalias and it does help. As for static sizing a I
>> mentioned in my follow up email, the constructor syntax that I am
>> looking for is
>>
>> block<rowsize, colsize>(r, c, rowsize1, colsize1)
>>
>> very much along the lines of Matrix<rowsize, colsize>(rowsize1, colsize1)
>
> You can mimic it like this:
>
> Block<MatrixType,rowsize,colsize>(mat, r, c, rowsize1, colsize1)
>
> though I don't see any reason not to add what you proposed.
>
>
>> While I have your attention on the subject,  Am I correct in assuming that
>>
>> A.rankupdate(B.transpose(), 1.0)
>>
>> assumes that there is no aliasing between A and B?
>
> you are right.
>
>> and for small
>> matrices is there something along the lines of
>> EIGEN_CACHEFRIENDLY_PRODUCT_THRESHOLD that I need to worry about?
>
> Instead of playing with this parameter you can also on a per product
> basis enforce the use of an expression-based product implementation
> using lazyProduct:
>
> A.lazyProduct(B)
>
> For small matrices, alignment is important to get advantage of
> vectorization. If you can guaranty alignment, then you can tell it to
> Eigen when you use Map, e.g., Map<Type,Eigen::Aligned>
>
>> Along the same lines if I have a symmetric matrix A and matrices B and
>> C whose product I know will be symmetric.
>
> A.block(r,c, rsize, csize).triangularView<Eigen::Upper>() += B * C;
>
> will do the job, but as you noticed for small matrices this is not
> necessarily faster than doing the whole product because of the higher
> logic complexity and lack of vectorization nearby the diagonal.
>
> gael.
>
>> Then what is the best way to do
>>
>> A.block(r,c, rsize, csize).selfadjointView<Eigen::Upper>().noalias() += B * C;
>>
>> because that does not work.
>>
>> I could just take the upper triangular part of the product
>>
>> A.block(r,c, rsize, csize).triangularView<Eigen::Upper>().noalias() += B * C;
>>
>> but that complains about there being no noalias() defined for triangular views.
>>
>> so what I end up doing is either
>>
>> A.block(r,c, rsize, csize).noalias += B*C;
>>
>> or
>>
>> A.block(r,c, rsize, csize).triangularView<Eigen::Upper>() += B * C;
>>
>> The slightly better performing solution at this stage seems to be
>>
>> A.block(r,c, rsize, csize).noalias += B*C;
>>
>> Sameer
>>
>> On Mon, Jan 9, 2012 at 8:09 AM, Gael Guennebaud
>> <gael.guennebaud@xxxxxxxxx> wrote:
>>> I really cannot reproduce, on my system all the variants using Eigen3
>>> are faster than the best I can get out of Eigen2 (I used double).
>>>
>>> The result of my quick experiments also shown that;
>>>
>>> A.block<size1, size2>(r, c).noalias() -= B * C;
>>>
>>> is indeed the best you can do, and both the "noalias" and static sized
>>> block are useful. It is in particular faster than:
>>>
>>> A.block<size1, size2>(r, c) -= B.lazyProduct(C);
>>>
>>> which uses an expression based product algorithm (tailored for very
>>> small products).
>>>
>>> gael.
>>>
>>> -----------------------------------------
>>>
>>> #include <iostream>
>>> #include <Eigen/Dense>
>>> #include <bench/BenchTimer.h>
>>> using namespace Eigen;
>>>
>>> typedef double Scalar;
>>> typedef Matrix<Scalar,Dynamic,Dynamic, RowMajor> Mat;
>>>
>>> EIGEN_DONT_INLINE void foo1(Scalar* dat1, Scalar* dat2, Mat& A, int i, int j)
>>> {
>>>  Block<Mat,9,9>(A,i,j).noalias() -= (Map< Matrix<Scalar,9,3,RowMajor>
>>>>(dat1) * Map< Matrix<Scalar,3,9,RowMajor> >(dat2));
>>> }
>>>
>>> int main (int argc, char** argv)
>>> {
>>>  Matrix<Scalar,27,1> data1, data2;
>>>  data1.setRandom();
>>>  data2.setRandom();
>>>
>>>  Mat A(100,100);
>>>
>>>  BenchTimer t1;
>>>  int tries = 10;
>>>  int rep = 10000;
>>>
>>>
>>>  BENCH(t1, tries, rep, foo1(data1.data(), data2.data(), A, 2,3););
>>>  std::cerr << t1.best() << "s\n";
>>>
>>>
>>>  return (0);
>>> }
>>>
>>>
>>> On Mon, Jan 9, 2012 at 2:40 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>>>> 2012/1/9 Sameer Agarwal <sameeragarwal@xxxxxxxxxx>:
>>>>> Hi Guys,
>>>>> We are in the process of a significant code migration from eigen2 to
>>>>> eigen3. The code uses Eigen::Map to map chunks of memory into RowMajor
>>>>> matrices and operates on them. The primary operation is of the form
>>>>>
>>>>> A.block(r, c, size1, size2) -= B * C;
>>>>>
>>>>> A is a mapped matrix.
>>>>> C is a mapped matrix.
>>>>> B is an actual Eigen matrix.
>>>>>
>>>>> All matrices are RowMajor. For the example being considered, size1 =
>>>>> size2 = 9. B is 9x3, and C is 3x9.
>>>>> C and B are statically sized.
>>>>>
>>>>> Moving from eigen2 to eigen3 has resulting in a 30% performance
>>>>> regression. Has something changed significantly in the way Eigen3
>>>>> handles mapped matrices, or about the structure of matrix-matrix
>>>>> multiplication in Eigen3 that would cause this?
>>>>>
>>>>> The compiler flags are all the same between our use of eigen2 and
>>>>> eigen3. Profiling indicates that much of the time is being spent
>>>>> inside Eigen::internal::gebp_kernel::operator.
>>>>>
>>>>> I understand that this is not sufficient information to reproduce this
>>>>> problem, so I am going to try and create a minimal case which can
>>>>> reproduce this performance regression. In the meanwhile any insight
>>>>> into this would be useful.  Also is it possible to statically size
>>>>> blocks like matrices?
>>>>
>>>> Yes, as explained on http://eigen.tuxfamily.org/dox/TutorialBlockOperations.html
>>>> (also see Jitse's email, using that syntax).
>>>>
>>>> I agree with Jitse's suggestion of playing with .noalias() and with
>>>> EIGEN_CACHEFRIENDLY_PRODUCT_THRESHOLD, especially given your very
>>>> special size where one of the two dimensions only is greater than the
>>>> default threshold, it's very tempting to suspect that's the cause of
>>>> your regression.
>>>> Regarding noalias(),  see this page:
>>>> http://eigen.tuxfamily.org/dox/TopicWritingEfficientProductExpression.html
>>>> Cheers,
>>>> Benoit
>>>>
>>>>>
>>>>> Thank you,
>>>>> Sameer
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>



Mail converted by MHonArc 2.6.19+ http://listengine.tuxfamily.org/