Re: [eigen] Blas performance on mapped matrices |
[ Thread Index |
Date Index
| More lists.tuxfamily.org/eigen Archives
]
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] Blas performance on mapped matrices
- From: Sameer Agarwal <sameeragarwal@xxxxxxxxxx>
- Date: Tue, 10 Jan 2012 13:12:08 -0800
- Dkim-signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :x-system-of-record:content-type:content-transfer-encoding; bh=9F4X+AvomCYiAFn7EnK6jYD+GJnx06HN5TnAwf7i7kI=; b=0OhmDVaZYnsj0AxSeE0ftgFWHBWNBD4UqJ+xxxh48Wma+7gHIT9Zl8wcR4WtFPpcPv y29Y/3H8TzkLwSxkyVhl0RvvkNJoJ2vo9x1QEUEeKa3npJTjBfZ26aV9urGmu3j+oPAD jPER8ECCQiX8PVZqWy2av3vwz5+jcMUrHjap4=
Hi Guys,
Following up on this further. I am having similar problems with gemv operations.
y.noalias() += A * x
where y, A and x are all Mapped objects. Again, we are talking about
sizes from 3-9 in our tests. This product is twice as slow as before.
replacing this with
y += A.lazyProduct(x)
improves things, but its still slower than eigen2.
Am I missing something obvious here?
Sameer
On Mon, Jan 9, 2012 at 3:20 PM, Sameer Agarwal <sameeragarwal@xxxxxxxxxx> wrote:
> Hi Gael,
> Thanks for the suggestions. I was able to squeeze out a couple more
> percent of performance out of the code.
>
> Unfortunately our matrices are not aligned. We are working with block
> sparse matrices where each block is dense, and we allocate the whole
> matrix as one large array and then use Map to manipulate individual
> blocks. Getting aligned blocks would require changing the way we
> allocate and manage these block sparse matrices with padding between
> blocks to account for alignment.
>
> We are using double everywhere and gcc 4.6.x
>
> Sameer
>
>
>
>
> On Mon, Jan 9, 2012 at 12:35 PM, Gael Guennebaud
> <gael.guennebaud@xxxxxxxxx> wrote:
>> On Mon, Jan 9, 2012 at 8:16 PM, Sameer Agarwal <sameeragarwal@xxxxxxxxxx> wrote:
>>> @jitse, @gael and @benoit
>>>
>>> setting EIGEN_CACHEFRIENDLY_PRODUCT_THRESHOLD to 16 increases
>>> performance significantly and we are now within hitting range of the
>>> eigen2 implementation. We are still about ~5% slower than the eigen2
>>> implementation. But this is something we can live with.
>>
>> I'm really surprised because I observed the opposite. Do you use
>> double or float? gcc version?
>>
>>>
>>> I am already using noalias and it does help. As for static sizing a I
>>> mentioned in my follow up email, the constructor syntax that I am
>>> looking for is
>>>
>>> block<rowsize, colsize>(r, c, rowsize1, colsize1)
>>>
>>> very much along the lines of Matrix<rowsize, colsize>(rowsize1, colsize1)
>>
>> You can mimic it like this:
>>
>> Block<MatrixType,rowsize,colsize>(mat, r, c, rowsize1, colsize1)
>>
>> though I don't see any reason not to add what you proposed.
>>
>>
>>> While I have your attention on the subject, Am I correct in assuming that
>>>
>>> A.rankupdate(B.transpose(), 1.0)
>>>
>>> assumes that there is no aliasing between A and B?
>>
>> you are right.
>>
>>> and for small
>>> matrices is there something along the lines of
>>> EIGEN_CACHEFRIENDLY_PRODUCT_THRESHOLD that I need to worry about?
>>
>> Instead of playing with this parameter you can also on a per product
>> basis enforce the use of an expression-based product implementation
>> using lazyProduct:
>>
>> A.lazyProduct(B)
>>
>> For small matrices, alignment is important to get advantage of
>> vectorization. If you can guaranty alignment, then you can tell it to
>> Eigen when you use Map, e.g., Map<Type,Eigen::Aligned>
>>
>>> Along the same lines if I have a symmetric matrix A and matrices B and
>>> C whose product I know will be symmetric.
>>
>> A.block(r,c, rsize, csize).triangularView<Eigen::Upper>() += B * C;
>>
>> will do the job, but as you noticed for small matrices this is not
>> necessarily faster than doing the whole product because of the higher
>> logic complexity and lack of vectorization nearby the diagonal.
>>
>> gael.
>>
>>> Then what is the best way to do
>>>
>>> A.block(r,c, rsize, csize).selfadjointView<Eigen::Upper>().noalias() += B * C;
>>>
>>> because that does not work.
>>>
>>> I could just take the upper triangular part of the product
>>>
>>> A.block(r,c, rsize, csize).triangularView<Eigen::Upper>().noalias() += B * C;
>>>
>>> but that complains about there being no noalias() defined for triangular views.
>>>
>>> so what I end up doing is either
>>>
>>> A.block(r,c, rsize, csize).noalias += B*C;
>>>
>>> or
>>>
>>> A.block(r,c, rsize, csize).triangularView<Eigen::Upper>() += B * C;
>>>
>>> The slightly better performing solution at this stage seems to be
>>>
>>> A.block(r,c, rsize, csize).noalias += B*C;
>>>
>>> Sameer
>>>
>>> On Mon, Jan 9, 2012 at 8:09 AM, Gael Guennebaud
>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>>> I really cannot reproduce, on my system all the variants using Eigen3
>>>> are faster than the best I can get out of Eigen2 (I used double).
>>>>
>>>> The result of my quick experiments also shown that;
>>>>
>>>> A.block<size1, size2>(r, c).noalias() -= B * C;
>>>>
>>>> is indeed the best you can do, and both the "noalias" and static sized
>>>> block are useful. It is in particular faster than:
>>>>
>>>> A.block<size1, size2>(r, c) -= B.lazyProduct(C);
>>>>
>>>> which uses an expression based product algorithm (tailored for very
>>>> small products).
>>>>
>>>> gael.
>>>>
>>>> -----------------------------------------
>>>>
>>>> #include <iostream>
>>>> #include <Eigen/Dense>
>>>> #include <bench/BenchTimer.h>
>>>> using namespace Eigen;
>>>>
>>>> typedef double Scalar;
>>>> typedef Matrix<Scalar,Dynamic,Dynamic, RowMajor> Mat;
>>>>
>>>> EIGEN_DONT_INLINE void foo1(Scalar* dat1, Scalar* dat2, Mat& A, int i, int j)
>>>> {
>>>> Block<Mat,9,9>(A,i,j).noalias() -= (Map< Matrix<Scalar,9,3,RowMajor>
>>>>>(dat1) * Map< Matrix<Scalar,3,9,RowMajor> >(dat2));
>>>> }
>>>>
>>>> int main (int argc, char** argv)
>>>> {
>>>> Matrix<Scalar,27,1> data1, data2;
>>>> data1.setRandom();
>>>> data2.setRandom();
>>>>
>>>> Mat A(100,100);
>>>>
>>>> BenchTimer t1;
>>>> int tries = 10;
>>>> int rep = 10000;
>>>>
>>>>
>>>> BENCH(t1, tries, rep, foo1(data1.data(), data2.data(), A, 2,3););
>>>> std::cerr << t1.best() << "s\n";
>>>>
>>>>
>>>> return (0);
>>>> }
>>>>
>>>>
>>>> On Mon, Jan 9, 2012 at 2:40 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>>>>> 2012/1/9 Sameer Agarwal <sameeragarwal@xxxxxxxxxx>:
>>>>>> Hi Guys,
>>>>>> We are in the process of a significant code migration from eigen2 to
>>>>>> eigen3. The code uses Eigen::Map to map chunks of memory into RowMajor
>>>>>> matrices and operates on them. The primary operation is of the form
>>>>>>
>>>>>> A.block(r, c, size1, size2) -= B * C;
>>>>>>
>>>>>> A is a mapped matrix.
>>>>>> C is a mapped matrix.
>>>>>> B is an actual Eigen matrix.
>>>>>>
>>>>>> All matrices are RowMajor. For the example being considered, size1 =
>>>>>> size2 = 9. B is 9x3, and C is 3x9.
>>>>>> C and B are statically sized.
>>>>>>
>>>>>> Moving from eigen2 to eigen3 has resulting in a 30% performance
>>>>>> regression. Has something changed significantly in the way Eigen3
>>>>>> handles mapped matrices, or about the structure of matrix-matrix
>>>>>> multiplication in Eigen3 that would cause this?
>>>>>>
>>>>>> The compiler flags are all the same between our use of eigen2 and
>>>>>> eigen3. Profiling indicates that much of the time is being spent
>>>>>> inside Eigen::internal::gebp_kernel::operator.
>>>>>>
>>>>>> I understand that this is not sufficient information to reproduce this
>>>>>> problem, so I am going to try and create a minimal case which can
>>>>>> reproduce this performance regression. In the meanwhile any insight
>>>>>> into this would be useful. Also is it possible to statically size
>>>>>> blocks like matrices?
>>>>>
>>>>> Yes, as explained on http://eigen.tuxfamily.org/dox/TutorialBlockOperations.html
>>>>> (also see Jitse's email, using that syntax).
>>>>>
>>>>> I agree with Jitse's suggestion of playing with .noalias() and with
>>>>> EIGEN_CACHEFRIENDLY_PRODUCT_THRESHOLD, especially given your very
>>>>> special size where one of the two dimensions only is greater than the
>>>>> default threshold, it's very tempting to suspect that's the cause of
>>>>> your regression.
>>>>> Regarding noalias(), see this page:
>>>>> http://eigen.tuxfamily.org/dox/TopicWritingEfficientProductExpression..html
>>>>> Cheers,
>>>>> Benoit
>>>>>
>>>>>>
>>>>>> Thank you,
>>>>>> Sameer
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>