Re: [eigen] Performance colwise matrix mult vs. raw loop |

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]

*To*: eigen <eigen@xxxxxxxxxxxxxxxxxxx>*Subject*: Re: [eigen] Performance colwise matrix mult vs. raw loop*From*: Gael Guennebaud <gael.guennebaud@xxxxxxxxx>*Date*: Mon, 15 Jan 2018 11:54:11 +0100*Dkim-signature*: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:in-reply-to:references:from:date:message-id:subject:to; bh=TMpt0rrnb4IgFBYWOHQQcL4Poc2s4ut7j91oJaGx8VE=; b=fyGdAkHsIl6G9+pYxc7vcsjGvSYQ9DZeGr9D9067P7TMf0K0fXEDx9IWo8qgLPcU+N tUO4+ZLl2DqCflh52wGFpmydVEXvg85xXdo4AZZ1AH49tye2CCU3TTyl7BBcfxceI6fi 7xgOmk7SfqUNgUOfPyxY88rbBWB9B1cUqHLWLjUk+uaLs1eVPDahY6dndXuX4hh2uLOc GU4VqLA20hAqFD4o5U1I6BHSwcKen/LgWK2FGCTL+gLQOLm97Y3NVTkBpSxFpz6S9CaA oj4ZIbn9lBfaIvMmO5REw3T8LGzUDNVc92rW2ct6OKRW6K8CbvG30wXN1bIO4PL3a44O LyFQ==

Hi,

both code are not equivalent because one is applying the transformation in-place, whereas the other one need to allocate a temporary. Nonetheless, the root of the "problem" does not lie here and it is a bit more complicated. Let's consider a simpler _expression_:

Matrix3d A;

Matrix3Xd pts(3,N);

pts = A * pts;

in this _expression_, the product cannot be carried out in-place because of aliasing issue and a temporary has to be created. Of course, as you realized, if we evaluate the result one column at once, then instead of allocating a whole 3xN temporary, it is enough to allocate a 3x1 temporary vector, and since the size is known at compile time and that it is very small, it can be "allocated" on the stack and even optimized away by the compiler. Unfortunately there is not way for Eigen to figure this out, especially at compile-time. When there is no aliasing at all, the user can tell it:

pts_bis.noalias() = A * pts;

In your case, we would need some kind of colwise/rowwise noalias to tell Eigen that there is no aliasing across columns or rows. To be honest I've never though about that possibility, but given the significant performance hit, and that such kind of aliasing is probably the most frequent when talking about matrix product, this might be worth the effort. I'm still not sure what would be good names for the API though.

gael

On Sat, Jan 13, 2018 at 1:20 PM, Norbert Wenzel <norbert.wenzel.lists@xxxxxxxxx> wrote:

Hi,

I'm using Eigen in a program that reads a list of (millions of) 3d

points (in bulk), (rigidly) transforms these points and then processes

the transformed points. During optimizing the program (on Linux) I found

that I could gain a few percent runtime, by replacing a colwise matrix

multiplication with a raw loop that (from my perspective) does

essentially the same. Note that the runtime gain has been achieved for

the whole program, not only for the transformation part alone.

My first question would be if the assumption that the following two code

fragments should do the same and therefore be comparable is correct:

using pointlist = Eigen::Matrix<double, 3, Eigen::Dynamic, Eigen::RowMajor>;

using transform = Eigen::Transform<double, 3, Eigen::Isometry>;

extern pointlist points;

extern const transform trans;

void eigen_matrix_transformation()

{

points = (trans * points.colwise().homogeneous());

}

void raw_loop_transformation()

{

for(auto c = points.cols(); c > 0; --c)

{

points.col(c-1) = trans * points.col(c-1).homogeneous();

}

}

Although the number of points in the matrix is not known at compile time

the dimensionality of the points is, which should be enough to determine

the sizes of any intermediate storage at compile time.

I've extracted the code[0] from my program and found that the difference

between these implementations is ~3x. When comparing the generated

code[1] it seems that the colwise matrix multiplication allocates

dynamic memory whereas the raw loop does not. (At least it may throw

std::bad_alloc.)

Note that I could reproduce these results on different Linux machines,

but not on Windows. Although on my Windows laptop the overall benchmark

ran faster in a Linux virtual machine than on the Windows host.

I'm not sure if I'm using a sub-optimal Eigen function for my task or if

the Geometry module (or at least this part of the module) is not that

optimized, because there are more important parts in Eigen that needed

optimization. So I'd like to hear from you if you can reproduce my

benchmark results and/or consider this an issue. Is this maybe already

known?

I've currently replaced the colwise matrix multiplication with a raw

loop in my code so this is not a pressing issue to me at all. I was just

really surprised with the results I got.

I'd like to hear your opinions about this benchmark.

Thanks for your work on Eigen,

Norbert

[0] https://github.com/norbertwenzel/eigen-benchmark

[1] https://godbolt.org/g/atG6uA

**Follow-Ups**:**Re: [eigen] Performance colwise matrix mult vs. raw loop***From:*Norbert Wenzel

**References**:**[eigen] Performance colwise matrix mult vs. raw loop***From:*Norbert Wenzel

**Messages sorted by:**[ date | thread ]- Prev by Date:
**[eigen] Performance colwise matrix mult vs. raw loop** - Next by Date:
**Re: [eigen] Performance colwise matrix mult vs. raw loop** - Previous by thread:
**[eigen] Performance colwise matrix mult vs. raw loop** - Next by thread:
**Re: [eigen] Performance colwise matrix mult vs. raw loop**

Mail converted by MHonArc 2.6.19+ | http://listengine.tuxfamily.org/ |