Re: [eigen] array functionality...


I've enabled vectorization for Replicate, but the gain is not spectacular. Here is what I get before the change:


method man.: 0.190805
method a...: 0.185623
method b...: 0.188882
method c...: 0.630628

after the (trivial) change:

method man.: 0.187031
method a...: 0.185546
method b...: 0.186756
method c...: 0.414688

The problem is probably that the compiler does not optimize the modulo in the coeff/packet functions very well:

return m_matrix.template packet<LoadMode>(row%m_matrix.rows(), col%m_matrix.cols());

So maybe Replicate should use the return-by-value mechanism and issue for loops over the "blocks"... just an idea for the day I look at it more seriously...
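
To make the idea above concrete, here is a minimal sketch in plain C++. It is not Eigen's Replicate code: the Mat struct and both function names are hypothetical, made up for illustration only. The first variant does one modulo per coefficient, like the packet() line quoted above; the second loops over the replicated "blocks" and copies each one with plain indices, so no modulo is needed in the inner loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical column-major dense matrix, used only for this illustration.
struct Mat {
    std::size_t rows, cols;
    std::vector<double> data;
    Mat(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) { return data[j * rows + i]; }
    double operator()(std::size_t i, std::size_t j) const { return data[j * rows + i]; }
};

// Variant 1: every coefficient access pays for two modulo operations.
Mat replicate_modulo(const Mat& src, std::size_t rfac, std::size_t cfac) {
    assert(src.rows > 0 && src.cols > 0);  // modulo by zero would be undefined
    Mat dst(src.rows * rfac, src.cols * cfac);
    for (std::size_t j = 0; j < dst.cols; ++j)
        for (std::size_t i = 0; i < dst.rows; ++i)
            dst(i, j) = src(i % src.rows, j % src.cols);
    return dst;
}

// Variant 2: loop over the blocks and copy each one with plain indices.
Mat replicate_blocks(const Mat& src, std::size_t rfac, std::size_t cfac) {
    Mat dst(src.rows * rfac, src.cols * cfac);
    for (std::size_t bj = 0; bj < cfac; ++bj)              // block column
        for (std::size_t j = 0; j < src.cols; ++j)         // column inside the block
            for (std::size_t bi = 0; bi < rfac; ++bi)      // block row
                for (std::size_t i = 0; i < src.rows; ++i) // contiguous inner copy
                    dst(bi * src.rows + i, bj * src.cols + j) = src(i, j);
    return dst;
}
```

Both variants produce the same matrix. Whether the block version is actually faster depends on the compiler and on the matrix sizes, so it would have to be measured.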


On Tue, Mar 9, 2010 at 9:01 PM, Hauke Heibel <hauke.heibel@xxxxxxxxxxxxxx> wrote:
Just wanted to let you know that GCC performs as expected - ignoring
what Benoit just confirmed to be probably a little bug.

error......: 0

method man.: 0.172443
method a...: 0.148587
method b...: 0.149701
method c...: 0.584348

Expected in the sense that GCC'ed Eigen beats the manual path.

- Hauke

On Tue, Mar 9, 2010 at 8:42 PM, Hauke Heibel wrote:
> On Tue, Mar 9, 2010 at 3:26 PM, Benoit Jacob <jacob.benoit..1@xxxxxxxxx> wrote:
>>> Just did that and the Eigen-fied version
>>> norms = (x.replicate(1,y.cols()) - y).matrix().squaredNorm()
>>> is way slower...
>> How about using a colwise() here?
> Which is what I actually did - it was just a typo. I also know now
> why this is so much slower. The issue is that the final reduction
> does not see that this is vectorizable, so an unvectorized path is
> chosen.
>> I don't remember for sure whether squaredNorm is available in partial
>> reductions, but if it's not then it's easy to add, or you can replace
>> it with this:
>> norms = (x-y).abs2().colwise().sum()
> That was quite a good hint, since now I am getting vectorization.
> I attached an example of computing the column-wise squared norm of a
> matrix. I tried out four possibilities.
> 1) manual (0.163722 secs)
> 2) semi-manual, loop+abs2().sum() (0.360112 secs)
> 3) semi-manual, loop+matrix().squaredNorm() (0.358127 secs)
> 4) full-automatic (1.1833 secs)
> On MSVC 1) is the clear winner; probably, and hopefully, with GCC 1), 2)
> and 3) will be on par
> 2) and 3) perform nearly identically
> 4) is losing since a non-vectorized path is chosen
> I don't want to cause more work than you already have right now - so
> letting this topic rest is fine with me.
> There is only one thing I would like to bring up for the future. Eigen
> is offering many possibilities to solve one and the same problem. In
> general, it is clear that not all of them offer or even can offer the
> same performance -- nonetheless I think we might consider making
> people more aware of this fact by adding some information to
> the docs.
> I will put a marker on this post and try to find some time in the future.
> - Hauke
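
For readers without the attachment, here is a rough sketch of what a "manual" version of this computation might look like. It is a reconstruction in plain C++, not the attached benchmark: the function name and the flat column-major layout are assumptions. For a vector x and a matrix y it computes norms[j] = sum_i (x[i] - y(i,j))^2, which is what the replicate/colwise expressions discussed above express in Eigen.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: squared distance between x and every column of y.
// y is assumed to be stored column-major in a flat array of size rows * cols.
std::vector<double> colwise_squared_dist(const std::vector<double>& x,
                                         const std::vector<double>& y,
                                         std::size_t cols) {
    const std::size_t rows = x.size();
    assert(y.size() == rows * cols);
    std::vector<double> norms(cols, 0.0);
    for (std::size_t j = 0; j < cols; ++j) {
        const double* col = y.data() + j * rows;  // start of column j
        double acc = 0.0;
        for (std::size_t i = 0; i < rows; ++i) {
            const double d = x[i] - col[i];
            acc += d * d;  // accumulate the squared difference
        }
        norms[j] = acc;
    }
    return norms;
}
```

The inner loop walks one column contiguously, which is the access pattern a compiler can vectorize most easily; the timings in this thread were of course measured on the real Eigen code, not on this sketch.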
