On 2016-02-05 13:36, Alberto Luaces wrote:
| Eigen version | General algorithm | Hand-coded algorithm |
| 3.2.7         | 0.10s             | 0.04s                |
| 3.3-beta1     | 0.21s             | 0.15s                |

I am attaching a minimal test case for reference.  The bottleneck lies
in the function InertiaTensor::addFace().  The data in the table were
computed with the compilation flags "-O3 -DNDEBUG".  Eigen 3.3-beta1
reports its version as "3.2.92".

That regression definitely does not look good. On my machine, I can only confirm the regression for the hand-coded version, however. The reason appears to be a call to
which is not inlined. I was able to fix that by adding EIGEN_STRONG_INLINE in many places in src/Core/AssignEvaluator.h.

@Gael, can you confirm? Or would it be better to use EIGEN_ALWAYS_INLINE here?
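
For context on that question, here is a minimal sketch of how I understand the two macros to differ. The SKETCH_* names and the dot3_* helpers are hypothetical stand-ins that do not depend on Eigen; they only mirror what I believe the definitions in Eigen/src/Core/util/Macros.h look like, so please check the actual header. With GCC or Clang, the "strong" variant is only an ordinary `inline` hint that the compiler may ignore, whereas the "always" variant carries the always_inline attribute.

```cpp
#include <cassert>

// Hypothetical stand-ins (NOT Eigen's macros), modelled on my understanding of them:
// "strong" forces inlining only on MSVC/ICC and is a plain `inline` hint elsewhere.
#if defined(_MSC_VER) || defined(__INTEL_COMPILER)
  #define SKETCH_STRONG_INLINE __forceinline
#else
  #define SKETCH_STRONG_INLINE inline
#endif

// "always" additionally uses the GCC/Clang always_inline attribute.
#if defined(__GNUC__)
  #define SKETCH_ALWAYS_INLINE __attribute__((always_inline)) inline
#else
  #define SKETCH_ALWAYS_INLINE SKETCH_STRONG_INLINE
#endif

// Tiny fixed-size kernel standing in for a small assignment loop.
// With GCC/Clang the compiler is free to leave this one out of line.
SKETCH_STRONG_INLINE double dot3_strong(const double* a, const double* b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// Same kernel, but GCC/Clang are told to inline it at every call site.
SKETCH_ALWAYS_INLINE double dot3_always(const double* a, const double* b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
```

Both helpers compute the same value; the only difference is how strongly the compiler is asked to inline them, which can be checked in the generated assembly.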

Other than that, your code is still not optimal regarding vectorization. That is partially Eigen's "fault", but it is quite hard to decide automatically what can be vectorized efficiently. For example,

  G.template leftCols<3>() * G.template leftCols<3>().transpose() + w * w.transpose();

gets vectorized, whereas the following does not:

  G.template leftCols<3>() * G.template topRightCorner<3,3>().transpose() + w * w.template head<3>().transpose();

OTOH, your version is not vectorizable (without making the vectorization logic extremely complicated), since G.block<3,3>() will not be accessed packet-wise.


