Sorry about the subject, since this question is about a very specific

I have written a program that performs no better than (or even a bit
worse than!) another algorithm that is theoretically expected to be an
order of magnitude slower.

The bottleneck is this line:

I4x4 += G.determinant() * G * E * G.transpose();

where I4x4, G and E are all of type Eigen::Matrix4d.

- E is constant and symmetric.
- G is homogeneous.
- I4x4 is therefore also symmetric.
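
To make the symmetry point concrete: since E is symmetric, G * E * G^T is
symmetric too, so only 10 of its 16 entries are distinct. Below is a minimal
sketch of the update in plain C++ (no Eigen) that computes only the upper
triangle of the second product and mirrors it. The names Mat4 and
add_scaled_congruence are hypothetical, and s stands in for G.determinant(),
which the caller is assumed to supply. I have not measured whether this is
any faster than what Eigen generates for the one-line expression.

```cpp
#include <array>
#include <cstddef>

// Hypothetical 4x4 row-major matrix type, used only for this illustration.
using Mat4 = std::array<std::array<double, 4>, 4>;

// Hypothetical helper: acc += s * G * E * G^T, assuming E is symmetric.
// s is a scalar supplied by the caller (e.g. the determinant of G).
inline void add_scaled_congruence(Mat4& acc, const Mat4& G, const Mat4& E, double s) {
    // Step 1: T = G * E (full 4x4 product, 16 entries).
    Mat4 T{};
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < 4; ++k) sum += G[i][k] * E[k][j];
            T[i][j] = sum;
        }
    }
    // Step 2: (T * G^T)(i, j) = sum_k T[i][k] * G[j][k].
    // The result is symmetric, so compute only j >= i and mirror it.
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = i; j < 4; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < 4; ++k) sum += T[i][k] * G[j][k];
            acc[i][j] += s * sum;
            if (j != i) acc[j][i] += s * sum;
        }
    }
}
```

This only trims the second product from 16 dot products to 10; the first
product G * E is still computed in full.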

In a "release" build, the profiler shows that most of the time is spent in

Eigen::internal::call_dense_assignment_loop<Eigen::Matrix<double, 4, 4, 0, 4, 4>, …

but I guess that "assignment" here covers not only the update of I4x4 but
also the evaluation of the whole expression.

Is there any obvious optimization advice that can be applied to improve
the performance?

Thank you!

