On Saturday 05 January 2008 21:51:27 Benoît Jacob wrote:
> Thanks Gael for benchmarking on your CPU. Indeed it is much closer to what
> I expected. FWIW, my CPU is a Core 1 duo at 1.66 GHz.
>
> I ran the benchmark_suite several times with various small modifications in
> Eigen. Results in attached spreadsheet. The most interesting thing is the
> last column: here we see that there is no benefit in adapting the traversal
> order to the storage order; instead, we get better results by always
> traversing in column-major order even for row-major matrices. So I'll
> revert that part of today's changes.
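
To make the two traversal orders concrete, here is a minimal sketch. It is not Eigen code: colMajorIndex, sumColumnFirst and sumRowFirst are hypothetical names, and the matrix is assumed to live in a flat column-major buffer. Both loops visit every coefficient exactly once; only the order of memory accesses differs, so which one is faster on a given CPU has to be measured.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Eigen): position of coefficient
// (row, col) in a flat buffer holding a column-major matrix.
inline std::size_t colMajorIndex(std::size_t row, std::size_t col,
                                 std::size_t rows)
{
    return col * rows + row;
}

// Column-major traversal: the inner loop runs over rows, so on a
// column-major buffer consecutive iterations touch adjacent memory.
double sumColumnFirst(const std::vector<double>& m,
                      std::size_t rows, std::size_t cols)
{
    double s = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            s += m[colMajorIndex(r, c, rows)];
    return s;
}

// Row-major traversal of the same buffer: the inner loop runs over
// columns, so consecutive iterations are `rows` elements apart.
double sumRowFirst(const std::vector<double>& m,
                   std::size_t rows, std::size_t cols)
{
    double s = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            s += m[colMajorIndex(r, c, rows)];
    return s;
}
```

Timing both functions on matrices of the sizes used in the benchmark is one way to reproduce the comparison shown in the last column of the spreadsheet.
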
>
> Regarding the slowdown from disabling asserts, I also suspect instruction
> cache misses. Indeed, disabling asserts means that more functions can get
> inlined, which is harder on the instruction cache. That's why I tested
> running the benchmark without using always_inline at certain places. See
> results in columns C,D,E. The result is that indeed the slowdown is reduced
> with column-major order, but not with row-major order. Meanwhile, the
> always_inline attributes are really beneficial for fixed-size matrices. More
> importantly, the functions that I always_inline are trivial (one line, they
> are of the form "return Constructor();") so it seems strange to me that the
> compiler would not inline them.
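
For readers who have not used it, the following sketch shows what forcing inlining of such a one-line function can look like. The macro name SKETCH_ALWAYS_INLINE and the Scaled type are hypothetical and are not taken from Eigen; the only real ingredient is GCC's always_inline function attribute, with a fallback to plain inline on other compilers.

```cpp
// Hypothetical macro (the name is an assumption). With GCC the attribute
// asks for the function to be inlined even where the optimizer's own
// heuristics would decline; elsewhere it degrades to a plain hint.
#if defined(__GNUC__)
#define SKETCH_ALWAYS_INLINE inline __attribute__((always_inline))
#else
#define SKETCH_ALWAYS_INLINE inline
#endif

// Hypothetical lightweight expression object standing in for whatever
// the real one-line functions construct.
struct Scaled
{
    double value;
    double factor;
    Scaled(double v, double f) : value(v), factor(f) {}
    double eval() const { return value * factor; }
};

// A trivial function of the form "return Constructor();".
SKETCH_ALWAYS_INLINE Scaled scale(double v, double f)
{
    return Scaled(v, f);
}
```

Building the benchmark once with the attribute and once with the plain inline fallback is the kind of comparison reported in columns C, D and E.
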
>
> I would indeed like to make Eigen ICC-compatible; the error that you
> describe might be fixed by moving _RowsAtCompileTime from class Derived to
> class ForwardDecl<Derived> (see in Util.h) or some similar trick. Help is
> welcome here as I am very short on time and have not yet installed ICC.
>
> Note that the standard workaround (such as used in TVMET) is to use an enum
> instead of static const int, but this is not really convenient here as
> Dynamic is set to -1 and C++ enums are not guaranteed to be signed. It is
> still possible to set Dynamic to some very large positive value, and go for
> enums, but then there are throughout the code some "Size>0" conditions that
> would have to be changed to "Size>0 && Size<Dynamic".
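
A rough sketch of the two options, to show where the conditions would change. All names below (SizesAsStaticConst, SizesAsEnum, DynamicAsLargeValue) are hypothetical and the value 1 << 30 is an arbitrary assumption for "some very large positive value"; this is not how Eigen declares these constants.

```cpp
// Option 1: static const int members, with Dynamic set to -1 as today.
const int Dynamic = -1;

template<int Rows, int Cols>
struct SizesAsStaticConst
{
    static const int RowsAtCompileTime = Rows;
    static const int ColsAtCompileTime = Cols;
    // With a negative Dynamic, "Size > 0" is enough to detect fixed sizes.
    static const bool IsFixedSize = (Rows > 0 && Cols > 0);
};

// Option 2: enum members, with Dynamic replaced by a large positive
// value (hypothetical constant) so that no enumerator is negative.
const int DynamicAsLargeValue = 1 << 30;

template<int Rows, int Cols>
struct SizesAsEnum
{
    enum
    {
        RowsAtCompileTime = Rows,
        ColsAtCompileTime = Cols,
        // Every "Size > 0" test needs the extra upper bound.
        IsFixedSize = (Rows > 0 && Rows < DynamicAsLargeValue
                    && Cols > 0 && Cols < DynamicAsLargeValue)
    };
};
```

The second variant avoids static data members altogether, at the cost of touching every size comparison in the code base.
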
>
> Cheers,
>
> Benoit
>
> On Saturday 05 January 2008 17:43:46 Christian Mayer wrote:
> > Gael Guennebaud wrote:
> > > Also, seeing the benchmark code, I don't think that any cache miss
> > > occurs since you only have two matrices.
> >
> > "Think" is a very bad guide when it comes to performance optimization.
> > Only real measurements count, as it's far too easy to make things
> > worse by over-optimizing (especially true when it comes to manual loop
> > unrolling).
> >
> > You must watch out not only for cache misses on the data but also for
> > cache misses on the instructions (that's where loop unrolling can
> > really bite you).
> >
> > You also must look at register usage, which can be thought of as a
> > "level 0" cache. Especially changing between row and column major can
> > make a huge difference here.
> >
> > And at the end (IIRC VTune can also tell you that) a huge performance
> > difference can be achieved by optimization of the branch prediction.
> >
> > > I would also suggest benchmarking with different compilers; the
> > > results might be very different. However, eigen2 is currently not
> > > compatible with ICC.
> >
> > That's sad and we should fix it ASAP. ICC is a very good compiler when
> > it comes to optimal performance. It's also quite good at
> > auto-vectorisation, which is crucial for SSE usage (unless you are
> > doing it by hand with intrinsics).
> >
> > CU,
> > Christian