Re: [eigen] benchmarking weirdness


Thanks, Gael, for benchmarking on your CPU. Indeed, the results are much closer
to what I expected. FWIW, my CPU is a (first-generation) Core Duo at 1.66 GHz.

I ran the benchmark_suite several times with various small modifications in 
Eigen. The results are in the attached spreadsheet. The most interesting part is 
the last column: it shows that there is no benefit in adapting the traversal 
order to the storage order. Instead, we get better results by always 
traversing in column-major order, even for row-major matrices. So I'll revert 
that part of today's changes.
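
To make the two terms concrete, here is a minimal sketch, assuming a plain dense matrix stored in a flat array (not Eigen's actual classes), of the two traversal orders being compared. In a row-major matrix, element (i, j) lives at index i*cols + j, so a column-major loop jumps `cols` elements between consecutive accesses, while a loop adapted to the storage order walks memory contiguously. Both visit every element exactly once; only the access pattern differs. The names `RowMajorMatrix`, `sumStorageOrderTraversal` and `sumColumnMajorTraversal` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major dense matrix: element (i, j) is stored at i * cols + j.
struct RowMajorMatrix {
    std::size_t rows, cols;
    std::vector<double> data;

    RowMajorMatrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}

    // Bounds-checked accessors (checks vanish when NDEBUG is defined).
    double& operator()(std::size_t i, std::size_t j) {
        assert(i < rows && j < cols);
        return data[i * cols + j];
    }
    double operator()(std::size_t i, std::size_t j) const {
        assert(i < rows && j < cols);
        return data[i * cols + j];
    }
};

// Traversal adapted to the storage order: the inner loop runs along a row,
// so consecutive accesses are adjacent in memory.
double sumStorageOrderTraversal(const RowMajorMatrix& m) {
    double s = 0.0;
    for (std::size_t i = 0; i < m.rows; ++i)
        for (std::size_t j = 0; j < m.cols; ++j)
            s += m(i, j);
    return s;
}

// Column-major traversal regardless of storage: the inner loop runs down a
// column, so consecutive accesses are `cols` elements apart in memory.
double sumColumnMajorTraversal(const RowMajorMatrix& m) {
    double s = 0.0;
    for (std::size_t j = 0; j < m.cols; ++j)
        for (std::size_t i = 0; i < m.rows; ++i)
            s += m(i, j);
    return s;
}
```

Which of the two loops is faster is exactly what the spreadsheet measures; the sketch itself makes no performance claim.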

Regarding the slowdown caused by disabling asserts, I also suspect instruction 
cache misses. Indeed, disabling asserts means that more functions can get 
inlined, which puts more pressure on the instruction cache. That's why I also 
ran the benchmark without always_inline at certain places; see the results in 
columns C, D, E. The result is that the slowdown is indeed reduced with 
column-major order, but not with row-major order. Meanwhile, the always_inline 
attributes are really beneficial for fixed-size matrices. More importantly, the 
functions that I mark always_inline are trivial (one line, of the form 
"return Constructor();"), so it seems strange to me that the compiler would not 
inline them on its own.
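
For readers unfamiliar with the attribute, a minimal sketch of the pattern described above: a one-line function of the form "return Constructor();" forced inline with GCC's `__attribute__((always_inline))`. The macro `MY_ALWAYS_INLINE` and the types `Matrix2` and `Transpose` are hypothetical stand-ins, not Eigen's actual code; on compilers other than GCC/Clang the macro falls back to plain `inline`, which is only a hint.

```cpp
#include <cassert>

// Hypothetical macro: force inlining on GCC/Clang, plain `inline` elsewhere.
#if defined(__GNUC__)
#define MY_ALWAYS_INLINE inline __attribute__((always_inline))
#else
#define MY_ALWAYS_INLINE inline
#endif

// Hypothetical 2x2 matrix, column-major: (i, j) is stored at a[i + 2 * j].
struct Matrix2 {
    double a[4];
    double coeff(int i, int j) const {
        assert(i >= 0 && i < 2 && j >= 0 && j < 2);
        return a[i + 2 * j];
    }
};

// Lightweight expression object: holds a reference, computes on access.
class Transpose {
public:
    explicit Transpose(const Matrix2& m) : m_(m) {}
    double coeff(int i, int j) const { return m_.coeff(j, i); }
private:
    const Matrix2& m_;
};

// Trivial one-liner of the form "return Constructor();". Without the
// attribute, the compiler is free to decide not to inline this call.
MY_ALWAYS_INLINE Transpose transpose(const Matrix2& m) { return Transpose(m); }
```

Note that in this sketch the `assert` in `coeff` also adds code to every inlined call site when asserts are enabled, which is one way assert settings can change how much gets inlined.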

I would indeed like to make Eigen ICC-compatible. The error that you describe 
might be fixed by moving _RowsAtCompileTime from class Derived to class 
ForwardDecl<Derived> (see Util.h) or by some similar trick. Help is welcome 
here, as I am very short on time and have not yet installed ICC.
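
A minimal sketch of that idea, assuming the ICC error comes from the base class reading a compile-time constant out of `Derived` while `Derived` is still an incomplete type (the usual pitfall with this curiously recurring template pattern). The constant is instead placed in a separately specialized `ForwardDecl<Derived>`. Apart from the name `ForwardDecl`, everything here (`MatrixBase`, `Matrix3f`, `RowsAtCompileTime` without the leading underscore) is a hypothetical illustration, not Eigen's code.

```cpp
// Primary template is only declared; each concrete type specializes it.
template <typename Derived> struct ForwardDecl;

// CRTP base. Writing Derived::RowsAtCompileTime here would fail, because
// Derived is still incomplete while its base class is being instantiated.
// ForwardDecl<Derived> is already complete at that point, so it is safe.
template <typename Derived>
class MatrixBase {
public:
    static const int RowsAtCompileTime = ForwardDecl<Derived>::RowsAtCompileTime;
    int rows() const { return RowsAtCompileTime; }
};

class Matrix3f; // forward declaration so the specialization can name it

// The compile-time constant lives here instead of inside Matrix3f.
template <> struct ForwardDecl<Matrix3f> {
    static const int RowsAtCompileTime = 3;
};

class Matrix3f : public MatrixBase<Matrix3f> {
    // coefficient storage omitted in this sketch
};
```

Whether this particular rearrangement is what ICC needs is untested here; it only shows the general shape of the trick.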

Note that the standard workaround (as used in TVMET) is to use an enum 
instead of a static const int, but this is not really convenient here, as 
Dynamic is set to -1 and C++ enums are not guaranteed to be signed. It is 
still possible to set Dynamic to some very large positive value and go for 
enums, but then the "Size>0" conditions found throughout the code would have 
to be changed to "Size>0 && Size<Dynamic".
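
A minimal sketch of the enum-based alternative, assuming `Dynamic` is given a large positive value (10000 here is an arbitrary, hypothetical choice) so that every enumerator is non-negative. It shows why the extra condition is needed: once `Dynamic` is positive, "Size>0" alone no longer distinguishes a fixed size from a dynamic one. `IsFixedSize` and `MatrixTraits` are hypothetical helpers, not Eigen's code.

```cpp
// Hypothetical sentinel: a large positive value instead of -1, so that no
// enumerator is negative and the signedness question does not arise.
enum { Dynamic = 10000 };

// Before: with Dynamic == -1, "Size > 0" was enough to detect a fixed size.
// After: Dynamic is itself > 0, so it has to be excluded explicitly.
template <int Size>
struct IsFixedSize {
    enum { ret = Size > 0 && Size < Dynamic };
};

// Compile-time sizes carried as enums rather than static const int
// (the TVMET-style workaround).
template <int Rows, int Cols>
struct MatrixTraits {
    enum {
        RowsAtCompileTime = Rows,
        ColsAtCompileTime = Cols,
        SizeAtCompileTime = (Rows == Dynamic || Cols == Dynamic) ? Dynamic : Rows * Cols
    };
};
```

The cost is the one described above: every size test in the code base must remember the upper bound.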



On Saturday 05 January 2008 17:43:46 Christian Mayer wrote:
> Gael Guennebaud schrieb:
> > Also, seeing the benchmark code, I don't think that any cache miss
> > occurs since you only have two matrices.
> "Think" is a very bad guide when it comes to performance optimization.
> Only real measurements can count, as it's far too easy to make things
> worse by over optimizing (especially true when it comes to manual loop
> unrolling).
> You must not only take care of cache misses for the data but also
> about cache misses for the instructions (that's where loop unrolling can
> really bite you).
> You also must have a look at register usage, which can be thought of as a
> "level 0" cache. Especially changing between row and column major can
> make a huge difference here.
> And at the end (IIRC VTune can also tell you that) a huge performance
> difference can be achieved by optimization of the branch prediction.
> > I would also suggest to bench with different compilers, the results
> > might be very different. However, eigen2 is currently not compatible
> > with ICC.
> That's sad and we should fix it ASAP. ICC is a very good compiler when
> it comes to optimal performance. It's also quite good at auto
> vectorisation, which is crucial for SSE usage (unless you are doing it
> by hand with intrinsics).
> CU,
> Christian

Attachment: eigen2chart.ods
Description: application/vnd.oasis.opendocument.spreadsheet
