Re: [eigen] benchmarking weirdness
Thanks Gael for benchmarking on your CPU. Indeed it is much closer to what I expected. FWIW, my CPU is a Core 1 duo at 1.66 GHz.

I ran the benchmark_suite several times with various small modifications in Eigen. The results are in the attached spreadsheet. The most interesting thing is the last column: there is no benefit in adapting the traversal order to the storage order; instead, we get better results by always traversing in column-major order, even for row-major matrices. So I'll revert that part of today's changes.

Regarding the slowdown caused by disabling asserts, I also suspect instruction cache misses. Indeed, disabling asserts means that more functions can get inlined, which is harder on the instruction cache. That's why I tested running the benchmark without always_inline at certain places. See the results in columns C, D and E. The slowdown is indeed reduced with column-major order, but not with row-major order. Meanwhile, the always_inline attributes are really beneficial for fixed-size matrices. More importantly, the functions that I mark always_inline are trivial (one line, of the form "return Constructor();"), so it seems strange to me that the compiler would not inline them anyway.

I would indeed like to make Eigen ICC-compatible; the error that you describe might be fixed by moving _RowsAtCompileTime from class Derived to class ForwardDecl<Derived> (see Util.h) or by some similar trick. Help is welcome here, as I am very short on time and have not yet installed ICC.

Note that the standard workaround (as used in TVMET) is to use an enum instead of a static const int, but this is not really convenient here because Dynamic is set to -1 and C++ enums are not guaranteed to be signed. It is still possible to set Dynamic to some very large positive value and go for enums, but then the "Size>0" conditions found throughout the code would have to be changed to "Size>0 && Size<Dynamic".
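To make that last trade-off concrete, here is a minimal sketch of the two options, not actual Eigen code. All names in it (DynamicSigned, DynamicLarge, IsFixedSigned, IsFixedNaive, IsFixedBounded) are made up for illustration, and 0x7fffffff is just an assumed choice for the "very large positive value".

```cpp
#include <cassert>

// Option A: signed sentinel, i.e. Dynamic set to -1 as described above.
// (Hypothetical name; stands in for a static const int member.)
const int DynamicSigned = -1;

// Option B: sentinel stored in an enum, so it has to be non-negative.
// 0x7fffffff is an assumed value chosen only for this sketch.
enum { DynamicLarge = 0x7fffffff };

// With a negative sentinel, "Size > 0" alone identifies a fixed size.
template<int Size>
struct IsFixedSigned {
    enum { value = (Size > 0) };
};

// With a large positive sentinel, the same test is wrong:
// the dynamic marker itself passes "Size > 0".
template<int Size>
struct IsFixedNaive {
    enum { value = (Size > 0) };
};

// The condition therefore needs the extra upper bound.
template<int Size>
struct IsFixedBounded {
    enum { value = (Size > 0 && Size < DynamicLarge) };
};
```

For example, IsFixedSigned<DynamicSigned>::value and IsFixedBounded<DynamicLarge>::value are both 0, while IsFixedNaive<DynamicLarge>::value is 1, which is exactly the case that every existing "Size>0" test would get wrong after switching to enums.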
Cheers,
Benoit

On Saturday 05 January 2008 17:43:46 Christian Mayer wrote:
> Gael Guennebaud wrote:
> > Also, seeing the benchmark code, I don't think that any cache miss
> > occurs since you only have two matrices.
>
> "Think" is a very bad guide when it comes to performance optimization.
> Only real measurements can count as it's far too easy to make things
> worse by over optimizing (especially true when it comes to manual loop
> unrolling).
>
> You must not only take care about cache misses for the data but also
> about cache misses for the instructions (that's where loop unrolling can
> really bite you).
>
> You also must have a look at register usage which can be thought of as a
> "level 0" cache. Especially changing between row and column major can
> make a huge difference here.
>
> And at the end (IIRC VTune can also tell you that) a huge performance
> difference can be achieved by optimization of the branch prediction.
>
> > I would also suggest to bench with different compilers, the results
> > might be very different. However, eigen2 is currently not compatible
> > with ICC.
>
> That's sad and we should fix it ASAP. ICC is a very good compiler when
> it comes to optimal performance. It's also quite good at auto
> vectorisation which is crucial for SSE usage (unless you are doing it
> by hand with intrinsics)
>
> CU,
> Christian