Re: [eigen] benchmarking weirdness

On Jan 5, 2008 10:15 PM, Benoît Jacob < jacob@xxxxxxxxxxxxxxx> wrote:

Regarding intel's binary stuff:
I tried downloading vtune, was very annoyed to have to register, had a good
laugh when I saw that the .tar.gz file to download weighs 500 MB, started the
download, which took a long time (because their server serves 100 KB/s), only to
notice now that the download failed at around 90% and I'm left with an
incomplete file.

I don't want to seem unwilling, but I'm afraid that this is all the time that
I can spend on it right now.

That, plus the obvious conflict of interest when a CPU manufacturer provides
the profiling/optimizing software... moreover in binary-only form... sounds
very fishy to me.

Feel free to profile using vtune and port to icc (the enums trick should work,
it's actually an old trick used for old compilers, which suggests that icc's
c++ frontend is not very modern). I am of course very interested in vtune
results and in icc support. Only I can't do everything myself :)
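
For anyone who wants to try the port, here is a rough sketch of what the enum
variant could look like. It is only an illustration: SizeTraits, IsFixedSize
and the value 10000 for Dynamic are made-up names and numbers, not what is in
Eigen today. It assumes Dynamic becomes a large positive value, so that it fits
in an enum whatever signedness the compiler picks, and it shows why a plain
"Size>0" test then has to become "Size>0 && Size<Dynamic".

```cpp
// Hypothetical sentinel meaning "size known only at run time".
// Assumption: a large positive value replaces -1 so it is safe inside an enum.
const int Dynamic = 10000;

// Compile-time sizes kept in an enum instead of "static const int" members,
// which is the form that older C++ front ends tend to accept.
template <int Rows, int Cols>
struct SizeTraits {
    enum {
        RowsAtCompileTime = Rows,
        ColsAtCompileTime = Cols,
        // The total size is dynamic as soon as one dimension is dynamic.
        SizeAtCompileTime =
            (Rows == Dynamic || Cols == Dynamic) ? Dynamic : Rows * Cols
    };
};

// With a positive sentinel, "Size > 0" alone no longer means "fixed size",
// so the upper bound has to be checked as well.
template <int Size>
struct IsFixedSize {
    enum { value = (Size > 0 && Size < Dynamic) ? 1 : 0 };
};
```

With this, IsFixedSize<SizeTraits<3, 4>::SizeAtCompileTime>::value is 1, while
the same test on SizeTraits<Dynamic, 4> gives 0.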

Cheers,

Benoit

On Saturday 05 January 2008 21:51:27 Benoît Jacob wrote:
> Thanks Gael for benchmarking on your CPU. Indeed it is much closer to what
> I expected. FWIW, my CPU is a Core 1 duo at 1.66 GHz.
>
> I ran the benchmark_suite several times with various small modifications in
> Eigen. Results in attached spreadsheet. The most interesting thing is the
> last column: here we see that there is no benefit in adapting the traversal
> order to the storage order; instead, we get better results by always
> traversing in column-major order even for row-major matrices. So I'll
> revert that part of today's changes.
>
> Regarding the slowdown of disabling asserts, I also suspect an instruction
> cache miss. Indeed, disabling asserts means that more functions can get
> inlined, which is harder on the instruction cache. That's why I tested
> running the benchmark without using always_inline at certain places. See
> results in columns C,D,E. The result is that indeed the slowdown is reduced
> with column-major order, but not with row-major order. Meanwhile, the
> > always_inline attributes are really beneficial for fixed-size matrices. More
> importantly, the functions that I always_inline are trivial (one line, they
> are of the form "return Constructor();") so it seems strange to me that the
> compiler would not inline them.
>
> I would indeed like to make Eigen ICC-compatible; the error that you
> describe might be fixed by moving _RowsAtCompileTime from class Derived to
> class ForwardDecl<Derived> (see in Util.h) or some similar trick. Help is
> welcome here as I am very short on time and have not yet installed ICC.
>
> Note that the standard workaround (such as used in TVMET) is to use an enum
> instead of static const int, but this is not really convenient here as
> Dynamic is set to -1 and C++ enums are not guaranteed to be signed. It is
> still possible to set Dynamic to some very large positive value, and go for
> enums, but then there are throughout the code some "Size>0" conditions that
> would have to be changed to "Size>0 && Size<Dynamic".
>
> Cheers,
>
> Benoit
>
> On Saturday 05 January 2008 17:43:46 Christian Mayer wrote:
> > > Gael Guennebaud wrote:
> > > Also, seeing the benchmark code, I don't think that any cache miss
> > > occurs since you only have two matrices.
> >
> > "Think" is a very bad guide when it comes to performance optimization.
> > > Only real measurements can count as it's far too easy to make things
> > worse by over optimizing (especially true when it comes to manual loop
> > unrolling).
> >
> > > You must not only take care of cache misses for the data but also
> > > of cache misses for the instructions (that's where loop unrolling can
> > really bite you).
> >
> > > You also must have a look at register usage, which can be thought of as a
> > "level 0" cache. Especially changing between row and column major can
> > make a huge difference here.
> >
> > > And finally (IIRC VTune can also tell you that) a huge performance
> > > difference can be achieved by optimizing the branch prediction.
> >
> > > I would also suggest to bench with different compilers, the results
> > > > might be very different. However, eigen2 is currently not compatible
> > > with ICC.
> >
> > That's sad and we should fix it ASAP. ICC is a very good compiler when
> > > it comes to optimal performance. It's also quite good at
> > > auto-vectorisation, which is crucial for SSE usage (unless you are doing
> > > it by hand with intrinsics).
> >
> > CU,
> > Christian
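
The "measure, don't guess" advice and the row-major versus column-major
question discussed in this thread can be tried with a small stand-alone sketch
like the one below. It is not the benchmark_suite from Eigen: sumMatrix and
timeTraversal are hypothetical helpers, the matrix is a plain std::vector in
column-major layout, and a C++11 compiler is assumed for <chrono>. Timings will
depend on the CPU, the compiler and its flags, so no expected numbers are given.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Sum a rows x cols matrix stored in column-major order, i.e. element (i, j)
// lives at index j * rows + i. If columnOuter is true, the inner loop walks
// down a column (contiguous memory); otherwise it walks along a row, jumping
// `rows` elements at each step.
double sumMatrix(const std::vector<double>& data, std::size_t rows,
                 std::size_t cols, bool columnOuter) {
    double sum = 0.0;
    if (columnOuter) {
        for (std::size_t j = 0; j < cols; ++j)
            for (std::size_t i = 0; i < rows; ++i)
                sum += data[j * rows + i];
    } else {
        for (std::size_t i = 0; i < rows; ++i)
            for (std::size_t j = 0; j < cols; ++j)
                sum += data[j * rows + i];
    }
    return sum;
}

// Wall-clock seconds spent on `repeats` traversals. The sums are accumulated
// into `sink` so that the compiler cannot discard the work as unused.
double timeTraversal(const std::vector<double>& data, std::size_t rows,
                     std::size_t cols, bool columnOuter, int repeats,
                     double& sink) {
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        sink += sumMatrix(data, rows, cols, columnOuter);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling timeTraversal twice on the same data, once with columnOuter set to true
and once with false, and repeating that for several matrix sizes and compilers,
gives a first measurement to compare against expectations.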