Re: [eigen] benchmarking weirdness

On Jan 5, 2008 1:57 PM, Benoît Jacob <jacob@xxxxxxxxxxxxxxx> wrote:

Thanks for your reply. I have repeated these tests enough times to be sure that
the performance differences are real, not just noise.
I will try to get vtune and learn to use it. I am concerned however about the
non-freeness; if it's only for the cache miss estimations, have you tried
cachegrind?
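
A minimal C++ sketch of the "run it many times, keep only the values close to
the median, then take the mean" procedure from the quoted message below. The
helper names (time_once, median, mean_near_median, benchmark) and the relative
tolerance parameter are placeholders made up for this illustration; none of
this is part of Eigen or of the benchmark_suite script.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Time a single call of `work`, in seconds, using a monotonic clock.
template <typename Work>
double time_once(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Median of a non-empty sample (takes a copy because it sorts).
double median(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    return (n % 2 == 1) ? samples[n / 2]
                        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

// Mean of the samples whose relative distance to the median is at most
// `tolerance`; everything further away is treated as noise and discarded.
double mean_near_median(const std::vector<double>& samples, double tolerance) {
    const double med = median(samples);
    double sum = 0.0;
    std::size_t kept = 0;
    for (double s : samples) {
        if (std::abs(s - med) <= tolerance * std::abs(med)) {
            sum += s;
            ++kept;
        }
    }
    // If nothing survives the filter, fall back to the median itself.
    return kept > 0 ? sum / static_cast<double>(kept) : med;
}

// Run `work` a few times untimed (warm-up), then time it `timed_runs` times
// and return the filtered mean of the timings.
template <typename Work>
double benchmark(Work&& work, int warmup_runs, int timed_runs, double tolerance) {
    for (int i = 0; i < warmup_runs; ++i) work();
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(timed_runs));
    for (int i = 0; i < timed_runs; ++i) samples.push_back(time_once(work));
    return mean_near_median(samples, tolerance);
}
```

For example, benchmark(f, 3, 50, 0.05) would call f three times untimed, time
it 50 more times, and average only the timings within 5% of their median. The
5% threshold is an arbitrary choice for the example.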

Good idea about the cache misses, although I would like to understand why a
specific matrix storage/traversal order is better than another one wrt cache
misses.
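
To make the storage/traversal question concrete, here is a small C++ sketch of
how the two orders interact. It is only an illustration under my own
assumptions: the StorageOrder enum and the functions offset, inner_stride and
copy_coefficients are hypothetical names, not Eigen's actual implementation.
The point is that when the traversal order matches the storage order, the
inner loop touches consecutive memory locations (stride 1); when they differ,
each inner step jumps by a whole row or column.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class StorageOrder { RowMajor, ColumnMajor };

// Position of coefficient (row, col) inside the flat coefficient buffer.
inline std::size_t offset(StorageOrder storage, std::size_t rows, std::size_t cols,
                          std::size_t row, std::size_t col) {
    assert(row < rows && col < cols);
    return storage == StorageOrder::RowMajor ? row * cols + col
                                             : col * rows + row;
}

// Distance in the buffer between two consecutive inner-loop accesses.
// A RowMajor traversal has the column index in the inner loop, a
// ColumnMajor traversal has the row index in the inner loop.
inline std::size_t inner_stride(StorageOrder storage, StorageOrder traversal,
                                std::size_t rows, std::size_t cols) {
    if (storage == traversal) return 1;  // contiguous access
    return storage == StorageOrder::RowMajor ? cols : rows;
}

// Copy src into dst coefficient by coefficient, visiting the coefficients
// in the requested traversal order. Both buffers use the same storage order.
void copy_coefficients(StorageOrder storage, StorageOrder traversal,
                       std::size_t rows, std::size_t cols,
                       const std::vector<double>& src, std::vector<double>& dst) {
    assert(src.size() == rows * cols && dst.size() == rows * cols);
    if (traversal == StorageOrder::RowMajor) {
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = 0; c < cols; ++c)
                dst[offset(storage, rows, cols, r, c)] =
                    src[offset(storage, rows, cols, r, c)];
    } else {
        for (std::size_t c = 0; c < cols; ++c)
            for (std::size_t r = 0; r < rows; ++r)
                dst[offset(storage, rows, cols, r, c)] =
                    src[offset(storage, rows, cols, r, c)];
    }
}
```

Either traversal produces the same result; only the memory access pattern
changes. Note that a 20x20 matrix of doubles is only 3200 bytes, so this
sketch by itself does not tell us whether cache misses are what we are seeing.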

Also, I still have no idea why disabling asserts hurts performance.

Cheers,

Benoit

On Saturday 05 January 2008 13:31:01 Christian Mayer wrote:
> Hi,
>
> I didn't look at the code recently so I can't explain the measured
> results. But I can say a few things about benchmarking/measuring itself :)
>
> In German we've got a saying: "Wer misst, misst Mist"
> (Once you are measuring you'll only get crap...)
>
> The idea is basically that you must reduce any possible errors and take
> care that the remaining error won't hit you badly... In our case I'd say
> that you should kill all processes that you don't need (especially the X
> server!), make sure that all drives are synchronized, run the benchmark
> a few times, so that it's in the HDD cache and *then* run the benchmark
> many times and record the timings. In the end throw most of the
> measured data away, only keep the values close to the median and take
> the mean.
> This becomes especially necessary once we start to compare the speed to
> the major players like the Intel Math Kernel Library (free for Linux)
> that contains an extremely optimized BLAS and LAPACK or to ATLAS or any
> other library.
>
> As long as we are only developing and just need a trend, it can be a bit
> more relaxed. But you should get a good performance monitor, which
> basically means Intel VTune:
> http://www.intel.com/cd/software/products/asmo-na/eng/vtune/239145.htm
> (free for Linux -> "Free Non-Commercial Download")
>
> There you find out *why* one code is slower than the other. Take special
> care about cache hits and misses - they are most likely to cause the
> described behaviour (once you know that the compiler's machine code
> generation is optimal - but VTune also shows you the assembly, so you
> can check it at the same time)
>
> So get VTune and analyze the results. Once it's clear why one result
> is slower than the other we can try to optimize it.
>
> CU,
> Christian
>
> Benoît Jacob wrote:
> > Hi List
> >
> > A lot of progress has happened since alpha1 -- much more than I expected
> > was left to do. I'll write more about this later, but now I would
> > like to discuss benchmarking.
> >
> > We now have two benchmarks in doc/ : benchmark.cpp is our traditional
> > benchmark on 3x3 fixed-size matrices, and benchmarkX.cpp is a 20x20
> > dynamic size variant.
> >
> > There is also a script, benchmark_suite, running these benchmarks several
> > times with various compile options:
> > * with and without -DNDEBUG (disabling asserts)
> > * with matrix storage order set to RowMajor and ColumnMajor
> >
> > I should stress that the matrix storage order influences not
> > only the storage of coefficients, but also the traversal order when e.g.
> > copying matrices. Expressions are recursively aware of the preferred
> > traversal order.
> >
> > The reason why I'm writing this is that this benchmark_suite gives me
> > some very unexpected results:
> >
> > gaston@kiwi:~/cuisine/branches/work/eigen2/doc$ g++ --version
> > g++ (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4)
> > gaston@kiwi:~/cuisine/branches/work/eigen2/doc$ ./benchmark_suite
> > Fixed size 3x3, ColumnMajor, -DNDEBUG
> >
> > real 0m19.942s
> > user 0m19.893s
> > sys 0m0.024s
> > Fixed size 3x3, ColumnMajor, with asserts
> >
> > real 0m32.434s
> > user 0m32.406s
> > sys 0m0.008s
> > Fixed size 3x3, RowMajor, -DNDEBUG
> >
> > real 0m21.497s
> > user 0m21.497s
> > sys 0m0.000s
> > Fixed size 3x3, RowMajor, with asserts
> >
> > real 0m32.133s
> > user 0m32.122s
> > sys 0m0.012s
> > Dynamic size 20x20, ColumnMajor, -DNDEBUG
> >
> > real 0m33.014s
> > user 0m33.006s
> > sys 0m0.000s
> > Dynamic size 20x20, ColumnMajor, with asserts
> >
> > real 0m27.599s
> > user 0m27.554s
> > sys 0m0.024s
> > Dynamic size 20x20, RowMajor, -DNDEBUG
> >
> > real 0m28.343s
> > user 0m28.342s
> > sys 0m0.000s
> > Dynamic size 20x20, RowMajor, with asserts
> >
> > real 0m26.597s
> > user 0m26.562s
> > sys 0m0.012s
> >
> > We see two strange things here, which I can't explain.
> >
> > First, with dynamic-size 20x20, disabling asserts (defining NDEBUG)
> > REDUCES speed! What's going on?
> >
> > Second, the storage order has a non-negligible impact. More precisely, with
> > 3x3 fixed-size, ColumnMajor is almost 10% faster than RowMajor, while with
> > 20x20 dynamic-size, RowMajor is faster than ColumnMajor! Also, how can we
> > explain the fact that RowMajor suffers less than ColumnMajor from the
> > slowdown induced by defining NDEBUG?
> >
> > All this is in SVN so please help me!
> >
> > Cheers,
> > Benoit