Re: [eigen] benchmarking weirdness



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Hi,

I didn't look at the code recently so I can't explain the measured
results. But I can say a few things about benchmarking/measuring itself :)

In German we've got a saying: "Wer misst, misst Mist"
(roughly: "Whoever measures, measures crap")

The idea is basically that you must reduce every possible source of error
and take care that the remaining error won't hit you badly. In our case I'd
say that you should kill all processes that you don't need (especially the X
server!), make sure that all drives are synced, run the benchmark
a few times so that it's in the disk cache, and *then* run the benchmark
many times and record the timings. In the end throw most of the
measured data away: keep only the values close to the median and take
their mean.
This becomes especially necessary once we start to compare our speed to
the major players like the Intel Math Kernel Library (free for Linux),
which contains an extremely optimized BLAS and LAPACK, or to ATLAS or any
other library.
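
The measuring procedure above could be sketched in C++ roughly like this.
It is only a sketch under assumptions: the names time_once, trimmed_mean
and benchmark, and the defaults for keep_fraction, warmup and runs, are
made up for illustration and are not part of Eigen or of benchmark_suite.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Time a single call of f in seconds, using a monotonic clock.
template <typename F>
double time_once(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Sort the samples, keep only the middle keep_fraction of them
// (the values closest to the median by rank) and return their mean.
// keep_fraction is a hypothetical tuning knob, not a recommended value.
double trimmed_mean(std::vector<double> samples, double keep_fraction = 0.5) {
    if (samples.empty()) return 0.0;
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    std::size_t keep = static_cast<std::size_t>(n * keep_fraction);
    keep = std::min(std::max<std::size_t>(keep, 1), n);  // clamp to [1, n]
    const std::size_t first = (n - keep) / 2;            // centre the window
    const double sum = std::accumulate(samples.begin() + first,
                                       samples.begin() + first + keep, 0.0);
    return sum / static_cast<double>(keep);
}

// Run f a few times unmeasured (warm-up), then measure it `runs` times
// and reduce the samples with trimmed_mean.
template <typename F>
double benchmark(F&& f, int warmup = 3, int runs = 50) {
    for (int i = 0; i < warmup; ++i) f();
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) samples.push_back(time_once(f));
    return trimmed_mean(samples);
}
```

A call like benchmark([&] { /* code under test */ }) then gives one number
per configuration. Killing other processes and syncing the drives still has
to happen outside the program.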

As long as we are only developing and just need a trend, it can be a bit
more relaxed. But you should get a good performance monitor, which
basically means Intel VTune:
  http://www.intel.com/cd/software/products/asmo-na/eng/vtune/239145.htm
(free for Linux -> "Free Non-Commercial Download")

There you can find out *why* one piece of code is slower than the other.
Pay special attention to cache hits and misses: they are the most likely
cause of the described behaviour (once you know that the compiler's
machine code generation is optimal; VTune also shows you the assembly, so
you can check that at the same time).
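
To illustrate how traversal order and storage order interact with the
cache, here is a minimal sketch. ColMajorMatrix and the two sum functions
are hypothetical and have nothing to do with Eigen's actual classes; they
only show the two access patterns on a column-major layout.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical column-major matrix: element (r, c) lives at data[c * rows + r].
struct ColMajorMatrix {
    std::size_t rows, cols;
    std::vector<double> data;
    ColMajorMatrix(std::size_t r, std::size_t c, double value)
        : rows(r), cols(c), data(r * c, value) {}
    double operator()(std::size_t r, std::size_t c) const {
        return data[c * rows + r];
    }
};

// Inner loop walks down a column: consecutive memory addresses.
double sum_column_by_column(const ColMajorMatrix& m) {
    double s = 0.0;
    for (std::size_t c = 0; c < m.cols; ++c)
        for (std::size_t r = 0; r < m.rows; ++r)
            s += m(r, c);
    return s;
}

// Inner loop walks along a row: each step jumps `rows` elements in memory.
double sum_row_by_row(const ColMajorMatrix& m) {
    double s = 0.0;
    for (std::size_t r = 0; r < m.rows; ++r)
        for (std::size_t c = 0; c < m.cols; ++c)
            s += m(r, c);
    return s;
}
```

Both functions visit the same elements and differ only in memory access
pattern. A 20x20 matrix of doubles is only 3200 bytes and will usually fit
in the L1 cache, so the strided version may well cost little at that size;
whether cache misses actually explain the measured numbers is something
only the profiler can tell.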

So get VTune and analyze the results. Once it's clear why one result
is slower than the other, we can try to optimize it.

CU,
Christian

Benoît Jacob wrote:
> Hi List
> 
> A lot of progress has happened since alpha1 -- much more than I expected to 
> remain to be done. I'll write more about this later, but now I would like to 
> discuss benchmarking.
> 
> We now have two benchmarks in doc/ : benchmark.cpp is our traditional 
> benchmark on 3x3 fixed-size matrices, and benchmarkX.cpp is a 20x20 dynamic 
> size variant.
> 
> There is also a script, benchmark_suite, running these benchmarks several 
> times with various compile options:
> *with and without -DNDEBUG (disabling asserts)
> *with matrix storage order set to RowMajor and ColumnMajor
> 
> I should insist on the fact that the matrix storage order influences not only 
> the storage of coefficients, but also the traversal order when e.g. copying 
> matrices. Expressions are recursively aware of the preferred traversal order.
> 
> The reason why I'm writing this is that this benchmark_suite gives me some 
> very unexpected results:
> 
> gaston@kiwi:~/cuisine/branches/work/eigen2/doc$ g++ --version
> g++ (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4)
> gaston@kiwi:~/cuisine/branches/work/eigen2/doc$ ./benchmark_suite
> Fixed size 3x3, ColumnMajor, -DNDEBUG
> 
> real    0m19.942s
> user    0m19.893s
> sys     0m0.024s
> Fixed size 3x3, ColumnMajor, with asserts
> 
> real    0m32.434s
> user    0m32.406s
> sys     0m0.008s
> Fixed size 3x3, RowMajor, -DNDEBUG
> 
> real    0m21.497s
> user    0m21.497s
> sys     0m0.000s
> Fixed size 3x3, RowMajor, with asserts
> 
> real    0m32.133s
> user    0m32.122s
> sys     0m0.012s
> Dynamic size 20x20, ColumnMajor, -DNDEBUG
> 
> real    0m33.014s
> user    0m33.006s
> sys     0m0.000s
> Dynamic size 20x20, ColumnMajor, with asserts
> 
> real    0m27.599s
> user    0m27.554s
> sys     0m0.024s
> Dynamic size 20x20, RowMajor, -DNDEBUG
> 
> real    0m28.343s
> user    0m28.342s
> sys     0m0.000s
> Dynamic size 20x20, RowMajor, with asserts
> 
> real    0m26.597s
> user    0m26.562s
> sys     0m0.012s
> 
> We see two strange things here, which I can't explain.
> 
> First, with dynamic-size 20x20, disabling asserts (defining NDEBUG) REDUCES
> speed! What's going on?
>
> Second, the storage order has a non-negligible impact. More precisely, with
> 3x3 fixed-size, ColumnMajor is almost 10% faster than RowMajor, while with
> 20x20 dynamic-size, RowMajor is faster than ColumnMajor! Also, how to explain
> the fact that RowMajor suffers less than ColumnMajor from the slowdown
> induced by defining NDEBUG?
> 
> All this is in SVN so please help me!
> 
> Cheers,
> Benoit

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHf3iFoWM1JLkHou0RCDnOAJ9RIJWOwYlWesSGSdkivZZbIeIrRwCfWdtg
HRqF47JVRScvTxHs3O3Xvlk=
=uqSo
-----END PGP SIGNATURE-----


