Re: [eigen] benchmarking weirdness



Hi,

I ran the benchmark on a Core2 with same version of GCC:

Fixed size 3x3, ColumnMajor, -DNDEBUG
real    0m11.071s
user    0m11.069s
sys    0m0.000s
 
Fixed size 3x3, ColumnMajor, with asserts
real    0m19.913s
user    0m19.885s
sys    0m0.016s
 
Fixed size 3x3, RowMajor, -DNDEBUG
real    0m11.083s
user    0m11.077s
sys    0m0.008s
 
Fixed size 3x3, RowMajor, with asserts
real    0m19.934s
user    0m19.917s
sys    0m0.016s
 
Dynamic size 20x20, ColumnMajor, -DNDEBUG
real    0m16.064s
user    0m16.057s
sys    0m0.000s
 
Dynamic size 20x20, ColumnMajor, with asserts
real    0m22.223s
user    0m22.177s
sys    0m0.036s
 
Dynamic size 20x20, RowMajor, -DNDEBUG
real    0m16.867s
user    0m16.837s
sys    0m0.028s
 
Dynamic size 20x20, RowMajor, with asserts
real    0m16.792s
user    0m16.769s
sys    0m0.016s

As you can see, here the results are much closer to what is expected. The only strange thing is the dynamic RowMajor case, where the asserts have no measurable effect.

Also, looking at the benchmark code, I don't think any cache misses occur, since you only have two matrices.

I would also suggest benchmarking with different compilers; the results might be very different. However, eigen2 is currently not compatible with ICC. The problem is that in the class MatrixBase you cannot write:
static const int RowsAtCompileTime = Derived::_RowsAtCompileTime;
because Derived is not yet known (it is still an incomplete type) at definition time. The curiously recurring template pattern works because member function bodies are not instantiated until long after their declarations. But here you access the Derived class at definition time, and ICC does not like that. Actually, I don't know how GCC manages to handle it!
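
To illustrate the construct in question, here is a stripped-down, hypothetical sketch (not the actual Eigen source):

// Hypothetical, stripped-down sketch of the CRTP construct described above
// (not the actual Eigen source).
#include <cstdio>

template <typename Derived>
struct MatrixBase
{
    // Problematic: when MatrixBase<Derived> is instantiated as a base class,
    // Derived is still an incomplete type, so a compiler may reject this:
    //
    //   static const int RowsAtCompileTime = Derived::_RowsAtCompileTime;

    // A member function body, by contrast, is only instantiated when it is
    // actually used, and by then Derived is complete:
    int rows() const { return Derived::_RowsAtCompileTime; }
};

struct Matrix3 : MatrixBase<Matrix3>
{
    static const int _RowsAtCompileTime = 3;
};

int main()
{
    Matrix3 m;
    std::printf("%d\n", m.rows()); // prints 3
    return 0;
}

A common C++ workaround for this kind of problem (not necessarily what eigen2 will end up doing) is to move such compile-time constants into a separate traits class template, specialized for each concrete matrix type, so that MatrixBase never has to look inside the still-incomplete Derived.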

Gael.


On Jan 5, 2008 1:57 PM, Benoît Jacob <jacob@xxxxxxxxxxxxxxx> wrote:
Thanks for your reply. I have repeated these tests enough times to be sure that
the performance differences are real, not just noise.
I will try to get VTune and learn to use it. I am concerned, however, about the
non-freeness; if it's only for the cache-miss estimations, have you tried
cachegrind?

Good idea about the cache misses, although I would like to understand why a
specific matrix storage/traversal order is better than another one wrt cache
misses.

Also, I still have no idea why disabling asserts hurts performance.

Cheers,

Benoit


On Saturday 05 January 2008 13:31:01 Christian Mayer wrote:
> Hi,
>
> I haven't looked at the code recently, so I can't explain the measured
> results. But I can say a few things about benchmarking/measuring itself :)
>
> In German we've got a saying: "Wer misst, misst Mist"
> (Once you are measuring you'll only get crap...)
>
> The idea is basically that you must reduce any possible errors and take
> care that the remaining error won't hit you badly... In our case I'd say
> that you should kill all processes that you don't need (especially the X
> server!), make sure that all drives are synchronized, run the benchmark
> a few times, so that it's in the HDD cache, and *then* run the benchmark
> many times and record the timings. In the end throw most of the
> measured data away, only keep the values close to the median and take
> the mean.
> This becomes especially necessary once we start to compare the speed to
> the major players like the Intel Math Kernel Library (free for Linux),
> which contains an extremely optimized BLAS and LAPACK, or to ATLAS or any
> other library.
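
As a concrete illustration of the "keep only the values close to the median, then take the mean" step described above, here is a small hypothetical C++ sketch; the 50% trim fraction is an arbitrary choice for the example:

// Hypothetical sketch of the "discard outliers, average the values closest
// to the median" procedure described above.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

double trimmedMean(std::vector<double> samples, double keepFraction)
{
    std::sort(samples.begin(), samples.end());
    // Keep only the central part of the sorted samples, i.e. the values
    // closest to the median.
    std::size_t n = samples.size();
    std::size_t keep = static_cast<std::size_t>(n * keepFraction);
    if (keep == 0) keep = 1;
    std::size_t first = (n - keep) / 2;
    double sum = 0.0;
    for (std::size_t i = first; i < first + keep; ++i)
        sum += samples[i];
    return sum / keep;
}

int main()
{
    // e.g. wall-clock times (in seconds) of repeated runs of the same benchmark
    const double raw[] = { 19.9, 20.1, 20.0, 25.3, 19.8, 20.2, 31.0, 20.0 };
    std::vector<double> times(raw, raw + sizeof(raw) / sizeof(raw[0]));
    std::printf("trimmed mean: %g s\n", trimmedMean(times, 0.5));
    return 0;
}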
>
> As long as we are only developing along and need a trend it can be a bit
> more relaxed. But you should get a good performance monitor - which
> basically means Intel VTune:
>   http://www.intel.com/cd/software/products/asmo-na/eng/vtune/239145.htm
> (free for Linux -> "Free Non-Commercial Download")
>
> There you find out *why* one piece of code is slower than another. Take special
> care about cache hits and misses - they are the most likely cause of the
> described behaviour (once you know that the compiler's machine code
> generation is optimal - but VTune also shows you the assembly, so you
> can check it at the same time).
>
> So get the VTune and analyze the results. Once it's clear why one result
> is slower than the other we can try to optimize it.
>
> CU,
> Christian
>
> Benoît Jacob schrieb:
> > Hi List
> >
> > A lot of progress has happened since alpha1 -- much more than I expected
> > to remain to be done. I'll write more about this later, but now I would
> > like to discuss benchmarking.
> >
> > We now have two benchmarks in doc/ : benchmark.cpp is our traditional
> > benchmark on 3x3 fixed-size matrices, and benchmarkX.cpp is a 20x20
> > dynamic size variant.
> >
> > There is also a script, benchmark_suite, running these benchmarks several
> > times with various compile options:
> > *with and without -DNDEBUG (disabling asserts)
> > *with matrix storage order set to RowMajor and ColumnMajor
> >
> > I should stress that the matrix storage order influences not
> > only the storage of the coefficients, but also the traversal order when e.g.
> > copying matrices. Expressions are recursively aware of the preferred
> > traversal order.
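
To make the interplay between storage order and traversal order concrete, here is a small standalone sketch (plain C++, not the actual benchmark code): with a column-major layout, the loop nest whose inner loop runs over the row index walks memory contiguously, while the other nest strides through memory.

// Standalone illustration (not the Eigen benchmark itself) of how the
// traversal order interacts with a column-major storage layout.
#include <cstdio>

int main()
{
    const int rows = 20, cols = 20;
    static double a[rows * cols], b[rows * cols]; // zero-initialized

    // Column-major layout: element (i,j) is stored at index i + j*rows.

    // Traversal matching the storage order: the inner loop visits
    // consecutive memory addresses (cache-friendly for large matrices).
    for (int j = 0; j < cols; ++j)
        for (int i = 0; i < rows; ++i)
            b[i + j * rows] = 2.0 * a[i + j * rows];

    // Traversal against the storage order: the inner loop jumps by
    // 'rows' elements between accesses (strided; more cache misses once
    // the matrix no longer fits in cache).
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            b[i + j * rows] = 2.0 * a[i + j * rows];

    std::printf("%g\n", b[0]);
    return 0;
}

Note that for a 20x20 matrix of doubles the whole working set is only a few kilobytes and fits comfortably in L1 cache, which is consistent with the remark earlier in this thread that cache misses are unlikely to explain the numbers with only two small matrices.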
> >
> > The reason why I'm writing this is that this benchmark_suite gives me
> > some very unexpected results:
> >
> > gaston@kiwi:~/cuisine/branches/work/eigen2/doc$ g++ --version
> > g++ (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4)
> > gaston@kiwi:~/cuisine/branches/work/eigen2/doc$ ./benchmark_suite
> > Fixed size 3x3, ColumnMajor, -DNDEBUG
> >
> > real    0m19.942s
> > user    0m19.893s
> > sys     0m0.024s
> > Fixed size 3x3, ColumnMajor, with asserts
> >
> > real    0m32.434s
> > user    0m32.406s
> > sys     0m0.008s
> > Fixed size 3x3, RowMajor, -DNDEBUG
> >
> > real    0m21.497s
> > user    0m21.497s
> > sys     0m0.000s
> > Fixed size 3x3, RowMajor, with asserts
> >
> > real    0m32.133s
> > user    0m32.122s
> > sys     0m0.012s
> > Dynamic size 20x20, ColumnMajor, -DNDEBUG
> >
> > real    0m33.014s
> > user    0m33.006s
> > sys     0m0.000s
> > Dynamic size 20x20, ColumnMajor, with asserts
> >
> > real    0m27.599s
> > user     0m27.554s
> > sys     0m0.024s
> > Dynamic size 20x20, RowMajor, -DNDEBUG
> >
> > real    0m28.343s
> > user    0m28.342s
> > sys     0m0.000s
> > Dynamic size 20x20, RowMajor, with asserts
> >
> > real    0m26.597s
> > user    0m26.562s
> > sys     0m0.012s
> >
> > We see two strange things here, which I can't explain.
> >
> > First, with dynamic-size 20x20, disabling asserts (defining NDEBUG)
> > REDUCES speed! What's going on?
> >
> > Second, the storage order has a non-negligible impact. More precisely, with
> > 3x3 fixed-size, ColumnMajor is almost 10% faster than RowMajor, while with
> > 20x20 dynamic-size, RowMajor is faster than ColumnMajor! Also, how can one
> > explain the fact that RowMajor suffers less than ColumnMajor from the
> > slowdown induced by defining NDEBUG?
> >
> > All this is in SVN so please help me!
> >
> > Cheers,
> > Benoit




