Re: [eigen] On tvmet performance



I noticed that in b.cpp the matrix m had inf values, which could have biased the result.

I changed the factor 0.05 to 0.005, which solved this problem. The results are the same.

Benoit

On Wed, 29 Aug 2007, Benoit Jacob wrote:

Hi Gael,

You forgot to define NDEBUG when compiling. This is extremely important for performance, and can alone be enough to account for the slowness that you noticed. So your benchmarks don't show anything!

Defining NDEBUG turns off asserts. Eigen and tvmet both contain tons of asserts. For instance, when you access a coordinate of a vector by index, as in "vector[index]", an assert checks that the index falls within the allowed range. This is very slow, but useful for debugging.

With GCC, you are expected to know that and define NDEBUG yourself. Microsoft customers aren't assumed to be that smart, so MSVC automatically defines NDEBUG when you select "Release" mode.

I went further and made my own benchmark showing tvmet running 25% faster than Eigen1.

TVMET part:
command-line: g++ -I/home/kde4/kde/include/ a.cpp -O3 -DNDEBUG -o a
source code:

#include<iostream>
#include<tvmet/Matrix.h>
#include<tvmet/Vector.h>

using namespace std;
using namespace tvmet;

int main(int argc, char *argv[])
{
       Matrix<double,3,3> I;
       I = 1,0,0,
           0,1,0,
           0,0,1;
       Matrix<double,3,3> m;
       m = 1,2,3,
           4,5,6,
           7,8,9;
       for(int a = 0; a < 100000000; a++)
       {
               m = I + 0.05 * (m + m * m);
       }
       cout << m << endl;
       return 0;
}

Eigen1 part:
command line: g++ -I/home/kde4/kde/include/ b.cpp -O3 -DNDEBUG -o b
source code:
#include<iostream>
#include<eigen/matrix.h>

using namespace std;
using namespace Eigen;

int main(int argc, char *argv[])
{
       Matrix3d I;
       I.loadIdentity();
       Matrix3d m;
       m.loadRandom();
       for(int a = 0; a < 100000000; a++)
       {
               m = I + 0.05 * (m + m * m);
       }
       cout << m << endl;
       return 0;
}

These programs were run on my Core 1 Duo 1.66 GHz in "performance" mode, i.e. the CPU was locked to its maximal frequency.

Result:
TVMET: 6.1 seconds
Eigen1: 8.1 seconds.

So TVMET runs approximately 25% faster than Eigen1.

Cheers
Benoit

PS. You use operator* between vectors to do an element-wise multiplication. I have not implemented this in Eigen1, and I am removing it from tvmet for Eigen2, because it really doesn't correspond to anything meaningful from the point of view of mathematics. Of course I keep the dot product and the cross product, but those are a different thing, and I wouldn't call either of them "operator*".


On Wed, 29 Aug 2007, Gael Guennebaud wrote:

Hi all,

I've seen that you are going to use expression templates for fixed-size vectors via the Tiny Vector library (tvmet).
This puzzled me a bit, because I've never seen any performance issue with my own vector classes (a classic implementation) compared to hand-coded expressions.
So I ran (again) some basic comparisons between my own implementation, tvmet, and hand-coded expressions.
After playing a bit with Vector3f and Matrix4/Vector4 arithmetic expressions, my conclusion is that the tvmet implementation is ALWAYS at least slightly slower than mine, and sometimes much, much slower (10x).
So I'm not sure that using expression templates is still such a good idea for small vectors/matrices, since current compilers seem to do a very good job here.
Moreover, I think that with code based on tvmet it will be difficult to enable SSE optimizations...
Have you already compared the performance of Eigen1 and tvmet?


To be precise, let me show you the code of my (stupid) experiments:

   Vector3f aux, a, b, c, d;

   // vector code:
   for (uint k=0 ; k<10000000 ; ++k)
   {
       a += 1e-9f * ( (a+b)*(c+d) + (a+c)*(b+d)*(c+b) * (a-c)*(b-d)*(c-b)
            + (a*b)+(c*d) + (a*a-c)*(b+d*c)*(c*c-b) * (a*c)*(b*d)+(c*b) );
       b -= 1e-9f * a;
       c += 1e-9f * b;
       d -= 1e-9f * c;
       aux += a;
   }

  // hand coded code:
  #define OP(_X) \
       a[_X] += 1e-9 * ( (a[_X]+b[_X])*(c[_X]+d[_X]) \
            + (a[_X]+c[_X])*(b[_X]+d[_X])*(c[_X]+b[_X]) \
              * (a[_X]-c[_X])*(b[_X]-d[_X])*(c[_X]-b[_X]) \
            + (a[_X]*b[_X])+(c[_X]*d[_X]) \
            + (a[_X]*a[_X]-c[_X])*(b[_X]+d[_X]*c[_X])*(c[_X]*c[_X]-b[_X]) \
              * (a[_X]*c[_X])*(b[_X]*d[_X])+(c[_X]*b[_X]) ); \
       b[_X] -= 1e-9 * a[_X]; \
       c[_X] += 1e-9 * b[_X]; \
       d[_X] -= 1e-9 * c[_X]; \
       aux[_X] += a[_X];

  for (uint k=0 ; k<10000000 ; ++k)
  {
       OP(0);
       OP(1);
       OP(2);
  }

Compiler: g++ (GCC) 4.1.2, compiled with -O3
CPU: Intel(R) Core(TM)2 CPU T7200 (2.00 GHz)

Results:
- hand coded:       0.579s
- my vector class:  0.502s
- tvmet:            6.772s !!

Note that if I comment out the second line of the first (long) expression, then tvmet achieves much closer performance (0.37s vs 0.35s).
Actually, with tvmet and the long expression, the ASM code contains some calls to memcpy... a GCC issue?




Another example (Matrix*Vector):

 Vector4f acc, a[3], b[3];
 Matrix4f m0[3], m1[3];
 for (uint k=0 ; k<50000000 ; ++k)
 {
    acc += m1[k&0x3] * ((m0[k&0x3] * a[k&0x3]) * b[k&0x3]);
 }

Results:
- basic vector/matrix implementation: 1.24s
- tvmet:                              3.17s (the ASM looks OK)



A last one (Matrix*Matrix):

 Vector4f acc, a[3];
 Matrix4f m0[3], m1[3];
 for (uint k=0 ; k<50000000 ; ++k)
 {
    acc += (m1[k&0x3] * m0[k&0x3]) * a[k&0x3];
 }

Results:
- basic vector/matrix implementation: 2.56s
- tvmet:                              2.85s



By the way, by "classic/basic implementation" I mean something like:

struct Vector3f
{
    float x, y, z;

    inline Vector3f operator + (const Vector3f& v) const
    {
        Vector3f aux;
        aux.x = x + v.x;
        aux.y = y + v.y;
        aux.z = z + v.z;
        return aux;
    }
};


Gael.





