[eigen] On tvmet performance




Hi all,

I've seen that you are going to use expression templates for fixed-size vectors via tvmet (the Tiny Vector Matrix library).
This puzzled me a bit, because I have never seen any performance issue with my own vector classes (a classic implementation) compared to hand-coded expressions.
So I ran (again) some basic comparisons between my own implementation, tvmet, and hand-coded expressions.
After playing a bit with Vector3f and Matrix4/Vector4 arithmetic expressions, my conclusion is that the tvmet implementation is ALWAYS at least slightly slower than mine, and sometimes much, much slower (10x).
So I'm not sure that expression templates are such a good idea for small vectors/matrices, since current compilers seem to do a very good job here.
Moreover, I think that with code based on tvmet it will be difficult to enable SSE optimizations...
Have you already compared the performance of Eigen1 against tvmet?


To be precise, let me show you the code of my (stupid) experiments:

    Vector3f aux, a, b, c, d;

    // vector code:
    for (uint k=0 ; k<10000000 ; ++k)
    {
        a += 1e-9f * ( (a+b)*(c+d) + (a+c)*(b+d)*(c+b) * (a-c)*(b-d)*(c-b)
             + (a*b)+(c*d) + (a*a-c)*(b+d*c)*(c*c-b) * (a*c)*(b*d)+(c*b) );
        b -= 1e-9f * a;
        c += 1e-9f * b;
        d -= 1e-9f * c;
        aux += a;
    }

    // hand-coded:
    for (uint k=0 ; k<10000000 ; ++k)
    {
        #define OP(_X) a[_X] += 1e-9f * ( (a[_X]+b[_X])*(c[_X]+d[_X]) + (a[_X]+c[_X])*(b[_X]+d[_X])*(c[_X]+b[_X]) * (a[_X]-c[_X])*(b[_X]-d[_X])*(c[_X]-b[_X]) \
             + (a[_X]*b[_X])+(c[_X]*d[_X]) + (a[_X]*a[_X]-c[_X])*(b[_X]+d[_X]*c[_X])*(c[_X]*c[_X]-b[_X]) * (a[_X]*c[_X])*(b[_X]*d[_X])+(c[_X]*b[_X]) ); \
        b[_X] -= 1e-9f * a[_X];  c[_X] += 1e-9f * b[_X];  d[_X] -= 1e-9f * c[_X];  aux[_X] += a[_X];

        OP(0);
        OP(1);
        OP(2);
    }

Compiler: g++ (GCC) 4.1.2, compiled with -O3
CPU: Intel(R) Core(TM)2 CPU T7200 (2.00 GHz)

Results:
 - hand-coded:      0.579s
 - my vector class: 0.502s
 - tvmet:           6.772s !!

Note that if I comment out the second line of the first (long) expression, then tvmet achieves much closer performance (0.37s vs 0.35s).
Actually, with tvmet and the long expression, the generated assembly contains calls to memcpy... a GCC issue?




Another example (Matrix*Vector):

  Vector4f acc, a[4], b[4];
  Matrix4f m0[4], m1[4];
  for (uint k=0 ; k<50000000 ; ++k)
  {
     acc += m1[k&0x3] * ((m0[k&0x3] * a[k&0x3]) * b[k&0x3]);
  }

Results:
 - basic vector/matrix implementation: 1.24s
 - tvmet:                              3.17s (the ASM looks OK)



A last one (Matrix*Matrix):

  Vector4f acc, a[4];
  Matrix4f m0[4], m1[4];
  for (uint k=0 ; k<50000000 ; ++k)
  {
     acc += (m1[k&0x3] * m0[k&0x3]) * a[k&0x3];
  }

Results:
 - basic vector/matrix implementation: 2.56s
 - tvmet:                              2.85s



By the way, by "classic/basic implementation" I mean something like:

struct Vector3f
{
    float x, y, z;

    inline Vector3f operator + (const Vector3f& v) const
    {
        Vector3f aux;
        aux.x = x + v.x;
        aux.y = y + v.y;
        aux.z = z + v.z;
        return aux;
    }
};


Gael.


