Re: [eigen] On tvmet performance


To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: Re: [eigen] On tvmet performance
From: Andre Krause <post@xxxxxxxxxxxxxxxx>
Date: Wed, 29 Aug 2007 10:47:07 +0200
Organization: http://www.coreloop.com

Gael Guennebaud wrote:

Hi all,
I've seen that you are going to use expression templates for fixed-size vectors via the Tiny Vector library. This puzzled me a bit, because I've never seen any performance issue with my own vector classes (classic implementation) compared to hand-coded expressions.

So I ran (again) some basic comparisons between my own implementation, tvmet, and hand-coded expressions. After playing a bit with Vector3f and Matrix4/Vector4 arithmetic expressions, my conclusion is that the tvmet implementation is ALWAYS at least slightly slower than mine, and sometimes much slower (10x).

So I'm not sure that using expression templates is still such a good idea for small vectors/matrices, since current compilers seem to do a very good job here. Moreover, I think that with code based on tvmet it will be difficult to enable SSE optimizations...

Have you already compared the performance of Eigen1 and tvmet?
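
As a rough illustration of what "expression templates" means here, a minimal sketch follows. All names (VecExpr, Sum, Vec3, add_three) are hypothetical and this is not tvmet's actual code; it only shows the general idea that a + b + c builds a lightweight expression object which is evaluated element by element on assignment, instead of producing one temporary vector per operator.

```cpp
#include <cassert>
#include <cstddef>

// All names here (VecExpr, Sum, Vec3, add_three) are hypothetical;
// this is a sketch of the technique, not tvmet's implementation.

// CRTP base: tags a type as a vector expression so operator+ only matches these.
template <typename E>
struct VecExpr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Node for "l + r". It stores references only; element i is computed on demand.
template <typename L, typename R>
struct Sum : VecExpr<Sum<L, R> > {
    const L& l;
    const R& r;
    Sum(const L& l_, const R& r_) : l(l_), r(r_) {}
    float operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec3 : VecExpr<Vec3> {
    float data[3];
    Vec3(float x = 0, float y = 0, float z = 0) {
        data[0] = x; data[1] = y; data[2] = z;
    }
    // Evaluating an expression: one pass over the elements, no temporary Vec3.
    template <typename E>
    Vec3(const VecExpr<E>& e) {
        for (std::size_t i = 0; i < 3; ++i) data[i] = e.self()[i];
    }
    float operator[](std::size_t i) const { return data[i]; }
};

// Builds the expression node instead of computing the sum immediately.
template <typename L, typename R>
Sum<L, R> operator+(const VecExpr<L>& l, const VecExpr<R>& r) {
    return Sum<L, R>(l.self(), r.self());
}

// Usage: a + b + c has type Sum<Sum<Vec3, Vec3>, Vec3> and is evaluated
// once, when the Vec3 is constructed from it.
inline Vec3 add_three(const Vec3& a, const Vec3& b, const Vec3& c) {
    Vec3 result = a + b + c;
    return result;
}
```

Whether this beats a classic operator+ returning a Vec3 by value depends on how well the compiler inlines both variants, which is what the measurements below are probing.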


To be precise let me show you the code of my (stupid) experiments:

    Vector3f aux, a, b, c, d;

    // vector code:
    for (uint k=0 ; k<10000000 ; ++k)
    {
        a += 1e-9f * ( (a+b)*(c+d) + (a+c)*(b+d)*(c+b) * (a-c)*(b-d)*(c-b)
             + (a*b)+(c*d) + (a*a-c)*(b+d*c)*(c*c-b) * (a*c)*(b*d)+(c*b) );
        b -= 1e-9f * a;
        c += 1e-9f * b;
        d -= 1e-9f * c;
        aux += a;
    }

   // hand coded code:
   for (uint k=0 ; k<10000000 ; ++k)
   {
#define OP(_X) a[_X] += 1e-9 * ( (a[_X]+b[_X])*(c[_X]+d[_X]) + (a[_X]+c[_X])*(b[_X]+d[_X])*(c[_X]+b[_X]) * (a[_X]-c[_X])*(b[_X]-d[_X])*(c[_X]-b[_X]) \
             + (a[_X]*b[_X])+(c[_X]*d[_X]) + (a[_X]*a[_X]-c[_X])*(b[_X]+d[_X]*c[_X])*(c[_X]*c[_X]-b[_X]) * (a[_X]*c[_X])*(b[_X]*d[_X])+(c[_X]*b[_X]) ); \
        b[_X] -= 1e-9 * a[_X]; c[_X] += 1e-9 * b[_X]; d[_X] -= 1e-9 * c[_X]; aux[_X] += a[_X];
        OP(0);
        OP(1);
        OP(2);
   }

Compiler: g++ (GCC) 4.1.2,  compiled with -O3
CPU: Intel(R) Core(TM)2 CPU T7200 (2.00 GHz)

Results:
 - hand coded:      0.579s
 - my vector class: 0.502s
 - tvmet:           6.772s !!

Note that if I comment out the second line of the first (long) expression, then tvmet achieves closer performance (0.37s vs 0.35s). Actually, with tvmet and the long expression, the ASM code contains some calls to memcpy... a gcc issue?

Another example (Matrix*Vector):

  Vector4f acc, a[3], b[3];
  Matrix4f m0[3], m1[3];
  for (uint k=0 ; k<50000000 ; ++k)
  {
     acc += m1[k&0x3] * ((m0[k&0x3] * a[k&0x3]) * b[k&0x3]);
  }

Results:
 - basic vector/matrix implementation: 1.24s
 - tvmet:                              3.17s (the ASM looks OK)

A last one (Matrix*Matrix):

  Vector4f acc, a[3];
  Matrix4f m0[3], m1[3];
  for (uint k=0 ; k<50000000 ; ++k)
  {
     acc += (m1[k&0x3] * m0[k&0x3]) * a[k&0x3];
  }

Results:
 - basic vector/matrix implementation: 2.56s
 - tvmet:                              2.85s
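
The post does not say how these timings were taken. For anyone wanting to repeat such a comparison, a minimal timing helper might look like the sketch below. The name time_loop is hypothetical, and std::chrono::steady_clock requires a C++11 compiler, so this is an assumption about a modern setup rather than what was used for the numbers above.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: calls body(k) for k in [0, iterations) and
// returns the elapsed wall-clock time in seconds.
template <typename Body>
double time_loop(unsigned iterations, Body body) {
    const auto start = std::chrono::steady_clock::now();
    for (unsigned k = 0; k < iterations; ++k) {
        body(k);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

The loop body should accumulate into a value that is printed or otherwise used afterwards (like aux and acc in the snippets above); otherwise the optimizer is free to delete the work being measured.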



By the way, by "classic/basic implementation" I mean something like:

class Vector3f
{
public:
    float x, y, z;

    // Classic implementation: builds and returns a new vector by value.
    inline Vector3f operator + (const Vector3f& v) const
    {
        Vector3f aux;
        aux.x = x + v.x;
        aux.y = y + v.y;
        aux.z = z + v.z;
        return aux;
    }
};


Gael.

Very interesting. I would like to try the same on Windows with Visual C++ 2005. Could you please email me your code? I will post the results here.

I see that you code your vector classes using x, y, z members. Do you have any experience of whether it would be slower to implement a vector/matrix class using float data[3]; to hold the elements, and then do something like this:

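inline Vector3f operator + (const Vector3f& v) const
{
    Vector3f aux;
    // Access the storage array directly; no accessor operators are assumed.
    aux.data[0] = data[0] + v.data[0];
    aux.data[1] = data[1] + v.data[1];
    aux.data[2] = data[2] + v.data[2];
    return aux;
}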

? Normally I would assume that the compiler optimizes away the additional pointer addition, but who knows what really happens...

I am just asking because I saw someone implement a Matrix4x4 using float m11, m12, ... m44 instead of an array for internal storage. Maybe it is a tiny bit faster?
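
To make the question concrete, here is a small sketch of the two storage layouts side by side. The type names Vec3Named and Vec3Array are hypothetical. With constant indices, data[1] is at a fixed offset from the start of the object, just like a named member y, so one would expect an optimizing compiler to generate the same code for both; only looking at the generated assembly or measuring can confirm that for a given compiler.

```cpp
#include <cassert>

// Hypothetical types, for illustration only.

// Layout 1: one named member per component.
struct Vec3Named {
    float x, y, z;
    Vec3Named operator+(const Vec3Named& v) const {
        Vec3Named r;
        r.x = x + v.x;
        r.y = y + v.y;
        r.z = z + v.z;
        return r;
    }
};

// Layout 2: components stored in an array, accessed through operator[].
struct Vec3Array {
    float data[3];
    float& operator[](int i) { return data[i]; }
    float operator[](int i) const { return data[i]; }
    Vec3Array operator+(const Vec3Array& v) const {
        Vec3Array r;
        r[0] = data[0] + v[0];  // constant index: a fixed offset, known at compile time
        r[1] = data[1] + v[1];
        r[2] = data[2] + v[2];
        return r;
    }
};

// On mainstream compilers both layouts are three packed floats.
static_assert(sizeof(Vec3Named) == sizeof(Vec3Array),
              "expected identical size for both layouts");
```

The array form has the practical advantage that generic code (loops, templates over the dimension) can index the components, which is harder with m11 ... m44 style members.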

