Hi all,
I've seen that you are going to use expression templates for fixed-size
vectors via the Tiny Vector (tvmet) library.
This puzzled me a bit because I've never seen any performance issue with
my own vector classes (classic implementation) compared to hand coded
expressions.
So, I ran (again) some basic comparisons between my own implementation,
tvmet and hand coded expressions.
After playing a bit with Vector3f and Matrix4/Vector4 arithmetic
expressions, my conclusion is that the tvmet implementation is ALWAYS at
least slightly slower than mine and sometimes much much slower (10x).
So I'm not sure that using expression templates is still such a good idea
for small vectors/matrices, since current compilers seem to do a very good
job here.
Moreover, I think that with code based on tvmet it will be difficult
to enable SSE optimizations...
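To make the SSE remark concrete, here is a minimal sketch of what a hand-vectorized addition could look like in a classic 4-float vector class. `Vec4` and its layout are hypothetical names used only for illustration (they are neither tvmet nor Eigen types); the intrinsics come from `<xmmintrin.h>`, and a scalar fallback is used when SSE is not available.

```cpp
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps
#endif

// Hypothetical 4-float vector, for illustration only.
struct Vec4
{
    float v[4];

    Vec4 operator+(const Vec4& o) const
    {
        Vec4 r;
#if defined(__SSE__)
        // One packed add handles all four components at once.
        _mm_storeu_ps(r.v, _mm_add_ps(_mm_loadu_ps(v), _mm_loadu_ps(o.v)));
#else
        // Scalar fallback for targets without SSE.
        for (std::size_t i = 0; i < 4; ++i)
            r.v[i] = v[i] + o.v[i];
#endif
        return r;
    }
};
```

With a classic class, the SSE code stays local to each operator, which is why I expect it to be easier to add there than inside an expression-template evaluation loop.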
Have you already compared the performance of Eigen1 and tvmet?
To be precise let me show you the code of my (stupid) experiments:
Vector3f aux, a, b, c, d;
// vector code:
for (uint k=0 ; k<10000000 ; ++k)
{
a += 1e-9f * ( (a+b)*(c+d) + (a+c)*(b+d)*(c+b) * (a-c)*(b-d)*(c-b)
+ (a*b)+(c*d) + (a*a-c)*(b+d*c)*(c*c-b) * (a*c)*(b*d)+(c*b) );
b -= 1e-9f * a;
c += 1e-9f * b;
d -= 1e-9f * c;
aux += a;
}
// hand coded code:
for (uint k=0 ; k<10000000 ; ++k)
{
#define OP(_X) a[_X] += 1e-9 * ( (a[_X]+b[_X])*(c[_X]+d[_X]) \
+ (a[_X]+c[_X])*(b[_X]+d[_X])*(c[_X]+b[_X]) \
* (a[_X]-c[_X])*(b[_X]-d[_X])*(c[_X]-b[_X]) \
+ (a[_X]*b[_X])+(c[_X]*d[_X]) \
+ (a[_X]*a[_X]-c[_X])*(b[_X]+d[_X]*c[_X])*(c[_X]*c[_X]-b[_X]) \
* (a[_X]*c[_X])*(b[_X]*d[_X])+(c[_X]*b[_X]) ); \
b[_X] -= 1e-9 * a[_X]; \
c[_X] += 1e-9 * b[_X]; \
d[_X] -= 1e-9 * c[_X]; \
aux[_X] += a[_X];
OP(0);
OP(1);
OP(2);
}
Compiler: g++ (GCC) 4.1.2, compiled with -O3
CPU: Intel(R) Core(TM)2 CPU T7200 (2.00 GHz)
Results:
- hand coded: 0.579s
- my vector class: 0.502s
- tvmet: 6.772s !!
Note that if I comment out the second line of the first (long) expression,
then tvmet achieves much closer performance (0.37s vs 0.35s).
Actually, with tvmet and the long expression, the ASM code contains some
calls to memcpy... a gcc issue?
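For anyone who wants to reproduce these numbers, here is one possible way to time each loop. This is only a sketch: `time_seconds` is a hypothetical helper, it uses `std::chrono` (so it needs a C++11 compiler), and it is not necessarily how the figures above were measured.

```cpp
#include <chrono>

// Hypothetical helper: runs `body` once and returns the elapsed
// wall-clock time in seconds.
template <typename F>
double time_seconds(F body)
{
    const auto start = std::chrono::steady_clock::now();
    body();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Each loop would be wrapped in a lambda passed to `time_seconds`, and the result (`aux` here) should be printed or otherwise used afterwards, since the compiler may otherwise remove the whole computation.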
Another example (Matrix*Vector):
Vector4f acc, a[4], b[4];
Matrix4f m0[4], m1[4];
for (uint k=0 ; k<50000000 ; ++k)
{
acc += m1[k&0x3] * ((m0[k&0x3] * a[k&0x3]) * b[k&0x3]);
}
Results:
- basic vector/matrix implementation: 1.24s
- tvmet: 3.17s (the ASM looks OK)
A last one (Matrix*Matrix):
Vector4f acc, a[4];
Matrix4f m0[4], m1[4];
for (uint k=0 ; k<50000000 ; ++k)
{
acc += (m1[k&0x3] * m0[k&0x3]) * a[k&0x3];
}
Results:
- basic vector/matrix implementation: 2.56s
- tvmet: 2.85s
By the way, by "classic/basic implementation" I mean something like:
class Vector3f
{
public:
float x, y, z;
// Returns the component-wise sum as a new temporary vector.
inline Vector3f operator + (const Vector3f& v) const
{
Vector3f aux;
aux.x = x + v.x;
aux.y = y + v.y;
aux.z = z + v.z;
return aux;
}
};
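For contrast, the expression template approach avoids the temporary `aux`: `operator+` returns a lightweight node, and the sum is only computed component by component on assignment. Below is a minimal sketch of that idea. `Expr`, `Sum` and `Vec3` are hypothetical names, and this is a simplified illustration, not tvmet's actual code.

```cpp
#include <cstddef>

// CRTP base: tags every type that can appear in an expression.
template <typename E>
struct Expr
{
    const E& self() const { return static_cast<const E&>(*this); }
};

// Node representing lhs + rhs; nothing is computed until operator[] is called.
template <typename L, typename R>
struct Sum : Expr<Sum<L, R> >
{
    const L& lhs;
    const R& rhs;
    Sum(const L& l, const R& r) : lhs(l), rhs(r) {}
    float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

struct Vec3 : Expr<Vec3>
{
    float v[3];

    Vec3(float x = 0.f, float y = 0.f, float z = 0.f)
    {
        v[0] = x; v[1] = y; v[2] = z;
    }

    // Constructing from an expression evaluates it component by component.
    template <typename E>
    Vec3(const Expr<E>& e)
    {
        for (std::size_t i = 0; i < 3; ++i) v[i] = e.self()[i];
    }

    // Assigning from an expression does the same, without a temporary Vec3.
    template <typename E>
    Vec3& operator=(const Expr<E>& e)
    {
        for (std::size_t i = 0; i < 3; ++i) v[i] = e.self()[i];
        return *this;
    }

    float operator[](std::size_t i) const { return v[i]; }
};

// Builds an expression node instead of computing the sum immediately.
template <typename L, typename R>
Sum<L, R> operator+(const Expr<L>& l, const Expr<R>& r)
{
    return Sum<L, R>(l.self(), r.self());
}
```

With this sketch, `r = a + b + c;` builds a `Sum<Sum<Vec3, Vec3>, Vec3>` and evaluates it in a single loop over the three components. The nodes hold references, so an expression must be assigned within the same statement that creates it.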
Gael.