[eigen] On tvmet performance
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: [eigen] On tvmet performance
- From: "Gael Guennebaud" <gael.guennebaud@xxxxxxxxx>
- Date: Wed, 29 Aug 2007 01:43:24 +0200
Hi all,
I've seen that you are going to use expression templates for fixed-size vectors via the Tiny Vector Matrix library (tvmet).
This puzzled me a bit, because I've never seen any performance issue with my own vector classes (a classic implementation) compared to hand-coded expressions.
So I ran (again) some basic comparisons between my own implementation, tvmet, and hand-coded expressions.
After playing a bit with Vector3f and Matrix4/Vector4 arithmetic expressions, my conclusion is that the tvmet implementation is ALWAYS at least slightly slower than mine, and sometimes much, much slower (10x).
So I'm not sure that using expression templates is such a good idea for small vectors/matrices, since current compilers seem to do a very good job here.
Moreover, I think that with code based on tvmet it will be difficult to enable SSE optimizations...
Have you already compared the performance of Eigen1 and tvmet?
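To illustrate the SSE point: with a plain array-backed class it is straightforward to route an operator through SSE intrinsics by hand. A rough sketch, assuming x86 and g++ (this Vector4f is an illustration only, not the class used in the benchmarks below):

```cpp
#include <xmmintrin.h> // SSE intrinsics (x86)

// Illustrative 4-float vector whose operator+ maps to a single addps.
// Retrofitting this into a deeply nested expression-template design
// is much more intrusive, which is the concern raised above.
struct Vector4f
{
    float v[4];

    Vector4f operator + (const Vector4f& o) const
    {
        Vector4f r;
        _mm_storeu_ps(r.v, _mm_add_ps(_mm_loadu_ps(v), _mm_loadu_ps(o.v)));
        return r;
    }
};
```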
To be precise, let me show you the code of my (stupid) experiments:
Vector3f aux, a, b, c, d;
// vector code:
for (uint k=0 ; k<10000000 ; ++k)
{
a += 1e-9f * ( (a+b)*(c+d) + (a+c)*(b+d)*(c+b) * (a-c)*(b-d)*(c-b)
+ (a*b)+(c*d) + (a*a-c)*(b+d*c)*(c*c-b) * (a*c)*(b*d)+(c*b) );
b -= 1e-9f * a;
c += 1e-9f * b;
d -= 1e-9f * c;
aux += a;
}
// hand coded code:
for (uint k=0 ; k<10000000 ; ++k)
{
#define OP(_X) a[_X] += 1e-9 * ( (a[_X]+b[_X])*(c[_X]+d[_X]) + (a[_X]+c[_X])*(b[_X]+d[_X])*(c[_X]+b[_X]) * (a[_X]-c[_X])*(b[_X]-d[_X])*(c[_X]-b[_X]) \
+ (a[_X]*b[_X])+(c[_X]*d[_X]) + (a[_X]*a[_X]-c[_X])*(b[_X]+d[_X]*c[_X])*(c[_X]*c[_X]-b[_X]) * (a[_X]*c[_X])*(b[_X]*d[_X])+(c[_X]*b[_X]) ); \
b[_X] -= 1e-9 * a[_X]; c[_X] += 1e-9 * b[_X]; d[_X] -= 1e-9 * c[_X]; aux[_X] += a[_X];
OP(0);
OP(1);
OP(2);
}
Compiler: g++ (GCC) 4.1.2, compiled with -O3
CPU: Intel(R) Core(TM)2 CPU T7200 (2.00 GHz)
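The timings below were taken with a simple CPU-clock wrapper along these lines (the actual harness is not shown here, so this is only a guess at its shape; the name time_seconds is made up):

```cpp
#include <ctime>

// Minimal CPU-time measurement around a benchmark loop.
// Sketch only -- the real harness behind the numbers below isn't shown.
template <typename F>
double time_seconds(F f)
{
    std::clock_t t0 = std::clock();
    f();
    return double(std::clock() - t0) / CLOCKS_PER_SEC;
}
```

Used as, e.g., double t = time_seconds(run_vector_bench);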
Results:
- hand coded: 0.579s
- my vector class: 0.502s
- tvmet: 6.772s!!
Note that if I comment out the second line of the first (long) expression, then tvmet achieves much closer performance (0.37s vs 0.35s).
Actually, with tvmet and the long expression, the ASM code contains some calls to memcpy... a gcc issue?
Another example (Matrix*Vector):
Vector4f acc, a[4], b[4];
Matrix4f m0[4], m1[4];
for (uint k=0 ; k<50000000 ; ++k)
{
acc += m1[k&0x3] * ((m0[k&0x3] * a[k&0x3]) * b[k&0x3]);
}
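For reference, the "basic" Matrix4f * Vector4f product benchmarked here can be sketched as a plain eager loop (column-major storage assumed; the actual class used is not shown):

```cpp
// Plain, eager 4x4 matrix * vector product -- a sketch of the "basic
// vector/matrix implementation", not the actual benchmarked class.
struct Vector4f { float v[4]; };

struct Matrix4f
{
    float m[16]; // column-major: element (r, c) is m[c*4 + r]

    Vector4f operator * (const Vector4f& x) const
    {
        Vector4f r = {{0.f, 0.f, 0.f, 0.f}};
        for (int c = 0; c < 4; ++c)
            for (int i = 0; i < 4; ++i)
                r.v[i] += m[c*4 + i] * x.v[c];
        return r;
    }
};
```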
Results:
- basic vector/matrix implementation: 1.24s
- tvmet: 3.17s (the ASM looks OK)
A last one (Matrix*Matrix):
Vector4f acc, a[4];
Matrix4f m0[4], m1[4];
for (uint k=0 ; k<50000000 ; ++k)
{
acc += (m1[k&0x3] * m0[k&0x3]) * a[k&0x3];
}
Results:
- basic vector/matrix implementation: 2.56s
- tvmet: 2.85s
By the way, by "classic/basic implementation" I mean something like:
class Vector3f
{
public:
    float x, y, z;

    inline Vector3f operator + (const Vector3f& v) const
    {
        Vector3f aux;
        aux.x = x + v.x;
        aux.y = y + v.y;
        aux.z = z + v.z;
        return aux;
    }
};
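whereas the expression-template approach tvmet takes looks roughly like this (a toy Add node only, not tvmet's actual code): operator+ returns a lightweight node instead of a result, and assignment evaluates the whole tree component by component, so d = a + b + c compiles to a single loop with no temporaries.

```cpp
// Toy expression-template sketch (not tvmet's actual code).
struct Vec3;

template <typename L, typename R>
struct Add
{
    const L& l; const R& r;
    Add(const L& l_, const R& r_) : l(l_), r(r_) {}
    float operator [] (int i) const { return l[i] + r[i]; }
};

struct Vec3
{
    float v[3];
    Vec3() { v[0] = v[1] = v[2] = 0.f; }
    Vec3(float x, float y, float z) { v[0] = x; v[1] = y; v[2] = z; }

    float operator [] (int i) const { return v[i]; }

    // Assigning from any expression evaluates the tree in one loop.
    template <typename E>
    Vec3& operator = (const E& e)
    {
        for (int i = 0; i < 3; ++i) v[i] = e[i];
        return *this;
    }
};

Add<Vec3, Vec3> operator + (const Vec3& a, const Vec3& b)
{
    return Add<Vec3, Vec3>(a, b);
}

template <typename L, typename R>
Add<Add<L, R>, Vec3> operator + (const Add<L, R>& a, const Vec3& b)
{
    return Add<Add<L, R>, Vec3>(a, b);
}
```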
Gael.