Re: [eigen] New(?) way to make using SIMD easier
On 11/24/2009 11:51 AM, Benoit Jacob wrote:
> ah and also.
> if you just want a generic easy-to-use way of performing a SIMD
> operation on arrays in memory... then you can do even much simpler:
> just use Map and do your operation on that. like:
>
>   VectorXf::Map(dstPtr,num)
>     = VectorXf::Map(srcPtr1,num)
>     + VectorXf::Map(srcPtr2,num);
>
> that compiles to just what you wanted. well except that it adds some
> code to deal with unaligned boundaries; but if 'num' is known at
> compile time then you avoid that by using Matrix<float,num,1> instead
> of VectorXf.
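
To make the "unaligned boundaries" remark concrete, here is a rough
hand-written sketch of that kind of handling: a scalar prologue until the
destination is 16-byte aligned, a packet loop, then a scalar tail. It is
an illustration only, not the code Eigen actually generates. The name
add_with_peeling is made up for this example, and using unaligned loads
for both sources is an assumption chosen to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#ifdef __SSE__
#include <xmmintrin.h>
#endif

// Hypothetical helper (not part of Eigen): dst[i] = a[i] + b[i] for
// pointers of any alignment. Falls back to plain scalar code without SSE.
void add_with_peeling(float* dst, const float* a, const float* b, std::size_t n)
{
    assert(dst != nullptr && a != nullptr && b != nullptr);
    std::size_t i = 0;
#ifdef __SSE__
    // Scalar prologue: step forward until dst + i is on a 16-byte boundary.
    while (i < n && (reinterpret_cast<std::uintptr_t>(dst + i) & 15u) != 0) {
        dst[i] = a[i] + b[i];
        ++i;
    }
    // Packet body: aligned stores to dst, unaligned loads from the sources
    // (assumption: the sources may be misaligned relative to dst).
    for (; i + 4 <= n; i += 4) {
        _mm_store_ps(dst + i,
                     _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    }
#endif
    // Scalar epilogue: the remaining 0..3 elements (or everything without SSE).
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```

The extra branches and the prologue/tail loops are the overhead that a
fixed-size, known-aligned type lets the compiler drop.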
I love this syntax and was excited to start to use it more in some of
our legacy code.
....
Then I ran a benchmark comparing the speed of the above to that of a
very simple C-style function using SSE (see "vector_add" in the attached
testmap.cc). The simple function was *much* faster with both the Intel
compiler (11.0 20081105) and g++ (4.4.1 20090725). See the output
below.
I'm aware that this simple case does not showcase the metaprogramming
goodies that allow one to chain more complicated operations together.
That said, why can't Eigen come close to the speed of a simple
function when all one wants to do is add two vectors together?
The benchmark was run on a 2.5GHz Core2 Duo (T9400) laptop running
Fedora 11 x86 Linux.
gcc output:
g++ -I.. -O3 -msse -msse2 -msse3 -c -o testmap.o testmap.cc
g++ -o testmap testmap.o
../testmap
With simple function, iterations=6000000, elements=512 took 0.66328s. rate=4631.53 MS/s
With VectorXf::Map, iterations=6000000, elements=512 took 2.7824s. rate=1104.08 MS/s
With simple function, iterations=6000000, elements=512 took 0.689429s. rate=4455.86 MS/s
With VectorXf::Map, iterations=6000000, elements=512 took 2.72785s. rate=1126.16 MS/s

Intel output:
icpc -I.. -O3 -msse3 -c -o testmap.o testmap.cc
icpc -o testmap testmap.o
../testmap
With simple function, iterations=6000000, elements=512 took 0.861906s. rate=3564.19 MS/s
With VectorXf::Map, iterations=6000000, elements=512 took 1.9696s. rate=1559.71 MS/s
With simple function, iterations=6000000, elements=512 took 0.841174s. rate=3652.04 MS/s
With VectorXf::Map, iterations=6000000, elements=512 took 1.85856s. rate=1652.89 MS/s
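
The rate figures above are just iterations * elements / elapsed time,
scaled to millions of samples per second. A small sketch of that
arithmetic follows; the helper names rate_ms_per_s and slowdown are
hypothetical and do not appear in testmap.cc.

```cpp
#include <cassert>

// Throughput in millions of samples per second (MS/s).
// The product is formed in double so it cannot overflow an integer type.
double rate_ms_per_s(unsigned long iterations, unsigned long elements,
                     double seconds)
{
    assert(seconds > 0.0);
    return 1e-6 * static_cast<double>(iterations)
                * static_cast<double>(elements) / seconds;
}

// How many times longer the slow variant took than the fast one.
double slowdown(double slow_seconds, double fast_seconds)
{
    assert(fast_seconds > 0.0);
    return slow_seconds / fast_seconds;
}
```

For the first g++ pair, rate_ms_per_s(6000000, 512, 0.66328) comes out at
about 4631.5 MS/s and slowdown(2.7824, 0.66328) at about 4.2; the first
icpc pair gives a slowdown of about 2.3.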
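#include <malloc.h>      // memalign
#include <sys/time.h>    // gettimeofday
#include <time.h>
#include <cstddef>       // ptrdiff_t
#include <cstdio>        // perror
#include <cstdlib>       // rand, free
#include <iostream>
#include <string>
#ifdef __SSE__
#include <xmmintrin.h>   // _mm_load_ps, _mm_add_ps, _mm_store_ps
#endif
#include <Eigen/Core>
using namespace std;
using namespace Eigen;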
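// Wall-clock time in seconds, with microsecond resolution.
inline double curtime(void)
{
    struct timeval tv;
    if (gettimeofday(&tv, NULL) != 0)
        perror("gettimeofday");
    return (double)tv.tv_sec + (double)tv.tv_usec * .000001;
}

// Pointer value as an integer, used for the alignment test below.
inline ptrdiff_t ptr2int(const void * ptr)
{
    return (ptrdiff_t)ptr;
}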
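
// dst[k] = src1[k] + src2[k]. Uses SSE only when all three pointers are
// 16-byte aligned; otherwise the scalar loop handles every element.
void vector_add(float * dst, const float * src1, const float * src2, int n)
{
    int k = 0;
#ifdef __SSE__
    bool all_aligned = (0 == (15 & (ptr2int(dst) | ptr2int(src1) | ptr2int(src2))));
    if (all_aligned) {
        for (; k + 4 <= n; k += 4)
            _mm_store_ps(dst + k, _mm_add_ps(_mm_load_ps(src1 + k), _mm_load_ps(src2 + k)));
    }
#endif
    // Scalar loop: the tail after the SSE loop, or the whole array.
    for (; k < n; ++k)
        dst[k] = src1[k] + src2[k];
}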

int main(int argc, char ** argv)
{
    const unsigned int nel = 512;      // elements per vector
    const unsigned int nit = 6000000;  // timed iterations per test case
    double t0, t1;
    float * dstPtr  = (float*)memalign(16, nel * sizeof(float));
    float * srcPtr1 = (float*)memalign(16, nel * sizeof(float));
    float * srcPtr2 = (float*)memalign(16, nel * sizeof(float));
    // Even test cases time the simple function, odd ones time VectorXf::Map.
    for (int testcase = 0; testcase < 4; ++testcase) {
        for (unsigned int k = 0; k < nel; ++k) {
            dstPtr[k] = 0;
            srcPtr1[k] = rand();
            srcPtr2[k] = rand();
        }
        string testname;
        t0 = curtime();
        if (testcase & 1) {
            testname = "VectorXf::Map";
            for (unsigned int i = 0; i < nit; ++i) {
                VectorXf::Map(dstPtr, nel) = VectorXf::Map(srcPtr1, nel) + VectorXf::Map(srcPtr2, nel);
                //srcPtr1[i&(nel-1)] = dstPtr[0]; // keep the compiler from noticing that every iteration does the same work
            }
        } else {
            testname = "simple function";
            for (unsigned int i = 0; i < nit; ++i) {
                vector_add(dstPtr, srcPtr1, srcPtr2, nel);
                //srcPtr1[i&(nel-1)] = dstPtr[0]; // keep the compiler from noticing that every iteration does the same work
            }
        }
        t1 = curtime();
        // Rate in millions of samples per second; computed in double to avoid integer overflow.
        cout << " With " << testname << ", iterations=" << nit << ", elements=" << nel
             << " took " << (t1 - t0) << "s. rate=" << (1e-6 * double(nit) * double(nel) / (t1 - t0)) << " MS/s\n";
    }
    free(dstPtr);
    free(srcPtr1);
    free(srcPtr2);
    return 0;
}