Re: [eigen] New(?) way to make using SIMD easier



On 11/24/2009 11:51 AM, Benoit Jacob wrote:
ah and also.

if you just want a generic easy-to-use way of performing a SIMD
operation on arrays in memory... then you can do even much simpler:
just use Map and do your operation on that. like:

VectorXf::Map(dstPtr,num)
   = VectorXf::Map(srcPtr1,num)
   + VectorXf::Map(srcPtr2,num);

that compiles to just what you wanted. well except that it adds some
code to deal with unaligned boundaries; but if 'num' is known at
compile time then you avoid that by using Matrix<float,num,1>  instead
of VectorXf.

I love this syntax and was excited to start using it more in some of our legacy code.
...
Then I benchmarked the expression above against a very simple C-style function that uses SSE intrinsics (see "vector_add" in the attached testmap.cc). The simple function was *much* faster, roughly 4x with g++ (4.4.1 20090725) and roughly 2x with the Intel compiler (11.0 20081105). See the output below.

I'm aware that this simple case does not showcase the metaprogramming goodies that let one chain more complicated operations together. That said, why can't Eigen come close to the speed of a simple function when all one wants to do is add two vectors?

The benchmark was run on a 2.5GHz Core2 Duo (T9400) laptop running Fedora 11 x86 Linux.

gcc output:
g++ -I.. -O3 -msse -msse2 -msse3    -c -o testmap.o testmap.cc
g++ -o testmap testmap.o
../testmap
With simple function, iterations=6000000, elements=512 took 0.66328s. rate=4631.53 MS/s
With VectorXf::Map, iterations=6000000, elements=512 took 2.7824s. rate=1104.08 MS/s
With simple function, iterations=6000000, elements=512 took 0.689429s. rate=4455.86 MS/s
With VectorXf::Map, iterations=6000000, elements=512 took 2.72785s. rate=1126.16 MS/s

intel output:
icpc -I.. -O3 -msse3    -c -o testmap.o testmap.cc
icpc -o testmap testmap.o
../testmap
With simple function, iterations=6000000, elements=512 took 0.861906s. rate=3564.19 MS/s
With VectorXf::Map, iterations=6000000, elements=512 took 1.9696s. rate=1559.71 MS/s
With simple function, iterations=6000000, elements=512 took 0.841174s. rate=3652.04 MS/s
With VectorXf::Map, iterations=6000000, elements=512 took 1.85856s. rate=1652.89 MS/s

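#include <malloc.h>     // memalign
#include <sys/time.h>   // gettimeofday
#include <time.h>
#include <cstddef>      // ptrdiff_t
#include <cstdio>       // perror
#include <cstdlib>      // rand, free
#include <iostream>
#include <string>
#ifdef __SSE__
#include <xmmintrin.h>  // _mm_load_ps, _mm_add_ps, _mm_store_ps
#endif
#include <Eigen/Core>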

using namespace std;
using namespace Eigen;

inline double curtime(void)
{
    struct timeval tv;
    if ( gettimeofday(&tv, NULL) != 0)
        perror("gettimeofday");
    return (double)tv.tv_sec + (double)tv.tv_usec*.000001;
}

inline
ptrdiff_t ptr2int(const void * ptr)
{
    return (ptrdiff_t)ptr;
}

void vector_add(float * dst,const float * src1,const float * src2,int n)
{
    int k=0;
#ifdef __SSE__
    bool all_aligned = (0 == (15 & ( ptr2int(dst) | ptr2int(src1) | ptr2int(src2) ) ) );
    if (all_aligned) {
        for (; k+4<=n;k+=4)
            _mm_store_ps(dst+k, _mm_add_ps(_mm_load_ps(src1+k),_mm_load_ps(src2+k) ) );
    }
#endif
    for (;k<n;++k) 
        dst[k] = src1[k] + src2[k];
}

int main(int argc, char ** argv)
{
    const unsigned int nel = 512;
    const unsigned int nit = 6000000;
    double t0,t1;
    float * dstPtr = (float*)memalign(16,nel*sizeof(float));
    float * srcPtr1 = (float*)memalign(16,nel*sizeof(float));
    float * srcPtr2 = (float*)memalign(16,nel*sizeof(float));

    for (int testcase=0;testcase<4;++testcase) {
        for (int k=0;k<nel;++k) {
            dstPtr[k] = 0;
            srcPtr1[k] = rand();
            srcPtr2[k] = rand();
        }

        string testname;
        t0 = curtime();
        if (testcase&1) {
            testname = "VectorXf::Map";
            for (int i=0;i<nit;++i) {
                VectorXf::Map(dstPtr,nel) = VectorXf::Map(srcPtr1,nel) + VectorXf::Map(srcPtr2,nel);
                //srcPtr1[i&(nel-1)] = dstPtr[0]; // keep the compiler from noticing that every iteration does the same work
            }
        }else{
            testname = "simple function";
            for (int i=0;i<nit;++i) {
                vector_add(dstPtr,srcPtr1,srcPtr2,nel);
                //srcPtr1[i&(nel-1)] = dstPtr[0]; // keep the compiler from noticing that every iteration does the same work
            }
        }
        t1 = curtime();
        cout << " With " << testname << ", iterations=" << nit << ", elements=" << nel 
            << " took " << (t1-t0) <<"s. rate=" << (1e-6*(nit*nel)/(t1-t0))<<" MS/s\n";
    }
    free(dstPtr);
    free(srcPtr1);
    free(srcPtr2);
    return 0;
}

