Re: [eigen] New(?) way to make using SIMD easier

To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: Re: [eigen] New(?) way to make using SIMD easier
From: Mark Borgerding <mark@xxxxxxxxxxxxxx>
Date: Wed, 25 Nov 2009 12:32:50 -0500

On 11/24/2009 11:51 AM, Benoit Jacob wrote:

ah and also.

if you just want a generic easy-to-use way of performing a SIMD
operation on arrays in memory... then you can do even much simpler:
just use Map and do your operation on that. like:

VectorXf::Map(dstPtr,num)
   = VectorXf::Map(srcPtr1,num)
   + VectorXf::Map(srcPtr2,num);

that compiles to just what you wanted. well except that it adds some
code to deal with unaligned boundaries; but if 'num' is known at
compile time then you avoid that by using Matrix<float,num,1>  instead
of VectorXf.
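
For readers wondering what "code to deal with unaligned boundaries" can look like, here is a minimal sketch of one common approach: run a short scalar prologue until the destination is 16-byte aligned, then use aligned stores with unaligned loads, and finish with a scalar tail. The function name vector_add_any_alignment is hypothetical, and this is not taken from Eigen's sources; it only illustrates the general idea.

```cpp
#include <cstdint>
#ifdef __SSE__
#include <xmmintrin.h>
#endif

// Hypothetical helper (not from Eigen or from this thread): adds two float
// arrays element-wise without requiring any pointer to be 16-byte aligned.
void vector_add_any_alignment(float* dst, const float* src1,
                              const float* src2, int n)
{
    int k = 0;
#ifdef __SSE__
    // Scalar prologue: advance until dst + k sits on a 16-byte boundary.
    // Since floats are 4-byte aligned, this runs at most three times.
    while (k < n && (reinterpret_cast<std::uintptr_t>(dst + k) & 15) != 0) {
        dst[k] = src1[k] + src2[k];
        ++k;
    }
    // Packed main loop: the store is aligned, the loads may not be.
    for (; k + 4 <= n; k += 4)
        _mm_store_ps(dst + k,
                     _mm_add_ps(_mm_loadu_ps(src1 + k), _mm_loadu_ps(src2 + k)));
#endif
    // Scalar tail (or the whole array when SSE is unavailable).
    for (; k < n; ++k)
        dst[k] = src1[k] + src2[k];
}
```

The extra prologue and tail are the kind of overhead that a compile-time size can remove, assuming the data is also known to be aligned.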

I love this syntax and was excited to start to use it more in some of our legacy code.

....

Then, I did a benchmark comparing the speed of the above to that of a very simple C-style function using SSE (see "vector_add" in attached testmap.cc). The simple function was *much* faster with both the Intel compiler (11.0 20081105) and with g++ (4.4.1 20090725). See the output below.

I'm aware that this simple case does not showcase the metaprogramming goodies that allow one to chain more complicated operations together. With that said, why can't Eigen come close to the speed of a simple function when all one wants to do is add two vectors together?

The benchmark was on a 2.5GHz Core2 Duo (T9400) laptop running Fedora 11 x86 Linux.


gcc output:
g++ -I.. -O3 -msse -msse2 -msse3    -c -o testmap.o testmap.cc
g++ -o testmap testmap.o
../testmap

With simple function, iterations=6000000, elements=512 took 0.66328s. rate=4631.53 MS/s
With VectorXf::Map, iterations=6000000, elements=512 took 2.7824s. rate=1104.08 MS/s
With simple function, iterations=6000000, elements=512 took 0.689429s. rate=4455.86 MS/s
With VectorXf::Map, iterations=6000000, elements=512 took 2.72785s. rate=1126.16 MS/s


intel output:
icpc -I.. -O3 -msse3    -c -o testmap.o testmap.cc
icpc -o testmap testmap.o
../testmap

With simple function, iterations=6000000, elements=512 took 0.861906s. rate=3564.19 MS/s
With VectorXf::Map, iterations=6000000, elements=512 took 1.9696s. rate=1559.71 MS/s
With simple function, iterations=6000000, elements=512 took 0.841174s. rate=3652.04 MS/s
With VectorXf::Map, iterations=6000000, elements=512 took 1.85856s. rate=1652.89 MS/s
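
To put a number on "much faster", the reported timings can be turned into rates and ratios. The sketch below reuses the rate expression from testmap.cc; the helper names rate_ms_per_s and slowdown are hypothetical and exist only for this illustration.

```cpp
// Millions of samples per second, using the same expression as the
// benchmark's output line: 1e-6 * (iterations * elements) / seconds.
double rate_ms_per_s(double iterations, double elements, double seconds)
{
    return 1e-6 * iterations * elements / seconds;
}

// Hypothetical helper: how many times longer the VectorXf::Map loop took
// than the simple function for the same amount of work.
double slowdown(double map_seconds, double simple_seconds)
{
    return map_seconds / simple_seconds;
}
```

Using the first pair of runs from each compiler, slowdown(2.7824, 0.66328) is roughly 4.2 for g++ and slowdown(1.9696, 0.861906) is roughly 2.3 for icpc.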

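#include <malloc.h>     // memalign
#include <sys/time.h>   // gettimeofday
#include <time.h>
#include <cstddef>      // ptrdiff_t
#include <cstdio>       // perror
#include <cstdlib>      // rand, free
#include <iostream>
#include <string>
#ifdef __SSE__
#include <xmmintrin.h>  // _mm_load_ps, _mm_add_ps, _mm_store_ps
#endif
#include <Eigen/Core>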

using namespace std;
using namespace Eigen;

inline double curtime(void)
{
    struct timeval tv;
    if ( gettimeofday(&tv, NULL) != 0)
        perror("gettimeofday");
    return (double)tv.tv_sec + (double)tv.tv_usec*.000001;
}

inline
ptrdiff_t ptr2int(const void * ptr)
{
    return (ptrdiff_t)ptr;
}

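// Adds src1 and src2 element-wise into dst. Uses packed SSE adds when all
// three pointers are 16-byte aligned; otherwise only the scalar loop runs.
void vector_add(float * dst,const float * src1,const float * src2,int n)
{
    int k=0;
#ifdef __SSE__
    // Aligned loads and stores need every pointer to be a multiple of 16 bytes.
    bool all_aligned = (0 == (15 & ( ptr2int(dst) | ptr2int(src1) | ptr2int(src2) ) ) );
    if (all_aligned) {
        // Process four floats per iteration.
        for (; k+4<=n;k+=4)
            _mm_store_ps(dst+k, _mm_add_ps(_mm_load_ps(src1+k),_mm_load_ps(src2+k) ) );
    }
#endif
    // Scalar loop: handles the tail, or everything if the SSE path was skipped.
    for (;k<n;++k)
        dst[k] = src1[k] + src2[k];
}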

int main(int argc, char ** argv)
{
    const unsigned int nel = 512;
    const unsigned int nit = 6000000;
    double t0,t1;
    float * dstPtr = (float*)memalign(16,nel*sizeof(float));
    float * srcPtr1 = (float*)memalign(16,nel*sizeof(float));
    float * srcPtr2 = (float*)memalign(16,nel*sizeof(float));

    for (int testcase=0;testcase<4;++testcase) {
        for (int k=0;k<nel;++k) {
            dstPtr[k] = 0;
            srcPtr1[k] = rand();
            srcPtr2[k] = rand();
        }

        string testname;
        t0 = curtime();
        if (testcase&1) {
            testname = "VectorXf::Map";
            for (int i=0;i<nit;++i) {
                VectorXf::Map(dstPtr,nel) = VectorXf::Map(srcPtr1,nel) + VectorXf::Map(srcPtr2,nel);
                //srcPtr1[i&(nel-1)] = dstPtr[0]; // trick the compiler from knowing that it is doing the same thing over and over
            }
        }else{
            testname = "simple function";
            for (int i=0;i<nit;++i) {
                vector_add(dstPtr,srcPtr1,srcPtr2,nel);
                //srcPtr1[i&(nel-1)] = dstPtr[0]; // trick the compiler from knowing that it is doing the same thing over and over
            }
        }
        t1 = curtime();
        cout << " With " << testname << ", iterations=" << nit << ", elements=" << nel 
            << " took " << (t1-t0) <<"s. rate=" << (1e-6*(nit*nel)/(t1-t0))<<" MS/s\n";
    }
    free(dstPtr);
    free(srcPtr1);
    free(srcPtr2);
    return 0;
}
