Re: [eigen] New(?) way to make using SIMD easier

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]


2009/11/25 Mark Borgerding <mark@xxxxxxxxxxxxxx>:
> On 11/24/2009 11:51 AM, Benoit Jacob wrote:
>>
>> ah and also.
>>
>> if you just want a generic easy-to-use way of performing a SIMD
>> operation on arrays in memory... then you can do even much simpler:
>> just use Map and do your operation on that. like:
>>
>> VectorXf::Map(dstPtr,num)
>>   = VectorXf::Map(srcPtr1,num)
>>   + VectorXf::Map(srcPtr2,num);
>>
>> that compiles to just what you wanted. well except that it adds some
>> code to deal with unaligned boundaries; but if 'num' is known at
>> compile time then you avoid that by using Matrix<float,num,1>  instead
>> of VectorXf.
>>
>>
>
> I love this syntax and was excited to start to use it more in some of our
> legacy code.
> ...
> Then,  I did a benchmark  comparing the speed of the above to that of a very
>  simple C-style function using SSE(see "vector_add" in attached testmap..cc).
> The simple function was *much* faster with both the intel compiler (11.0
> 20081105)  and with g++ (4.4.1 20090725). See the output below.

That's because I forgot to tell you that when the pointers are known
to be aligned, you need to tell that to Eigen, otherwise it can't
guess it (at least not without incurring a constant overhead).

So just use MapAligned() instead of Map()  (note: that requires the
development branch). Actually I tried and now it has exactly the same
speed as your simple version:

$ g++ testmap.cc -I ../eigen -O2 -DNDEBUG -o t && ./t
 With simple function, iterations=6000000, elements=512 took 1.39999s.
rate=2194.3 MS/s
 With VectorXf::Map, iterations=6000000, elements=512 took 1.40002s.
rate=2194.25 MS/s
 With simple function, iterations=6000000, elements=512 took 1.39974s.
rate=2194.69 MS/s
 With VectorXf::Map, iterations=6000000, elements=512 took 1.39972s.
rate=2194.72 MS/s

> I'm aware that this simple case does not showcase the metaprogramming
> goodies that allow one to chain more complicated operations together.  With
> that said, why cannot Eigen come close to the speed of a simple function
> when all one wants to do is add two vectors together?

As you see, it can, provided you give it the same amount of
information at compile-time ;)

Benoit
#include <malloc.h>
#include <sys/time.h>
#include <time.h>
#include <iostream>
#include <Eigen/Core>

using namespace std;
using namespace Eigen;

inline double curtime(void)
{
    struct timeval tv;
    if ( gettimeofday(&tv, NULL) != 0)
        perror("gettimeofday");
    return (double)tv.tv_sec + (double)tv.tv_usec*.000001;
}

inline
ptrdiff_t ptr2int(const void * ptr)
{
    return (ptrdiff_t)ptr;
}

void vector_add(float * dst,const float * src1,const float * src2,int n)
{
    int k=0;
#ifdef __SSE__
    bool all_aligned = (0 == (15 & ( ptr2int(dst) | ptr2int(src1) | ptr2int(src2) ) ) );
    if (all_aligned) {
        for (; k+4<=n;k+=4)
            _mm_store_ps(dst+k, _mm_add_ps(_mm_load_ps(src1+k),_mm_load_ps(src2+k) ) );
    }
#endif
    for (;k<n;++k) 
        dst[k] = src1[k] + src2[k];
}

int main(int argc, char ** argv)
{
    const unsigned int nel = 512;
    const unsigned int nit = 6000000;
    double t0,t1,t2;
    float * dstPtr = (float*)memalign(16,nel*sizeof(float));
    float * srcPtr1 = (float*)memalign(16,nel*sizeof(float));
    float * srcPtr2 = (float*)memalign(16,nel*sizeof(float));

    for (int testcase=0;testcase<4;++testcase) {
        for (int k=0;k<nel;++k) {
            dstPtr[k] = 0;
            srcPtr1[k] = rand();
            srcPtr2[k] = rand();
        }

        string testname;
        t0 = curtime();
        if (testcase&1) {
            testname = "VectorXf::Map";
            for (int i=0;i<nit;++i) {
                VectorXf::MapAligned(dstPtr,nel) = VectorXf::MapAligned(srcPtr1,nel) + VectorXf::MapAligned(srcPtr2,nel);
                //srcPtr1[i&(nel-1)] = dstPtr[0]; // trick the compiler from knowing that it is doing the same thing over and over
            }
        }else{
            testname = "simple function";
            for (int i=0;i<nit;++i) {
                vector_add(dstPtr,srcPtr1,srcPtr2,nel);
                //srcPtr1[i&(nel-1)] = dstPtr[0]; // trick the compiler from knowing that it is doing the same thing over and over
            }
        }
        t1 = curtime();
        cout << " With " << testname << ", iterations=" << nit << ", elements=" << nel 
            << " took " << (t1-t0) <<"s. rate=" << (1e-6*(nit*nel)/(t1-t0))<<" MS/s\n";
    }
    free(dstPtr);
    free(srcPtr1);
    free(srcPtr2);
    return 0;
}


Mail converted by MHonArc 2.6.19+ http://listengine.tuxfamily.org/