Re: [eigen] New(?) way to make using SIMD easier
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] New(?) way to make using SIMD easier
- From: Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
- Date: Wed, 25 Nov 2009 16:08:57 -0500
2009/11/25 Mark Borgerding <mark@xxxxxxxxxxxxxx>:
> On 11/25/2009 12:43 PM, Benoit Jacob wrote:
>>
>> 2009/11/25 Mark Borgerding<mark@xxxxxxxxxxxxxx>:
>>
>>>
>>> On 11/24/2009 11:51 AM, Benoit Jacob wrote:
>>>
>>>>
>>>> VectorXf::Map(dstPtr,num)
>>>> = VectorXf::Map(srcPtr1,num)
>>>> + VectorXf::Map(srcPtr2,num);
>>>>
>>>
>>> I love this syntax and was excited to start to use it more in some of our
>>> legacy code.
>>> ...
>>> Then, I did a benchmark comparing the speed of the above to that of a
>>> very
>>> simple C-style function using SSE(see "vector_add" in attached
>>> testmap.cc).
>>> The simple function was *much* faster with both the intel compiler (11.0
>>> 20081105) and with g++ (4.4.1 20090725). See the output below.
>>>
>>
>> That's because I forgot to tell you that when the pointers are known
>> to be aligned, you need to tell that to Eigen, otherwise it can't
>> guess it (at least not without incurring a constant overhead).
>>
>> So just use MapAligned() instead of Map() (note: that requires the
>> development branch). Actually I tried and now it has exactly the same
>> speed as your simple version:
>>
>> $ g++ testmap.cc -I ../eigen -O2 -DNDEBUG -o t&& ./t
>>
>
> You did not use any -msse* flags. So neither version is using SIMD.
I am on a 64-bit machine, so SSE2 is implicit. Both versions are using
SIMD. Actually here is the assembly generated by the Eigen version:
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L21:
        movaps  (%rbp,%rax), %xmm0
        addps   (%r12,%rax), %xmm0
        movaps  %xmm0, (%rbx,%rax)
        addq    $16, %rax
        cmpq    $2048, %rax
        jne     .L21
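
For readers who prefer intrinsics to assembly: each iteration of the loop
above does an aligned load of four floats, adds four more, does an aligned
store of the result, and advances 16 bytes until 2048 bytes (512 floats) are
done. Below is a minimal sketch of the same thing in C++. It is not what
Eigen literally contains; the names is_aligned16 and add_aligned are made up
for illustration, and it assumes 16-byte-aligned pointers and a length that
is a multiple of 4. Without SSE it falls back to a plain scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Hypothetical helper (not part of Eigen): true if p is 16-byte aligned.
inline bool is_aligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical name. Computes dst[i] = a[i] + b[i] for i in [0, n).
// Assumption: all three pointers are 16-byte aligned and n % 4 == 0,
// as is the case for the 512-element buffers in the benchmark.
void add_aligned(float* dst, const float* a, const float* b, std::size_t n)
{
    assert(is_aligned16(dst) && is_aligned16(a) && is_aligned16(b));
    assert(n % 4 == 0);
#if defined(__SSE__)
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 x = _mm_load_ps(a + i);          // aligned load  (movaps)
        x = _mm_add_ps(x, _mm_load_ps(b + i));  // packed add    (addps)
        _mm_store_ps(dst + i, x);               // aligned store (movaps)
    }
#else
    // Scalar fallback when SSE is not available.
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
#endif
}
```

Calling add_aligned(dstPtr, srcPtr1, srcPtr2, 512) on 16-byte-aligned buffers
should give the same results as the MapAligned expression.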
> After switching to MapAligned ( from hg tip), it helped a little, but I
> still see almost a 2x difference.
>
> g++ -I.. -O3 -msse -msse2 -msse3 -c -o testmap.o testmap.cc
> g++ -o testmap testmap.o
> ./testmap
> With simple function, iterations=6000000, elements=512 took 0.690981s.
> rate=4445.85 MS/s
> With VectorXf::Map, iterations=6000000, elements=512 took 1.29193s..
> rate=2377.84 MS/s
> With simple function, iterations=6000000, elements=512 took 0.671556s.
> rate=4574.45 MS/s
> With VectorXf::Map, iterations=6000000, elements=512 took 1.27064s..
> rate=2417.67 MS/s
Strange. I can't reproduce this here, although I too have gcc 4.4,
even using the exact same command lines as you do.
Can you try -DNDEBUG? Here it makes a small but noticeable difference.
Otherwise the most likely explanation is the difference between x86
and x86-64. Can you generate the asm and send it? Find attached a
modified source file that emits asm comments at the right places (like
I used above).
Cheers
Benoit
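#include <malloc.h>    // memalign
#include <sys/time.h>  // gettimeofday
#include <time.h>
#include <cstddef>     // ptrdiff_t
#include <cstdio>      // perror
#include <cstdlib>     // rand, free
#include <iostream>
#include <string>
#include <Eigen/Core>
using namespace std;
using namespace Eigen;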
inline double curtime(void)
{
  struct timeval tv;
  if ( gettimeofday(&tv, NULL) != 0)
    perror("gettimeofday");
  return (double)tv.tv_sec + (double)tv.tv_usec*.000001;
}

inline
ptrdiff_t ptr2int(const void * ptr)
{
  return (ptrdiff_t)ptr;
}

void vector_add(float * dst,const float * src1,const float * src2,int n)
{
  int k=0;
#ifdef __SSE__
  bool all_aligned = (0 == (15 & ( ptr2int(dst) | ptr2int(src1) | ptr2int(src2) ) ) );
  if (all_aligned) {
    for (; k+4<=n;k+=4)
      _mm_store_ps(dst+k, _mm_add_ps(_mm_load_ps(src1+k),_mm_load_ps(src2+k) ) );
  }
#endif
  for (;k<n;++k)
    dst[k] = src1[k] + src2[k];
}

int main(int argc, char ** argv)
{
  const unsigned int nel = 512;
  const unsigned int nit = 6000000;
  double t0,t1;
  float * dstPtr = (float*)memalign(16,nel*sizeof(float));
  float * srcPtr1 = (float*)memalign(16,nel*sizeof(float));
  float * srcPtr2 = (float*)memalign(16,nel*sizeof(float));

  for (int testcase=0;testcase<4;++testcase) {
    for (unsigned int k=0;k<nel;++k) {
      dstPtr[k] = 0;
      srcPtr1[k] = rand();
      srcPtr2[k] = rand();
    }
    string testname;
    t0 = curtime();
    if (testcase&1) {
      testname = "VectorXf::Map";
      for (unsigned int i=0;i<nit;++i) {
        EIGEN_ASM_COMMENT("begin eigen");
        VectorXf::MapAligned(dstPtr,nel) = VectorXf::MapAligned(srcPtr1,nel) + VectorXf::MapAligned(srcPtr2,nel);
        //srcPtr1[i&(nel-1)] = dstPtr[0]; // keep the compiler from noticing that it is doing the same thing over and over
        EIGEN_ASM_COMMENT("end eigen");
      }
    }else{
      testname = "simple function";
      for (unsigned int i=0;i<nit;++i) {
        vector_add(dstPtr,srcPtr1,srcPtr2,nel);
        //srcPtr1[i&(nel-1)] = dstPtr[0]; // keep the compiler from noticing that it is doing the same thing over and over
      }
    }
    t1 = curtime();
    cout << " With " << testname << ", iterations=" << nit << ", elements=" << nel
         << " took " << (t1-t0) <<"s. rate=" << (1e-6*(nit*nel)/(t1-t0))<<" MS/s\n";
  }
  free(dstPtr);
  free(srcPtr1);
  free(srcPtr2);
  return 0;
}
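
A note on the reported rate: it is 1e-6 * iterations * elements / seconds,
i.e. millions of samples per second. The sketch below isolates that
arithmetic; rate_msps is a hypothetical helper, not part of the attached
file. It takes doubles, which also sidesteps the fact that nit*nel in the
file above is an unsigned int product (assuming a 32-bit unsigned int,
3072000000 still fits, but larger counts would wrap around).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: throughput in millions of samples per second (MS/s).
// iterations and elements are passed as doubles so the product cannot wrap.
double rate_msps(double iterations, double elements, double seconds)
{
    assert(seconds > 0.0);
    return 1e-6 * iterations * elements / seconds;
}
```

For example, 6000000 iterations of 512 elements in 0.690981 s comes out at
about 4445.85 MS/s, which matches the first line of the quoted output.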