Re: [eigen] std::complex vectorization braindump
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] std::complex vectorization braindump
- From: Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
- Date: Thu, 14 Jan 2010 12:00:16 -0500
2010/1/14 Mark Borgerding <mark@xxxxxxxxxxxxxx>:
> On 01/13/2010 10:08 PM, Benoit Jacob wrote:
>> 2010/1/13 Mark Borgerding <mark@xxxxxxxxxxxxxx>:
>>> I've found a certain amount of loop unrolling very beneficial to
>>> speed, even with SIMD. E.g. loading 4 SIMD registers, working on
>>> them, then storing the results can be much faster than doing them
>>> one at a time.
>>
>> I see. This kind of unrolling is what we called peeling, and I can
>> believe that in some cases it brings benefits. Normally I would like
>> to keep peeling completely orthogonal to vectorization,
>
> If you mean the definition of "loop peeling" given at
> http://en.wikipedia.org/wiki/Loop_splitting
> that is not what I am talking about.
No no, by "peeling" I really meant exactly what you describe below, not
what this Wikipedia page is about. So perhaps "peeling" is not the right
word. How about "partial unrolling"?
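
To pin the terminology down, here is a minimal sketch of the distinction
(illustrative code, names are mine): "peeling" in the loop-splitting sense
runs a few scalar iterations up front to reach an aligned address, while
what you describe keeps one aligned loop but processes several packets per
iteration:

#include <xmmintrin.h>  // SSE intrinsics
#include <stdint.h>

// Peeling in the Wikipedia/loop-splitting sense: scalar prologue until dst
// is 16-byte aligned, then one aligned packet per iteration. Assumes src1
// and src2 share dst's alignment offset.
void add_peeled(float* dst, const float* src1, const float* src2, int n)
{
    int k = 0;
    while (k < n && ((uintptr_t)(dst + k) & 15) != 0) {  // peel to alignment
        dst[k] = src1[k] + src2[k];
        ++k;
    }
    for (; k + 4 <= n; k += 4)  // aligned main loop, one packet at a time
        _mm_store_ps(dst + k,
                     _mm_add_ps(_mm_load_ps(src1 + k), _mm_load_ps(src2 + k)));
    for (; k < n; ++k)  // scalar epilogue for the tail
        dst[k] = src1[k] + src2[k];
}

"Partial unrolling" is then your UNROLL4 loop below: same alignment story,
but several packets in flight per iteration.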
Benoit
>
> Example code may illustrate my point. The UNROLL4 simd loop below takes
> about 25% less time than both the Eigen MapAligned path and the
> elementwise simd loop.
>
> void vector_add(float * dst, const float * src1, const float * src2, int n)
> {
>     int k = 0;
>     // ptr2int is a pointer-to-integer cast, e.g. reinterpret_cast<uintptr_t>
>     bool all_aligned = (0 == (15 & (ptr2int(dst) | ptr2int(src1) | ptr2int(src2))));
>     if (all_aligned) {
> # ifdef USE_EIGEN
>         VectorXf::MapAligned(dst,n) = VectorXf::MapAligned(src1,n)
>                                     + VectorXf::MapAligned(src2,n);
>         // eigen takes care of the remaining samples after alignment ends
>         return;
> # elif defined( UNROLL4 )
>         // process 16 floats (4 SSE packets) per loop iteration
>         for (; k+16 <= n; k += 16) {
>             __m128 a = _mm_add_ps(_mm_load_ps(src1+k),    _mm_load_ps(src2+k));
>             __m128 b = _mm_add_ps(_mm_load_ps(src1+k+4),  _mm_load_ps(src2+k+4));
>             __m128 c = _mm_add_ps(_mm_load_ps(src1+k+8),  _mm_load_ps(src2+k+8));
>             __m128 d = _mm_add_ps(_mm_load_ps(src1+k+12), _mm_load_ps(src2+k+12));
>             _mm_store_ps(dst+k,    a);
>             _mm_store_ps(dst+k+4,  b);
>             _mm_store_ps(dst+k+8,  c);
>             _mm_store_ps(dst+k+12, d);
>         }
> # else
>         // one simd element (4 floats) at a time
>         for (; k+4 <= n; k += 4)
>             _mm_store_ps(dst+k, _mm_add_ps(_mm_load_ps(src1+k), _mm_load_ps(src2+k)));
> # endif
>     }
>     // scalar loop handles the unaligned case and any leftover tail
>     for (; k < n; ++k)
>         dst[k] = src1[k] + src2[k];
> }
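
Right, and one could express this generically, with the unroll factor as a
compile-time parameter. A rough sketch (mine, not actual Eigen code; it
assumes 16-byte-aligned pointers, like your all_aligned branch):

#include <xmmintrin.h>

template <int UnrollBy>
void vector_add_unrolled(float* dst, const float* src1, const float* src2, int n)
{
    const int step = 4 * UnrollBy;  // 4 floats per SSE packet
    int k = 0;
    for (; k + step <= n; k += step) {
        __m128 acc[UnrollBy];
        for (int u = 0; u < UnrollBy; ++u)  // constant bound: compiler unrolls this
            acc[u] = _mm_add_ps(_mm_load_ps(src1 + k + 4*u),
                                _mm_load_ps(src2 + k + 4*u));
        for (int u = 0; u < UnrollBy; ++u)
            _mm_store_ps(dst + k + 4*u, acc[u]);
    }
    for (; k < n; ++k)  // scalar tail
        dst[k] = src1[k] + src2[k];
}

vector_add_unrolled<4> reproduces your UNROLL4 loop; vector_add_unrolled<1>
is the plain elementwise one.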
>
> Test specifics:
>
> - n = 512 floats (small enough to fit into cache, to avoid testing memory speed)
> - core2 cpu
> - linux 32-bit, g++ 4.4.2
> - -DNDEBUG -O3 -msse -msse2 -msse3
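
A harness along these lines should reproduce that setup; this is just a
sketch, with the repetition count and names made up (and GCC's aligned
attribute, matching the g++ build above):

#include <cstdio>
#include <ctime>

void vector_add(float* dst, const float* src1, const float* src2, int n);

int main()
{
    const int N = 512;  // small enough to stay in cache
    static float a[N] __attribute__((aligned(16)));
    static float b[N] __attribute__((aligned(16)));
    static float c[N] __attribute__((aligned(16)));
    for (int i = 0; i < N; ++i) { a[i] = float(i); b[i] = float(2*i); }

    const int reps = 1000000;  // arbitrary, just large enough for stable timing
    std::clock_t t0 = std::clock();
    for (int r = 0; r < reps; ++r)
        vector_add(c, a, b, N);
    double secs = double(std::clock() - t0) / CLOCKS_PER_SEC;
    std::printf("%d x vector_add of %d floats: %.3f s\n", reps, N, secs);
    return 0;
}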
>
> For you assembly gurus:
>
> the elementwise simd loop and the Eigen MapAligned code both compile to
> code like this:
>
> .L6:
> # basic block 5
> movaps (%eax,%ecx,4), %xmm0 #* src1, tmp110
> addps (%edx,%ecx,4), %xmm0 #* src2, tmp110
> movaps %xmm0, (%ebx,%ecx,4) # tmp110,* dst
> addl $4, %ecx #, index
> cmpl %ecx, %esi # index, index
> jg .L6 #,
>
> and the unrolled simd loop compiles to:
>
> .L8:
> # basic block 5
> movaps (%edi,%eax,4), %xmm3 #* src1, tmp100
> movaps 16(%edi,%eax,4), %xmm2 #, tmp103
> movaps 32(%edi,%eax,4), %xmm1 #, tmp106
> movaps 48(%edi,%eax,4), %xmm0 #, tmp109
> addps (%esi,%eax,4), %xmm3 #* src2, tmp100
> addps 16(%esi,%eax,4), %xmm2 #, tmp103
> addps 32(%esi,%eax,4), %xmm1 #, tmp106
> addps 48(%esi,%eax,4), %xmm0 #, tmp109
> movaps %xmm3, (%ebx,%eax,4) # tmp100,* dst
> movaps %xmm2, 16(%ebx,%eax,4) # tmp103,
> movaps %xmm1, 32(%ebx,%eax,4) # tmp106,
> movaps %xmm0, 48(%ebx,%eax,4) # tmp109,
> addl $16, %eax #, k
> cmpl %edx, %eax # D.54688, k
> jne .L8 #,
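
Incidentally, the assembly makes the 25% plausible: the unrolled loop runs
four independent load/add/store chains per iteration, so the addps latencies
can overlap, and the loop overhead (addl/cmpl/branch) is paid once per 16
floats instead of once per 4.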