Re: [eigen] std::complex vectorization braindump


On Thu, Jan 14, 2010 at 5:49 PM, Mark Borgerding <mark@xxxxxxxxxxxxxx> wrote:
> On 01/13/2010 10:08 PM, Benoit Jacob wrote:
>
> 2010/1/13 Mark Borgerding <mark@xxxxxxxxxxxxxx>:
>
>
> I've found a certain amount of loop unrolling very beneficial to speed, even
> with SIMD.
> e.g. loading 4 SIMD registers, working on them, then storing the results can
> be much faster than doing them one at a time.
>
>
> I see. This kind of unrolling is what we called peeling and I can
> believe that in some cases it brings benefits. Normally I would like
> to keep peeling completely orthogonal to vectorization,
>
>
> If you mean the definition of "loop peeling" given at
> http://en.wikipedia.org/wiki/Loop_splitting
> That is not what I am talking about.
>
> Example code may illustrate my point. The UNROLL4 simd loop below takes
> about 25% less time than both the Eigen MapAligned path and the elementwise
> simd loop.
>
> void vector_add(float * dst, const float * src1, const float * src2, int n)
> {
>     int k=0;
>     bool all_aligned = (0 == (15 & ( ptr2int(dst) | ptr2int(src1) | ptr2int(src2) ) ) );
>     if (all_aligned) {
> # ifdef USE_EIGEN
>         VectorXf::MapAligned(dst,n) = VectorXf::MapAligned(src1,n) + VectorXf::MapAligned(src2,n);
>         return; // eigen takes care of the remaining samples after alignment ends
> # elif defined( UNROLL4 )
>         // process 16 floats per loop
>         for (; k+16<=n; k+=16) {
>             __m128 a = _mm_add_ps(_mm_load_ps(src1+k),    _mm_load_ps(src2+k) );
>             __m128 b = _mm_add_ps(_mm_load_ps(src1+k+4),  _mm_load_ps(src2+k+4) );
>             __m128 c = _mm_add_ps(_mm_load_ps(src1+k+8),  _mm_load_ps(src2+k+8) );
>             __m128 d = _mm_add_ps(_mm_load_ps(src1+k+12), _mm_load_ps(src2+k+12) );
>             _mm_store_ps(dst+k,    a);
>             _mm_store_ps(dst+k+4,  b);
>             _mm_store_ps(dst+k+8,  c);
>             _mm_store_ps(dst+k+12, d);
>         }
> # else
>         // one simd element (4 floats) at a time
>         for (; k+4<=n; k+=4)
>             _mm_store_ps(dst+k, _mm_add_ps(_mm_load_ps(src1+k), _mm_load_ps(src2+k)) );
> # endif
>     }
>     for (; k<n; ++k)
>         dst[k] = src1[k] + src2[k];
> }
>
>
> test specifics
>
> n=512 floats ( small enough to fit into cache to avoid testing memory speed)
> core2 cpu
> linux 32 bit g++ 4.4.2
> -DNDEBUG  -O3 -msse -msse2 -msse3
>
> For you assembly gurus:
>
> the elementwise simd loop and Eigen MapAligned code both compile to code
> like this
>
> .L6:
>         # basic block 5
>         movaps  (%eax,%ecx,4), %xmm0    #* src1, tmp110
>         addps   (%edx,%ecx,4), %xmm0    #* src2, tmp110
>         movaps  %xmm0, (%ebx,%ecx,4)    # tmp110,* dst
>         addl    $4, %ecx        #, index
>         cmpl    %ecx, %esi      # index, index
>         jg      .L6     #,
>
> and the unrolled simd loop compiles to
>
> .L8:
>         # basic block 5
>         movaps  (%edi,%eax,4), %xmm3    #* src1, tmp100
>         movaps  16(%edi,%eax,4), %xmm2  #, tmp103
>         movaps  32(%edi,%eax,4), %xmm1  #, tmp106
>         movaps  48(%edi,%eax,4), %xmm0  #, tmp109
>         addps   (%esi,%eax,4), %xmm3    #* src2, tmp100
>         addps   16(%esi,%eax,4), %xmm2  #, tmp103
>         addps   32(%esi,%eax,4), %xmm1  #, tmp106
>         addps   48(%esi,%eax,4), %xmm0  #, tmp109
>         movaps  %xmm3, (%ebx,%eax,4)    # tmp100,* dst
>         movaps  %xmm2, 16(%ebx,%eax,4)  # tmp103,
>         movaps  %xmm1, 32(%ebx,%eax,4)  # tmp106,
>         movaps  %xmm0, 48(%ebx,%eax,4)  # tmp109,
>         addl    $16, %eax       #, k
>         cmpl    %edx, %eax      # D.54688, k
>         jne     .L8     #,
>

Yes, I'm totally aware of this kind of optimization and, as you can
guess, we already use it a lot for the matrix products. However, doing
so in a general fashion, i.e., for all kinds of expressions, is far
from trivial. Ideally we would simply partially unroll our evaluation
loops. Let's have a look at the default unvectorized one (as Benoit
said, this problem is completely orthogonal to the vectorization):

for(int j = 0; j < cols; ++j)
  for(int i = 0; i < rows; ++i)
    dst.coeffRef(i, j) = src.coeff(i,j);

Here dst is the destination expression (it can be a Matrix, a
Block<Matrix,XXX>, a Map, etc.), and src is the right-hand-side
expression, which can mix an arbitrary number of matrices and
sub-matrices. Note that this is a simplified version which assumes a
column-major destination matrix. A partially unrolled version would be:

for(int j = 0; j < cols; ++j)
{
  int i = 0;
  for(; i + 4 <= rows; i += 4)
  {
    dst.coeffRef(i+0, j) = src.coeff(i+0, j);
    dst.coeffRef(i+1, j) = src.coeff(i+1, j);
    dst.coeffRef(i+2, j) = src.coeff(i+2, j);
    dst.coeffRef(i+3, j) = src.coeff(i+3, j);
  }
  for(; i < rows; ++i) // remaining coefficients
    dst.coeffRef(i, j) = src.coeff(i, j);
}

The problem is that when I tried that with GCC and ICC, even if src is
a simple expression like a + b, neither compiler managed to generate
optimized code: the 4 unrolled operations were performed sequentially,
reusing the same two registers. In that case no gain can be expected.
Yes, that's a pity.

Trying to overcome this compiler limitation by generating our own
optimized code would be very complicated, because src can be any
arbitrary expression.
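
For illustration only, here is a rough sketch (mine, not actual Eigen
code) of the kind of kernel one would have to generate by hand for the
single special case where src is a float sum a + b, assuming
column-major storage and 16-byte-aligned columns; it is essentially
Mark's UNROLL4 loop applied column by column. Every other expression
type, scalar type and alignment situation would need its own such
variant, which is why doing this generically is not realistic:

#include <xmmintrin.h>

// Hand-unrolled kernel for the single case dst = a + b on floats
// (illustrative sketch; assumes column-major data and 16-byte-aligned columns).
void assign_sum_unrolled(float* dst, const float* a, const float* b,
                         int rows, int cols)
{
  for(int j = 0; j < cols; ++j)
  {
    const float* aj = a   + j*rows;
    const float* bj = b   + j*rows;
    float*       dj = dst + j*rows;
    int i = 0;
    for(; i + 16 <= rows; i += 16)
    {
      // four independent packets, so the compiler can map them to distinct registers
      __m128 p0 = _mm_add_ps(_mm_load_ps(aj+i   ), _mm_load_ps(bj+i   ));
      __m128 p1 = _mm_add_ps(_mm_load_ps(aj+i+ 4), _mm_load_ps(bj+i+ 4));
      __m128 p2 = _mm_add_ps(_mm_load_ps(aj+i+ 8), _mm_load_ps(bj+i+ 8));
      __m128 p3 = _mm_add_ps(_mm_load_ps(aj+i+12), _mm_load_ps(bj+i+12));
      _mm_store_ps(dj+i,    p0);
      _mm_store_ps(dj+i+ 4, p1);
      _mm_store_ps(dj+i+ 8, p2);
      _mm_store_ps(dj+i+12, p3);
    }
    for(; i < rows; ++i) // leftover coefficients
      dj[i] = aj[i] + bj[i];
  }
}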

Also note that this kind of optimization is worth it only if the
in/out data lies in the CPU cache. This is shown by our benchmark
(http://download.tuxfamily.org/eigen/btl-results/axpy.pdf) for Y += a
* X. There, Eigen and GOTO (or MKL) differ only in that GOTO (or MKL)
partially unrolls the main loop (to be precise, only two steps are
unrolled), and GOTO (or MKL) is faster only for vector sizes in the
range 300-4000.
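
To make the kind of unrolling referred to above concrete, here is a
rough sketch (mine; not GOTO, MKL, or Eigen source) of an aligned
single-precision Y += a*X kernel with the main loop unrolled by two
SIMD steps:

#include <xmmintrin.h>

// Y += a * X for floats; x and y are assumed 16-byte aligned (illustrative sketch).
void axpy_unroll2(float* y, const float* x, float a, int n)
{
  const __m128 pa = _mm_set1_ps(a);  // broadcast the scalar a
  int k = 0;
  for(; k + 8 <= n; k += 8)          // two 4-float SIMD steps per iteration
  {
    __m128 y0 = _mm_add_ps(_mm_load_ps(y+k  ), _mm_mul_ps(pa, _mm_load_ps(x+k  )));
    __m128 y1 = _mm_add_ps(_mm_load_ps(y+k+4), _mm_mul_ps(pa, _mm_load_ps(x+k+4)));
    _mm_store_ps(y+k,   y0);
    _mm_store_ps(y+k+4, y1);
  }
  for(; k < n; ++k)                  // leftover coefficients
    y[k] += a * x[k];
}

Whether the two-step version wins depends on the data fitting in the
cache, as noted above; once the loop is memory bound, the unrolling no
longer pays off.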

Perhaps GCC 4.5 is better at that now?

gael.


