Re: [eigen] std::complex vectorization braindump



On Fri, Jan 15, 2010 at 9:44 AM, Gael Guennebaud
<gael.guennebaud@xxxxxxxxx> wrote:
> On Thu, Jan 14, 2010 at 5:49 PM, Mark Borgerding <mark@xxxxxxxxxxxxxx> wrote:
>> On 01/13/2010 10:08 PM, Benoit Jacob wrote:
>>
>> 2010/1/13 Mark Borgerding <mark@xxxxxxxxxxxxxx>:
>>
>>
>> I've found a certain amount of loop unrolling very beneficial to speed, even
>> with SIMD.
>> e.g. loading 4 SIMD registers, working on them, then storing the results can
>> be much faster than doing them one at a time.
>>
>>
>> I see. This kind of unrolling is what we called peeling and I can
>> believe that in some cases it brings benefits. Normally I would like
>> to keep peeling completely orthogonal to vectorization,
>>
>>
>> If you mean the definition of "loop peeling" given at
>> http://en.wikipedia.org/wiki/Loop_splitting
>> That is not what I am talking about.
>>
>> Example code may illustrate my point. The UNROLL4 simd loop below takes
>> about 25% less time than both the Eigen MapAligned path and the elementwise
>> simd loop.
>>
>> void vector_add(float * dst, const float * src1, const float * src2, int n)
>> {
>>     int k=0;
>>     bool all_aligned = (0 == (15 & ( ptr2int(dst) | ptr2int(src1) | ptr2int(src2) ) ) );
>>     if (all_aligned) {
>> # ifdef USE_EIGEN
>>         VectorXf::MapAligned(dst,n) = VectorXf::MapAligned(src1,n) + VectorXf::MapAligned(src2,n);
>>         return; // eigen takes care of the remaining samples after alignment ends
>> # elif defined( UNROLL4 )
>>         // process 16 floats per loop
>>         for (; k+16<=n; k+=16) {
>>             __m128 a = _mm_add_ps(_mm_load_ps(src1+k),    _mm_load_ps(src2+k)    );
>>             __m128 b = _mm_add_ps(_mm_load_ps(src1+k+4),  _mm_load_ps(src2+k+4)  );
>>             __m128 c = _mm_add_ps(_mm_load_ps(src1+k+8),  _mm_load_ps(src2+k+8)  );
>>             __m128 d = _mm_add_ps(_mm_load_ps(src1+k+12), _mm_load_ps(src2+k+12) );
>>             _mm_store_ps(dst+k,    a);
>>             _mm_store_ps(dst+k+4,  b);
>>             _mm_store_ps(dst+k+8,  c);
>>             _mm_store_ps(dst+k+12, d);
>>         }
>> # else
>>         // one simd element (4 floats) at a time
>>         for (; k+4<=n; k+=4)
>>             _mm_store_ps(dst+k, _mm_add_ps(_mm_load_ps(src1+k), _mm_load_ps(src2+k)) );
>> # endif
>>     }
>>     for (; k<n; ++k)
>>         dst[k] = src1[k] + src2[k];
>> }
>>
>>
>> Test specifics:
>>
>> n = 512 floats (small enough to fit into cache, to avoid testing memory speed)
>> Core 2 CPU
>> Linux 32-bit, g++ 4.4.2
>> -DNDEBUG -O3 -msse -msse2 -msse3
>>
>> For you assembly gurus:
>>
>> the elementwise simd loop and Eigen MapAligned code both compile to code
>> like this
>>
>> .L6:
>>         # basic block 5
>>         movaps  (%eax,%ecx,4), %xmm0    #* src1, tmp110
>>         addps   (%edx,%ecx,4), %xmm0    #* src2, tmp110
>>         movaps  %xmm0, (%ebx,%ecx,4)    # tmp110,* dst
>>         addl    $4, %ecx        #, index
>>         cmpl    %ecx, %esi      # index, index
>>         jg      .L6     #,
>>
>> and the unrolled simd loop compiles to
>>
>> .L8:
>>         # basic block 5
>>         movaps  (%edi,%eax,4), %xmm3    #* src1, tmp100
>>         movaps  16(%edi,%eax,4), %xmm2  #, tmp103
>>         movaps  32(%edi,%eax,4), %xmm1  #, tmp106
>>         movaps  48(%edi,%eax,4), %xmm0  #, tmp109
>>         addps   (%esi,%eax,4), %xmm3    #* src2, tmp100
>>         addps   16(%esi,%eax,4), %xmm2  #, tmp103
>>         addps   32(%esi,%eax,4), %xmm1  #, tmp106
>>         addps   48(%esi,%eax,4), %xmm0  #, tmp109
>>         movaps  %xmm3, (%ebx,%eax,4)    # tmp100,* dst
>>         movaps  %xmm2, 16(%ebx,%eax,4)  # tmp103,
>>         movaps  %xmm1, 32(%ebx,%eax,4)  # tmp106,
>>         movaps  %xmm0, 48(%ebx,%eax,4)  # tmp109,
>>         addl    $16, %eax       #, k
>>         cmpl    %edx, %eax      # D.54688, k
>>         jne     .L8     #,
>>
>>
>>
>>
>
> Yes, I'm totally aware of this kind of optimization, and as you can
> guess we already use it a lot for the matrix products. However, doing
> so in a general fashion, i.e., for all kinds of expressions, is far
> from trivial. Ideally, we would simply partially unroll our
> evaluation loops. Let's have a look at the default unvectorized one
> (as Benoit said, this problem is completely orthogonal to
> vectorization):
>
> for(int j = 0; j < cols; ++j)
>  for(int i = 0; i < rows; ++i)
>    dst.coeffRef(i, j) = src.coeff(i,j);
>
> Here dst is the destination expression (it can be a Matrix, a
> Block<Matrix,XXX>, a Map, etc.), and src is the right-hand-side
> expression, which can mix an arbitrary number of matrices and
> sub-matrices. Note that this is a simplified version which assumes a
> column-major destination matrix. A partially unrolled version would be:
>
> for(int j = 0; j < cols; ++j)
> {
>  const int unrolledRows = rows - rows % 4;
>  for(int i = 0; i < unrolledRows; i+=4)
>  {
>    dst.coeffRef(i+0, j) = src.coeff(i+0,j);
>    dst.coeffRef(i+1, j) = src.coeff(i+1,j);
>    dst.coeffRef(i+2, j) = src.coeff(i+2,j);
>    dst.coeffRef(i+3, j) = src.coeff(i+3,j);
>  }
>  for(int i = unrolledRows; i < rows; ++i)  // remaining coefficients
>    dst.coeffRef(i, j) = src.coeff(i,j);
> }
>
> The problem is that when I tried that with GCC and ICC, even when src
> is a simple expression like a + b, neither compiler managed to
> generate optimized code: the four unrolled operations were performed
> sequentially, reusing the same two registers. In that case no gain
> can be expected. Yes, that's a pity.
>
> Trying to overcome this compiler limitation by generating our own
> optimized code would be very complicated, because src can be an
> arbitrary expression.
>
> Also note that this kind of optimization is worth it only if the
> in/out data lies in the CPU cache. This is shown by our benchmark
> (http://download.tuxfamily.org/eigen/btl-results/axpy.pdf) for Y += a
> * X. Here, Eigen and GOTO (or MKL) differ only in that GOTO (or MKL)
> partially unrolls the main loop (to be precise, only two steps are
> unrolled), and GOTO (or MKL) is faster only for vector sizes in the
> range 300-4000.
>
> Perhaps GCC 4.5 is better at that now?

Some news: I tried again with a recent Eigen and a recent GCC (4.4),
and things are a bit different. Now, naively peeling (partially
unrolling) the loop slightly improves performance for Y += a*X, though
the generated code is still not optimal. Here is the respective
evaluation loop for the vectorized case (this is a simplified
version):

for (int i=0; i<dst.size(); i+=PacketSize*2)
{
  dst.template writePacket<Aligned>(i+0, src.template packet<Aligned>(i+0));
  dst.template writePacket<Aligned>(i+PacketSize, src.template packet<Aligned>(i+PacketSize));
}

Furthermore, I managed to get an optimal code by rewriting the above
code like this:

for (int i=0; i<dst.size(); i+=PacketSize*2)
{
  Packet a0, a1; // Packet == __m128 for floats
  a0 = src.template packet<Aligned>(i+0);
  a1 = src.template packet<Aligned>(i+PacketSize);
  dst.template writePacket<Aligned>(i+0, a0);
  dst.template writePacket<Aligned>(i+PacketSize, a1);
}

Unfortunately, our current evaluation loops are a bit different: we
don't manipulate the packet/writePacket functions directly. Instead we
use a copyPacket function which does both. So the naively peeled loop
actually looks like this:

for (int i=0; i<dst.size(); i+=PacketSize*2)
{
  dst.template copyPacket<Src, Aligned, Aligned>(i+0,src);
  dst.template copyPacket<Src, Aligned, Aligned>(i+PacketSize,src);
}

and therefore we cannot do the above trick anymore.
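To make this concrete, here is a rough sketch of what copyPacket boils
down to. This is NOT the actual Eigen code; DstSketch/SrcSketch are
illustrative types of my own, assuming SSE floats. The point is that the
load from src and the store into dst are fused in a single call, so a
peeled loop calling it twice cannot hoist the two loads above the first
store:

#include <xmmintrin.h>

// sketch only: the destination exposes a fused load+store, like copyPacket
struct DstSketch
{
  float* data;
  template <typename Src>
  void copyPacket(int i, const Src& src)
  {
    // the intermediate packet never escapes this function, so the
    // caller cannot interleave several loads before the stores
    _mm_store_ps(data + i, src.packet(i));
  }
};

// sketch only: the source computes one packet of a + b
struct SrcSketch
{
  const float *a, *b;
  __m128 packet(int i) const
  { return _mm_add_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)); }
};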

Actually, this copyPacket function is a somewhat ugly trick allowing
us to reuse all our evaluation loops for swap() and some compound
assignments (like a += XXX;). For instance, swap() is implemented via
a pseudo expression overloading the copyPacket (and copyCoeff)
functions. Compound assignments are implemented the same way, via a
pseudo expression overloading copyPacket, in order to make sure that
the destination is loaded using aligned intrinsics. For instance,
think about a.segment(1,17) += XXX;. If we implemented operator+= as
a.segment(1,17) = a.segment(1,17) + XXX;, then we could not exploit
the fact that a.segment(1,17) appears on both sides, and so both sides
have the same alignment pattern.
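
To illustrate that benefit with plain intrinsics (a toy read-modify-write
kernel of my own, not Eigen's pseudo-expression machinery; the name
add_assign_sketch and the loadu choice for the right-hand side are
assumptions): because the destination is read and written at the same
index, the scalar peeling done to align the stores automatically aligns
the destination loads as well.

#include <cstdint>
#include <xmmintrin.h>

// toy sketch of "dst += rhs" with SSE: dst is read and written at the
// same index, so once the loop has peeled up to the first 16-byte
// aligned coefficient of dst, both the load and the store of dst use
// aligned intrinsics; only rhs may need unaligned loads.
void add_assign_sketch(float* dst, const float* rhs, int n)
{
  int k = 0;
  // scalar peeling until dst is 16-byte aligned
  for (; k < n && (reinterpret_cast<std::uintptr_t>(dst + k) & 15); ++k)
    dst[k] += rhs[k];
  // vectorized body
  for (; k + 4 <= n; k += 4)
    _mm_store_ps(dst + k,
                 _mm_add_ps(_mm_load_ps(dst + k), _mm_loadu_ps(rhs + k)));
  // scalar remainder
  for (; k < n; ++k)
    dst[k] += rhs[k];
}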

So to conclude, if we really want to partially unroll our loops, we
first have to find another way to implement swap and the like, so that
we can get rid of the copyCoeff/copyPacket functions.
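
One possible direction, sketched very roughly (this is not a worked-out
design and does not use the real Eigen internals; swap_sketch, the
Aligned stand-in and the fixed Packet typedef are all assumptions of
mine): give swap its own kernel written directly in terms of
packet/writePacket on both expressions, so the generic assignment loops
no longer need the fused copyPacket and can load several packets into
registers before storing them.

#include <xmmintrin.h>

enum { Aligned = 1 };     // stand-in for Eigen's alignment flag
typedef __m128 Packet;    // Packet == __m128 for floats, as above
enum { PacketSize = 4 };

// sketch only: both packets live in registers before either store,
// so this loop could itself be peeled two-by-two like the one above
template <typename Lhs, typename Rhs>
void swap_sketch(Lhs& lhs, Rhs& rhs)
{
  const int n = lhs.size();
  const int end = n - n % PacketSize;
  for (int i = 0; i < end; i += PacketSize)
  {
    Packet a = lhs.template packet<Aligned>(i);
    Packet b = rhs.template packet<Aligned>(i);
    lhs.template writePacket<Aligned>(i, b);
    rhs.template writePacket<Aligned>(i, a);
  }
  for (int i = end; i < n; ++i)   // scalar remainder
  {
    float tmp = lhs.coeff(i);
    lhs.coeffRef(i) = rhs.coeff(i);
    rhs.coeffRef(i) = tmp;
  }
}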

gael.


