Re: [eigen] std::complex vectorization braindump



2010/1/15 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
> On Fri, Jan 15, 2010 at 9:44 AM, Gael Guennebaud
> <gael.guennebaud@xxxxxxxxx> wrote:
>> On Thu, Jan 14, 2010 at 5:49 PM, Mark Borgerding <mark@xxxxxxxxxxxxxx> wrote:
>>> On 01/13/2010 10:08 PM, Benoit Jacob wrote:
>>>
>>> 2010/1/13 Mark Borgerding <mark@xxxxxxxxxxxxxx>:
>>>
>>>
>>> I've found a certain amount of loop unrolling very beneficial to speed, even
>>> with SIMD.
>>> e.g. loading 4 SIMD registers, working on them, then storing the results can
>>> be much faster than doing them one at a time.
>>>
>>>
>>> I see. This kind of unrolling is what we call peeling, and I can
>>> believe that in some cases it brings benefits. Normally I would like
>>> to keep peeling completely orthogonal to vectorization,
>>>
>>>
>>> If you mean the definition of "loop peeling" given at
>>> http://en.wikipedia.org/wiki/Loop_splitting
>>> That is not what I am talking about.
>>>
>>> Example code may illustrate my point. The UNROLL4 simd loop below takes
>>> about 25% less time than both the Eigen MapAligned path and the elementwise
>>> simd loop.
>>>
>>> void vector_add(float * dst, const float * src1, const float * src2, int n)
>>> {
>>>     int k=0;
>>>     bool all_aligned = (0 == (15 & ( ptr2int(dst) | ptr2int(src1) | ptr2int(src2) ) ) );
>>>     if (all_aligned) {
>>> # ifdef USE_EIGEN
>>>         VectorXf::MapAligned(dst,n) = VectorXf::MapAligned(src1,n) + VectorXf::MapAligned(src2,n);
>>>         return; // eigen takes care of the remaining samples after alignment ends
>>> # elif defined( UNROLL4 )
>>>         // process 16 floats per loop
>>>         for (; k+16<=n; k+=16) {
>>>             __m128 a = _mm_add_ps(_mm_load_ps(src1+k),    _mm_load_ps(src2+k)   );
>>>             __m128 b = _mm_add_ps(_mm_load_ps(src1+k+4),  _mm_load_ps(src2+k+4) );
>>>             __m128 c = _mm_add_ps(_mm_load_ps(src1+k+8),  _mm_load_ps(src2+k+8) );
>>>             __m128 d = _mm_add_ps(_mm_load_ps(src1+k+12), _mm_load_ps(src2+k+12));
>>>             _mm_store_ps(dst+k,    a);
>>>             _mm_store_ps(dst+k+4,  b);
>>>             _mm_store_ps(dst+k+8,  c);
>>>             _mm_store_ps(dst+k+12, d);
>>>         }
>>> # else
>>>         // one simd element (4 floats) at a time
>>>         for (; k+4<=n; k+=4)
>>>             _mm_store_ps(dst+k, _mm_add_ps(_mm_load_ps(src1+k), _mm_load_ps(src2+k)));
>>> # endif
>>>     }
>>>     for (; k<n; ++k)
>>>         dst[k] = src1[k] + src2[k];
>>> }
>>>
>>>
>>> Test specifics:
>>>
>>> n = 512 floats (small enough to fit in cache, to avoid testing memory speed)
>>> Core 2 CPU
>>> Linux 32-bit, g++ 4.4.2
>>> -DNDEBUG -O3 -msse -msse2 -msse3
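
For context, a minimal timing harness sketch for reproducing this kind of in-cache micro-benchmark; the aligned allocation via _mm_malloc, the repetition count and the use of clock() are assumptions, not part of the original test code:

#include <xmmintrin.h>   // SSE intrinsics and _mm_malloc/_mm_free
#include <cstdio>
#include <ctime>

void vector_add(float* dst, const float* src1, const float* src2, int n); // quoted above

int main()
{
    const int n = 512, reps = 100000;
    float* dst  = (float*)_mm_malloc(n * sizeof(float), 16);
    float* src1 = (float*)_mm_malloc(n * sizeof(float), 16);
    float* src2 = (float*)_mm_malloc(n * sizeof(float), 16);
    for (int i = 0; i < n; ++i) { src1[i] = float(i); src2[i] = 2.0f * i; }

    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r)
        vector_add(dst, src1, src2, n);
    clock_t t1 = clock();

    // print a result so the stores cannot be optimized away
    printf("%g s for %d calls, dst[0]=%g\n",
           double(t1 - t0) / CLOCKS_PER_SEC, reps, dst[0]);

    _mm_free(dst); _mm_free(src1); _mm_free(src2);
    return 0;
}
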
>>>
>>> For you assembly gurus:
>>>
>>> the elementwise simd loop and Eigen MapAligned code both compile to code
>>> like this
>>>
>>> .L6:
>>>         # basic block 5
>>>         movaps  (%eax,%ecx,4), %xmm0    #* src1, tmp110
>>>         addps   (%edx,%ecx,4), %xmm0    #* src2, tmp110
>>>         movaps  %xmm0, (%ebx,%ecx,4)    # tmp110,* dst
>>>         addl    $4, %ecx        #, index
>>>         cmpl    %ecx, %esi      # index, index
>>>         jg      .L6     #,
>>>
>>> and the unrolled simd loop compiles to
>>>
>>> .L8:
>>>         # basic block 5
>>>         movaps  (%edi,%eax,4), %xmm3    #* src1, tmp100
>>>         movaps  16(%edi,%eax,4), %xmm2  #, tmp103
>>>         movaps  32(%edi,%eax,4), %xmm1  #, tmp106
>>>         movaps  48(%edi,%eax,4), %xmm0  #, tmp109
>>>         addps   (%esi,%eax,4), %xmm3    #* src2, tmp100
>>>         addps   16(%esi,%eax,4), %xmm2  #, tmp103
>>>         addps   32(%esi,%eax,4), %xmm1  #, tmp106
>>>         addps   48(%esi,%eax,4), %xmm0  #, tmp109
>>>         movaps  %xmm3, (%ebx,%eax,4)    # tmp100,* dst
>>>         movaps  %xmm2, 16(%ebx,%eax,4)  # tmp103,
>>>         movaps  %xmm1, 32(%ebx,%eax,4)  # tmp106,
>>>         movaps  %xmm0, 48(%ebx,%eax,4)  # tmp109,
>>>         addl    $16, %eax       #, k
>>>         cmpl    %edx, %eax      # D.54688, k
>>>         jne     .L8     #,
>>>
>>
>> Yes, I'm totally aware of this kind of optimization and, as you can
>> guess, we already use it a lot for the matrix products. However, doing
>> so in a general fashion, i.e., for all kinds of expressions, is far
>> from trivial. Ideally, we would simply partially unroll our
>> evaluation loops. Let's have a look at the default unvectorized one
>> (as Benoit said, this problem is completely orthogonal to
>> vectorization):
>>
>> for(int j = 0; j < cols; ++j)
>>  for(int i = 0; i < rows; ++i)
>>    dst.coeffRef(i, j) = src.coeff(i,j);
>>
>> Here dst is the destination expression (it can be a Matrix, a
>> Block<Matrix,XXX>, a Map, etc.), and src is the right-hand-side
>> expression, which can mix an arbitrary number of matrices and
>> sub-matrices. Note that this is a simplified version which assumes a
>> column-major destination matrix. A partially unrolled version would be:
>>
>> for(int j = 0; j < cols; ++j)
>> {
>>  int i = 0;
>>  for(; i+4 <= rows; i+=4)
>>  {
>>    dst.coeffRef(i+0, j) = src.coeff(i+0,j);
>>    dst.coeffRef(i+1, j) = src.coeff(i+1,j);
>>    dst.coeffRef(i+2, j) = src.coeff(i+2,j);
>>    dst.coeffRef(i+3, j) = src.coeff(i+3,j);
>>  }
>>  for(; i < rows; ++i)  // remaining coefficients
>>    dst.coeffRef(i, j) = src.coeff(i,j);
>> }
>>
>> The problem is that when I tried that with GCC and ICC, even if src
>> is a simple expression like a + b, neither compiler managed to
>> generate optimized code: the 4 unrolled operations were performed
>> sequentially, reusing the same two registers. In that case no gain
>> can be expected. Yes, that's a pity.
>>
>> Trying to overcome this compiler limitation by generating our own
>> optimized code would be very, very complicated, because src can be
>> an arbitrary expression.
>>
>> Also note that this kind of optimization is worth it only if the
>> in/out data lies in the CPU cache. This is shown by our benchmark
>> (http://download.tuxfamily.org/eigen/btl-results/axpy.pdf) for Y += a
>> * X. Here, Eigen and GOTO (or MKL) differ only in that GOTO (or MKL)
>> partially unrolls the main loop (to be precise, only two steps are
>> unrolled), and GOTO (or MKL) is faster only for vector sizes in the
>> range 300-4000.
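
For illustration only, a sketch of what a two-step-unrolled SSE kernel for Y += a*X could look like (the function name and the assumptions that the data is 16-byte aligned and n is a multiple of 8 are mine; this is not GOTO's or MKL's actual code):

#include <xmmintrin.h>

void axpy_unroll2(float* y, const float* x, float a, int n)
{
    __m128 pa = _mm_set1_ps(a);              // broadcast the scalar a
    for (int k = 0; k + 8 <= n; k += 8)      // two packets per iteration
    {
        __m128 y0 = _mm_add_ps(_mm_load_ps(y + k),     _mm_mul_ps(pa, _mm_load_ps(x + k)));
        __m128 y1 = _mm_add_ps(_mm_load_ps(y + k + 4), _mm_mul_ps(pa, _mm_load_ps(x + k + 4)));
        _mm_store_ps(y + k,     y0);
        _mm_store_ps(y + k + 4, y1);
    }
}
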
>>
>> Perhaps GCC 4.5 is better at that now?
>
> Some news: I tried again with a recent Eigen and a recent gcc (4.4),
> and things are a bit different. Now, naively peeling (partially
> unrolling) the loop slightly improves performance for Y += a*X, though
> the generated code is still not optimal. Here is the respective
> evaluation loop for the vectorized case (this is a simplified
> version):
>
> for (int i=0; i<dst.size(); i+=PacketSize*2)
> {
>  dst.template writePacket<Aligned>(i+0, src.template packet<Aligned>(i+0));
>  dst.template writePacket<Aligned>(i+PacketSize, src.template packet<Aligned>(i+PacketSize));
> }
>
> Furthermore, I managed to get optimal code by rewriting the above
> code like this:
>
> for (int i=0; i<dst.size(); i+=PacketSize*2)
> {
>  Packet a0, a1; // Packet == __m128 for floats
>  a0 = src.template packet<Aligned>(i+0);
>  a1 = src.template packet<Aligned>(i+PacketSize);
>  dst.template writePacket<Aligned>(i+0, a0);
>  dst.template writePacket<Aligned>(i+PacketSize, a1);
> }
>
> Unfortunately, our current evaluation loops are a bit different. We
> don't directly manipulate the packet/writePacket functions. Instead we
> use a copyPacket function which does both. So the naively peeled loop
> actually looks like this:
>
> for (int i=0; i<dst.size(); i+=PacketSize*2)
> {
>  dst.template copyPacket<Src, Aligned, Aligned>(i+0,src);
>  dst.template copyPacket<Src, Aligned, Aligned>(i+PacketSize,src);
> }
>
> and therefore we cannot do the above trick anymore.
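
To make the constraint concrete, here is a toy sketch (the ToyMap type and the two loops are illustrative, not Eigen's actual implementation): with separate packet()/writePacket() calls the two loads can be grouped before the two stores, whereas a fused copyPacket()-style call ties each load to its store.

#include <xmmintrin.h>

struct ToyMap
{
    float* data;
    __m128 packet(int i) const          { return _mm_load_ps(data + i); }
    void   writePacket(int i, __m128 p) { _mm_store_ps(data + i, p); }
    // read and write in a single call, like the copyPacket trick described above
    void   copyPacket(int i, const ToyMap& src) { writePacket(i, src.packet(i)); }
};

// loads grouped before stores: the shape that gave optimal code above
void copy_separated(ToyMap dst, ToyMap src, int n)
{
    for (int i = 0; i + 8 <= n; i += 8)
    {
        __m128 a0 = src.packet(i);
        __m128 a1 = src.packet(i + 4);
        dst.writePacket(i,     a0);
        dst.writePacket(i + 4, a1);
    }
}

// load and store fused per packet: the shape the copyPacket-based loops force
void copy_fused(ToyMap dst, ToyMap src, int n)
{
    for (int i = 0; i + 8 <= n; i += 8)
    {
        dst.copyPacket(i,     src);
        dst.copyPacket(i + 4, src);
    }
}
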
>
> Actually, this copyPacket function is a somewhat ugly trick allowing
> us to reuse all our evaluation loops for swap() and some compound
> assignments (like a += XXX;). For instance, swap() is implemented via
> a pseudo expression overloading the copyPacket (and copyCoeff)
> functions. Compound assignments are implemented the same way, via a
> pseudo expression overloading copyPacket, in order to make sure that
> the destination is loaded using aligned intrinsics. For instance,
> think about a.segment(1,17) += XXX;. If we implement operator+= as
> a.segment(1,17) = a.segment(1,17) + XXX; then we cannot exploit the
> fact that a.segment(1,17) appears on both sides, and thus that both
> sides have the same alignment pattern.
>
> So to conclude, if we really want to partially unroll our loops, we
> first have to find another way to implement swap() and the like, to
> get rid of the copyCoeff/copyPacket functions.

I see; then why not add methods:
  copyCoeffs
  copyPackets
---> notice the 's' for plural.

Indeed, if the problem is that you need to reorder the instructions
among N copyPacket calls, then the solution could be to introduce a
copyPackets<N> method...?
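
A rough sketch of what such a copyPackets<N> member could look like (hypothetical, not an existing Eigen API; the ToyMap type is just an illustration): load the N packets into local registers first, then store them, so the compiler is free to keep them in distinct XMM registers.

#include <xmmintrin.h>

struct ToyMap
{
    float* data;
    __m128 packet(int i) const          { return _mm_load_ps(data + i); }
    void   writePacket(int i, __m128 p) { _mm_store_ps(data + i, p); }

    // hypothetical copyPackets<N>: all loads first, then all stores;
    // the small fixed-count loops are fully unrolled by the compiler
    template<int N>
    void copyPackets(int i, const ToyMap& src)
    {
        __m128 regs[N];
        for (int k = 0; k < N; ++k)
            regs[k] = src.packet(i + 4 * k);
        for (int k = 0; k < N; ++k)
            writePacket(i + 4 * k, regs[k]);
    }
};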

Benoit


>
> gael.
>


