Re: [eigen] std::complex vectorization braindump



On Sat, Jan 16, 2010 at 5:19 AM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> 2010/1/15 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>> On Fri, Jan 15, 2010 at 9:44 AM, Gael Guennebaud
>> <gael.guennebaud@xxxxxxxxx> wrote:
>>> On Thu, Jan 14, 2010 at 5:49 PM, Mark Borgerding <mark@xxxxxxxxxxxxxx> wrote:
>>>> On 01/13/2010 10:08 PM, Benoit Jacob wrote:
>>>>
>>>> 2010/1/13 Mark Borgerding <mark@xxxxxxxxxxxxxx>:
>>>>
>>>>
>>>> I've found a certain amount of loop unrolling very beneficial to speed, even
>>>> with SIMD.
>>>> e.g. loading 4 SIMD registers, working on them, then storing the results can
>>>> be much faster than doing them one at a time.
>>>>
>>>>
>>>> I see. This kind of unrolling is what we call peeling, and I can
>>>> believe that in some cases it brings benefits. Normally I would like
>>>> to keep peeling completely orthogonal to vectorization,
>>>>
>>>>
>>>> If you mean the definition of "loop peeling" given at
>>>> http://en.wikipedia.org/wiki/Loop_splitting, that is not what I am
>>>> talking about.
>>>>
>>>> Example code may illustrate my point. The UNROLL4 simd loop below takes
>>>> about 25% less time than both the Eigen MapAligned path and the elementwise
>>>> simd loop.
>>>>
>>>> void vector_add(float * dst, const float * src1, const float * src2, int n)
>>>> {
>>>>     int k=0;
>>>>     bool all_aligned = (0 == (15 & ( ptr2int(dst) | ptr2int(src1) | ptr2int(src2) ) ) );
>>>>     if (all_aligned) {
>>>> # ifdef USE_EIGEN
>>>>         VectorXf::MapAligned(dst,n) = VectorXf::MapAligned(src1,n) + VectorXf::MapAligned(src2,n);
>>>>         return; // eigen takes care of the remaining samples after alignment ends
>>>> # elif defined( UNROLL4 )
>>>>         // process 16 floats per loop
>>>>         for (; k+16<=n; k+=16) {
>>>>             __m128 a = _mm_add_ps(_mm_load_ps(src1+k),    _mm_load_ps(src2+k)    );
>>>>             __m128 b = _mm_add_ps(_mm_load_ps(src1+k+4),  _mm_load_ps(src2+k+4)  );
>>>>             __m128 c = _mm_add_ps(_mm_load_ps(src1+k+8),  _mm_load_ps(src2+k+8)  );
>>>>             __m128 d = _mm_add_ps(_mm_load_ps(src1+k+12), _mm_load_ps(src2+k+12) );
>>>>             _mm_store_ps(dst+k,    a);
>>>>             _mm_store_ps(dst+k+4,  b);
>>>>             _mm_store_ps(dst+k+8,  c);
>>>>             _mm_store_ps(dst+k+12, d);
>>>>         }
>>>> # else
>>>>         // one simd element (4 floats) at a time
>>>>         for (; k+4<=n; k+=4)
>>>>             _mm_store_ps(dst+k, _mm_add_ps(_mm_load_ps(src1+k), _mm_load_ps(src2+k)) );
>>>> # endif
>>>>     }
>>>>     for (; k<n; ++k)
>>>>         dst[k] = src1[k] + src2[k];
>>>> }
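>>>>
>>>> (ptr2int here is presumably just a pointer-to-integer cast used for
>>>> the 16-byte alignment test; a minimal sketch of such a helper:)
>>>>
>>>> #include <stdint.h>
>>>> static inline uintptr_t ptr2int(const void * p)
>>>> {
>>>>     // reinterpret the address as an unsigned integer so the low
>>>>     // 4 bits can be tested for 16-byte alignment
>>>>     return reinterpret_cast<uintptr_t>(p);
>>>> }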
>>>>
>>>>
>>>> test specifics
>>>>
>>>> n=512 floats ( small enough to fit into cache to avoid testing memory speed)
>>>> core2 cpu
>>>> linux 32 bit g++ 4.4.2
>>>> -DNDEBUG  -O3 -msse -msse2 -msse3
>>>>
>>>> For you assembly gurus:
>>>>
>>>> the elementwise simd loop and Eigen MapAligned code both compile to code
>>>> like this
>>>>
>>>> .L6:
>>>>         # basic block 5
>>>>         movaps  (%eax,%ecx,4), %xmm0    #* src1, tmp110
>>>>         addps   (%edx,%ecx,4), %xmm0    #* src2, tmp110
>>>>         movaps  %xmm0, (%ebx,%ecx,4)    # tmp110,* dst
>>>>         addl    $4, %ecx        #, index
>>>>         cmpl    %ecx, %esi      # index, index
>>>>         jg      .L6     #,
>>>>
>>>> and the unrolled simd loop compiles to
>>>>
>>>> .L8:
>>>>         # basic block 5
>>>>         movaps  (%edi,%eax,4), %xmm3    #* src1, tmp100
>>>>         movaps  16(%edi,%eax,4), %xmm2  #, tmp103
>>>>         movaps  32(%edi,%eax,4), %xmm1  #, tmp106
>>>>         movaps  48(%edi,%eax,4), %xmm0  #, tmp109
>>>>         addps   (%esi,%eax,4), %xmm3    #* src2, tmp100
>>>>         addps   16(%esi,%eax,4), %xmm2  #, tmp103
>>>>         addps   32(%esi,%eax,4), %xmm1  #, tmp106
>>>>         addps   48(%esi,%eax,4), %xmm0  #, tmp109
>>>>         movaps  %xmm3, (%ebx,%eax,4)    # tmp100,* dst
>>>>         movaps  %xmm2, 16(%ebx,%eax,4)  # tmp103,
>>>>         movaps  %xmm1, 32(%ebx,%eax,4)  # tmp106,
>>>>         movaps  %xmm0, 48(%ebx,%eax,4)  # tmp109,
>>>>         addl    $16, %eax       #, k
>>>>         cmpl    %edx, %eax      # D.54688, k
>>>>         jne     .L8     #,
>>>>
>>>>
>>>>
>>>>
>>>
>>> Yes, I'm totally aware of this kind of optimization and, as you can
>>> guess, we already use it a lot for the matrix products. However, doing
>>> so in a general fashion, i.e., for all kinds of expressions, is far
>>> from trivial. Ideally, we would simply partially unroll our
>>> evaluation loops. Let's have a look at the default unvectorized one
>>> (as Benoit said, this problem is completely orthogonal to
>>> vectorization):
>>>
>>> for(int j = 0; j < cols; ++j)
>>>  for(int i = 0; i < rows; ++i)
>>>    dst.coeffRef(i, j) = src.coeff(i,j);
>>>
>>> Here dst is the destination expression (it can be a Matrix, a
>>> Block<Matrix,XXX>, a Map, etc.), and src is the right-hand-side
>>> expression, which can mix an arbitrary number of matrices and
>>> sub-matrices. Note that this is a simplified version which assumes a
>>> column-major destination matrix. A partially unrolled version would be:
>>>
>>> for(int j = 0; j < cols; ++j)
>>> {
>>>  int i = 0;
>>>  for(; i+4 <= rows; i+=4)
>>>  {
>>>    dst.coeffRef(i+0, j) = src.coeff(i+0,j);
>>>    dst.coeffRef(i+1, j) = src.coeff(i+1,j);
>>>    dst.coeffRef(i+2, j) = src.coeff(i+2,j);
>>>    dst.coeffRef(i+3, j) = src.coeff(i+3,j);
>>>  }
>>>  for(; i < rows; ++i) // remaining coeffs
>>>    dst.coeffRef(i, j) = src.coeff(i,j);
>>> }
>>>
>>> The problem is that when I tried that with GCC and ICC, even when src
>>> is a simple expression like a + b, neither compiler managed to
>>> generate optimized code: the 4 unrolled operations were performed
>>> sequentially, reusing the same two registers. In that case no gain can
>>> be expected. Yes, that's a pity.
>>>
>>> Trying to overcome this compiler limitation by generating our own
>>> optimized code would be very very complicated because src can be any
>>> arbitrary expression.
>>>
>>> Also note that this kind of optimization is only worth it if the
>>> in/out data lies in the CPU cache. This is shown by our benchmark
>>> (http://download.tuxfamily.org/eigen/btl-results/axpy.pdf) for
>>> Y += a * X. Here, Eigen and GOTO (or MKL) differ only in the fact
>>> that GOTO (or MKL) partially unrolls the main loop (to be precise,
>>> only two steps are unrolled), and GOTO (or MKL) is faster only for
>>> vector sizes in the range 300-4000.
>>>
>>> Perhaps GCC 4.5 is better at that now?
>>
>> Some news: I tried again with a recent Eigen and a recent gcc (4.4),
>> and things are a bit different. Now, naively peeling (partially
>> unrolling) the loop slightly improves performance for Y += a*X, though
>> the generated code is still not optimal. Here is the respective
>> evaluation loop for the vectorized case (this is a simplified
>> version):
>>
>> for (int i=0; i<dst.size(); i+=PacketSize*2)
>> {
>>   dst.template writePacket<Aligned>(i+0,          src.template packet<Aligned>(i+0));
>>   dst.template writePacket<Aligned>(i+PacketSize, src.template packet<Aligned>(i+PacketSize));
>> }
>>
>> Furthermore, I managed to get optimal code by rewriting the above
>> code like this:
>>
>> for (int i=0; i<dst.size(); i+=PacketSize*2)
>> {
>>   Packet a0, a1; // Packet == __m128 for floats
>>   a0 = src.template packet<Aligned>(i+0);
>>   a1 = src.template packet<Aligned>(i+PacketSize);
>>   dst.template writePacket<Aligned>(i+0, a0);
>>   dst.template writePacket<Aligned>(i+PacketSize, a1);
>> }
>>
>> Unfortunately, our current evaluation loops are a bit different: we
>> don't manipulate the packet/writePacket functions directly. Instead we
>> use a copyPacket function which does both, so the naively peeled loop
>> actually looks like this:
>>
>> for (int i=0; i<dst.size(); i+=PacketSize*2)
>> {
>>  dst.template copyPacket<Src, Aligned, Aligned>(i+0,src);
>>  dst.template copyPacket<Src, Aligned, Aligned>(i+PacketSize,src);
>> }
>>
>> and therefore we cannot do the above trick anymore.
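>>
>> (For illustration only: a tiny standalone sketch, assuming SSE floats
>> and hypothetical free functions rather than the actual Eigen members,
>> of the difference between the fused form that copyPacket gives us and
>> the split packet/writePacket form that allows the reordering.)
>>
>> #include <xmmintrin.h>
>>
>> // fused form: one call does load + store; for two consecutive calls
>> // the compiler typically won't hoist the second load above the first
>> // store, since it cannot prove dst does not alias the sources
>> inline void copy_packet_fused(float* dst, const float* a, const float* b, int i)
>> {
>>     _mm_store_ps(dst + i, _mm_add_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)));
>> }
>>
>> // split form: the caller can issue several loads before the matching
>> // stores, which is exactly what the hand-peeled loop above does
>> inline __m128 load_packet(const float* a, const float* b, int i)
>> {
>>     return _mm_add_ps(_mm_load_ps(a + i), _mm_load_ps(b + i));
>> }
>> inline void store_packet(float* dst, int i, __m128 p)
>> {
>>     _mm_store_ps(dst + i, p);
>> }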
>>
>> Actually, this copyPacket function is a somewhat ugly trick allowing
>> us to reuse all our evaluation loops for swap() and some compound
>> assignments (like a += XXX;). For instance, swap() is implemented via
>> a pseudo expression overloading the copyPacket (and copyCoeff)
>> functions. Compound assignments are implemented the same way, via a
>> pseudo expression overloading copyPacket, in order to make sure that
>> the destination is loaded using aligned intrinsics. For instance,
>> think about a.segment(1,17) += XXX;. If we implement operator+= as
>> a.segment(1,17) = a.segment(1,17) + XXX;, then we cannot exploit the
>> fact that a.segment(1,17) appears on both sides, and therefore that
>> both sides have the same alignment pattern.
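>>
>> (Again, only a rough standalone sketch of the pseudo-expression idea,
>> with hypothetical names and SSE floats; this is not the actual Eigen
>> implementation.)
>>
>> #include <xmmintrin.h>
>>
>> // Wrapper around the destination whose copyPacket does
>> // "aligned load dst, add rhs packet, aligned store dst", so that the
>> // destination is read and written with the same alignment pattern.
>> struct AddAssignWrapper
>> {
>>     float* data; // destination buffer, aligned at the packet offsets we visit
>>
>>     template<typename Rhs>
>>     void copyPacket(int index, const Rhs& rhs)
>>     {
>>         __m128 d = _mm_load_ps(data + index);           // aligned read of dst
>>         __m128 s = rhs.packet(index);                   // evaluate rhs packet
>>         _mm_store_ps(data + index, _mm_add_ps(d, s));   // aligned write back
>>     }
>> };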
>>
>> So to conclude, if we really want to partially unroll our loops, we
>> first have to find another way to implement swap() and the like, so
>> that we can get rid of the copyCoeff/copyPacket functions.
>
> I see; then why not add methods:
>  copyCoeffs
>  copyPackets
> ---> notice the 's' for plural.
>
> Indeed, if the problem is that you need to reorder the instructions
> among N copyPacket calls, then the solution should be to introduce a
> copyPackets<N> method...?

yes that could work.
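
Something along these lines maybe (just a rough sketch of the idea,
using the same packet/writePacket notation as above, not tested):

template<typename Src, int StoreMode, int LoadMode, int UnrollCount>
void copyPackets(int index, const Src& src)
{
  Packet tmp[UnrollCount];
  // issue all the loads first...
  for (int k=0; k<UnrollCount; ++k)
    tmp[k] = src.template packet<LoadMode>(index + k*PacketSize);
  // ...then all the stores, so the compiler is free to keep
  // UnrollCount distinct registers instead of reusing the same one
  for (int k=0; k<UnrollCount; ++k)
    this->template writePacket<StoreMode>(index + k*PacketSize, tmp[k]);
}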

gael.

>
> Benoit
>
>
>>
>> gael.
>>
>>
>>
>
>
>


