Re: [eigen] std::complex vectorization braindump
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] std::complex vectorization braindump
- From: Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
- Date: Fri, 15 Jan 2010 23:19:17 -0500
2010/1/15 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
> On Fri, Jan 15, 2010 at 9:44 AM, Gael Guennebaud
> <gael.guennebaud@xxxxxxxxx> wrote:
>> On Thu, Jan 14, 2010 at 5:49 PM, Mark Borgerding <mark@xxxxxxxxxxxxxx> wrote:
>>> On 01/13/2010 10:08 PM, Benoit Jacob wrote:
>>>
>>> 2010/1/13 Mark Borgerding <mark@xxxxxxxxxxxxxx>:
>>>
>>>
>>> I've found a certain amount of loop unrolling very beneficial to speed, even
>>> with SIMD.
>>> e.g. loading 4 SIMD registers, working on them, then storing the results can
>>> be much faster than doing them one at a time.
>>>
>>>
>>> I see. This kind of unrolling is what we called peeling and I can
>>> believe that in some cases it brings benefits. Normally I would like
>>> to keep peeling completely orthogonal to vectorization,
>>>
>>>
>>> If you mean the definition of "loop peeling" given at
>>> http://en.wikipedia.org/wiki/Loop_splitting
>>> That is not what I am talking about.
>>>
>>> Example code may illustrate my point. The UNROLL4 simd loop below takes
>>> about 25% less time than both the Eigen MapAligned path and the elementwise
>>> simd loop.
>>>
>>> void vector_add(float * dst, const float * src1, const float * src2, int n)
>>> {
>>>     int k = 0;
>>>     bool all_aligned = (0 == (15 & (ptr2int(dst) | ptr2int(src1) | ptr2int(src2))));
>>>     if (all_aligned) {
>>> # ifdef USE_EIGEN
>>>         VectorXf::MapAligned(dst,n) = VectorXf::MapAligned(src1,n)
>>>                                     + VectorXf::MapAligned(src2,n);
>>>         return; // eigen takes care of the remaining samples after alignment ends
>>> # elif defined( UNROLL4 )
>>>         // process 16 floats per loop
>>>         for (; k+16<=n; k+=16) {
>>>             __m128 a = _mm_add_ps(_mm_load_ps(src1+k),    _mm_load_ps(src2+k)   );
>>>             __m128 b = _mm_add_ps(_mm_load_ps(src1+k+4),  _mm_load_ps(src2+k+4) );
>>>             __m128 c = _mm_add_ps(_mm_load_ps(src1+k+8),  _mm_load_ps(src2+k+8) );
>>>             __m128 d = _mm_add_ps(_mm_load_ps(src1+k+12), _mm_load_ps(src2+k+12));
>>>             _mm_store_ps(dst+k,    a);
>>>             _mm_store_ps(dst+k+4,  b);
>>>             _mm_store_ps(dst+k+8,  c);
>>>             _mm_store_ps(dst+k+12, d);
>>>         }
>>> # else
>>>         // one simd element (4 floats) at a time
>>>         for (; k+4<=n; k+=4)
>>>             _mm_store_ps(dst+k, _mm_add_ps(_mm_load_ps(src1+k), _mm_load_ps(src2+k)));
>>> # endif
>>>     }
>>>     for (; k<n; ++k)
>>>         dst[k] = src1[k] + src2[k];
>>> }
>>>
>>>
>>> Test specifics:
>>>
>>> n=512 floats (small enough to fit in cache, to avoid testing memory speed)
>>> core2 cpu
>>> linux 32 bit, g++ 4.4.2
>>> -DNDEBUG -O3 -msse -msse2 -msse3
>>>
>>> For you assembly gurus:
>>>
>>> the elementwise simd loop and Eigen MapAligned code both compile to code
>>> like this
>>>
>>> .L6:
>>> # basic block 5
>>> movaps (%eax,%ecx,4), %xmm0 #* src1, tmp110
>>> addps (%edx,%ecx,4), %xmm0 #* src2, tmp110
>>> movaps %xmm0, (%ebx,%ecx,4) # tmp110,* dst
>>> addl $4, %ecx #, index
>>> cmpl %ecx, %esi # index, index
>>> jg .L6 #,
>>>
>>> and the unrolled simd loop compiles to
>>>
>>> .L8:
>>> # basic block 5
>>> movaps (%edi,%eax,4), %xmm3 #* src1, tmp100
>>> movaps 16(%edi,%eax,4), %xmm2 #, tmp103
>>> movaps 32(%edi,%eax,4), %xmm1 #, tmp106
>>> movaps 48(%edi,%eax,4), %xmm0 #, tmp109
>>> addps (%esi,%eax,4), %xmm3 #* src2, tmp100
>>> addps 16(%esi,%eax,4), %xmm2 #, tmp103
>>> addps 32(%esi,%eax,4), %xmm1 #, tmp106
>>> addps 48(%esi,%eax,4), %xmm0 #, tmp109
>>> movaps %xmm3, (%ebx,%eax,4) # tmp100,* dst
>>> movaps %xmm2, 16(%ebx,%eax,4) # tmp103,
>>> movaps %xmm1, 32(%ebx,%eax,4) # tmp106,
>>> movaps %xmm0, 48(%ebx,%eax,4) # tmp109,
>>> addl $16, %eax #, k
>>> cmpl %edx, %eax # D.54688, k
>>> jne .L8 #,
>>>
>>>
>>>
>>>
>>
>> yes, I'm totally aware of this kind of optimization and, as you can
>> guess, we already use it a lot for the matrix products. However, doing
>> so in a general fashion, i.e., for all kinds of expressions, is far
>> from trivial. Ideally, we would simply partially unroll our
>> evaluation loops. Let's have a look at the default unvectorized one
>> (as Benoit said, this problem is completely orthogonal to the
>> vectorization):
>>
>> for(int j = 0; j < cols; ++j)
>>   for(int i = 0; i < rows; ++i)
>>     dst.coeffRef(i, j) = src.coeff(i,j);
>>
>> here dst is the destination expression (it can be a Matrix, a
>> Block<Matrix,XXX>, a Map, etc.), and src is the right-hand-side
>> expression, which can mix an arbitrary number of matrices and
>> sub-matrices. Note that this is a simplified version which assumes a
>> column-major destination matrix. A partially unrolled version would be:
>>
>> for(int j = 0; j < cols; ++j)
>> {
>>   int i = 0;
>>   for(; i <= rows-4; i+=4)
>>   {
>>     dst.coeffRef(i+0, j) = src.coeff(i+0,j);
>>     dst.coeffRef(i+1, j) = src.coeff(i+1,j);
>>     dst.coeffRef(i+2, j) = src.coeff(i+2,j);
>>     dst.coeffRef(i+3, j) = src.coeff(i+3,j);
>>   }
>>   for(; i < rows; ++i) // remaining coefficients
>>     dst.coeffRef(i, j) = src.coeff(i,j);
>> }
>>
>> The problem is that when I tried that with GCC and ICC, even when src
>> is a simple expression like a + b, neither compiler managed to
>> generate optimized code: the 4 unrolled operations were performed
>> sequentially, reusing the same two registers. In that case no gain can
>> be expected. Yes, that's a pity.
>>
>> Trying to overcome this compiler limitation by generating our own
>> optimized code would be very very complicated because src can be any
>> arbitrary expression.
>>
>> Also note that this kind of optimization is only worth it if the
>> in/out data lies in the CPU cache. This is shown by our benchmark
>> (http://download.tuxfamily.org/eigen/btl-results/axpy.pdf) for Y += a
>> * X. Here, Eigen and GOTO (or MKL) differ only in that GOTO (or MKL)
>> partially unrolls the main loop (to be precise, only two steps are
>> unrolled), and GOTO (or MKL) is faster only for vector sizes in the
>> range 300-4000.
>>
>> Perhaps GCC 4.5 is better at that now ?
>
> some news: I tried again with a recent Eigen and a recent gcc (4.4),
> and things are a bit different. Now, naively peeling (partially
> unrolling) the loop slightly improves performance for Y += a*X, though
> the generated code is still not optimal. Here is the respective
> evaluation loop for the vectorized case (this is a simplified
> version):
>
> for (int i=0; i<dst.size(); i+=PacketSize*2)
> {
>   dst.template writePacket<Aligned>(i+0,          src.template packet<Aligned>(i+0));
>   dst.template writePacket<Aligned>(i+PacketSize, src.template packet<Aligned>(i+PacketSize));
> }
>
> Furthermore, I managed to get optimal code by rewriting the above
> loop like this:
>
> for (int i=0; i<dst.size(); i+=PacketSize*2)
> {
>   Packet a0, a1; // Packet == __m128 for floats
>   a0 = src.template packet<Aligned>(i+0);
>   a1 = src.template packet<Aligned>(i+PacketSize);
>   dst.template writePacket<Aligned>(i+0, a0);
>   dst.template writePacket<Aligned>(i+PacketSize, a1);
> }
>
> Unfortunately, our current evaluation loops are a bit different: we
> don't directly manipulate the packet/writePacket functions. Instead we
> use a copyPacket function which does both the load and the store. So
> the naively peeled loop actually looks like this:
>
> for (int i=0; i<dst.size(); i+=PacketSize*2)
> {
>   dst.template copyPacket<Src, Aligned, Aligned>(i+0, src);
>   dst.template copyPacket<Src, Aligned, Aligned>(i+PacketSize, src);
> }
>
> and therefore we cannot do the above trick anymore.
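To make the constraint explicit for everyone following the thread: if I
remember our code correctly, copyPacket boils down to roughly this
(simplified sketch, not the exact Eigen source):

    template<typename OtherDerived, int StoreMode, int LoadMode>
    void copyPacket(int index, const OtherDerived& other)
    {
      // one load from the source immediately followed by one store into the
      // destination: the compiler never sees two independent loads to schedule
      derived().template writePacket<StoreMode>(index,
          other.template packet<LoadMode>(index));
    }

so as long as the peeled loop is written in terms of copyPacket, the loads
and stores indeed cannot be interleaved.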
>
> Actually, this copyPacket function is a somewhat ugly trick allowing
> us to reuse all our evaluation loops for swap() and some compound
> assignments (like a += XXX;). For instance, swap() is implemented via a
> pseudo expression overloading the copyPacket (and copyCoeff) functions.
> Compound assignments are implemented the same way, via a pseudo
> expression overloading copyPacket, in order to make sure that the
> destination is loaded using aligned intrinsics. For instance, think
> about a.segment(1,17) += XXX;. If we implement operator+= as
> a.segment(1,17) = a.segment(1,17) + XXX;, then we cannot exploit the
> fact that a.segment(1,17) appears on both sides, and thus that both
> sides have the same alignment pattern.
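And for reference, the += pseudo expression is conceptually something like
the following; the names are hypothetical and this is only meant to
illustrate why copyPacket is the convenient hook (ei_padd being the
packet-wise add):

    // hypothetical wrapper through which "dst += src" reuses the normal
    // assignment loops; the real Eigen class differs, this only shows the idea
    template<typename DstDerived>
    struct AddAssignWrapper
    {
      DstDerived& m_dst;

      template<typename OtherDerived, int StoreMode, int LoadMode>
      void copyPacket(int index, const OtherDerived& other)
      {
        // the load and the store on the destination share the same mode, so
        // the alignment of m_dst is exploited on both sides of the +=
        m_dst.template writePacket<StoreMode>(index,
            ei_padd(m_dst.template packet<StoreMode>(index),
                    other.template packet<LoadMode>(index)));
      }
    };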
>
> So to conclude, if we really want to partially unroll our loops, we
> first have to find another way to implement swap() and the like, in
> order to get rid of the copyCoeff/copyPacket functions.
I see; then why not add methods:
copyCoeffs
copyPackets
---> notice the 's' for plural.
Indeed, if the problem is that you need to reorder the instructions
among N copyPacket calls, then the solution should be to introduce a
copyPackets<N> method...?
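Concretely, I am thinking of something along these lines; just a sketch to
fix ideas (the real thing would also need a copyCoeffs counterpart and some
care for the tail of the loop):

    // hypothetical method, mirroring Gael's hand-written "optimal" loop above:
    // perform the N loads first, then the N stores, so they can be interleaved
    template<int N, typename OtherDerived, int StoreMode, int LoadMode>
    void copyPackets(int index, const OtherDerived& other)
    {
      Packet tmp[N];
      for (int k=0; k<N; ++k)
        tmp[k] = other.template packet<LoadMode>(index + k*PacketSize);
      for (int k=0; k<N; ++k)
        derived().template writePacket<StoreMode>(index + k*PacketSize, tmp[k]);
    }

and the peeled evaluation loop would then make one copyPackets<2,...> call
per iteration instead of two copyPacket calls.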
Benoit
>
> gael.
>
>
>