Re: [eigen] std::complex vectorization braindump
On 01/13/2010 10:08 PM, Benoit Jacob wrote:
> 2010/1/13 Mark Borgerding <mark@xxxxxxxxxxxxxx>:
>> I've found a certain amount of loop unrolling very beneficial to speed, even with SIMD. E.g. loading 4 SIMD registers, working on them, then storing the results can be much faster than doing them one at a time.
>
> I see. This kind of unrolling is what we called peeling and I can believe that in some cases it brings benefits. Normally I would like to keep peeling completely orthogonal to vectorization,

If you mean the definition of "loop peeling" given at http://en.wikipedia.org/wiki/Loop_splitting, that is not what I am talking about.

Example code may illustrate my point. The UNROLL4 simd loop below takes about 25% less time than both the Eigen MapAligned path and the elementwise simd loop.

    void vector_add(float * dst,const float * src1,const float * src2,int n)
    {
        int k=0;
        bool all_aligned = (0 == (15 & ( ptr2int(dst) | ptr2int(src1) | ptr2int(src2) ) ) );
        if (all_aligned) {
    # ifdef USE_EIGEN
            VectorXf::MapAligned(dst,n) = VectorXf::MapAligned(src1,n) + VectorXf::MapAligned(src2,n);
            return; // eigen takes care of the remaining samples after alignment ends
    # elif defined( UNROLL4 )
            // process 16 floats per loop
            for (; k+16<=n;k+=16) {
                __m128 a = _mm_add_ps(_mm_load_ps(src1+k),_mm_load_ps(src2+k) );
                __m128 b = _mm_add_ps(_mm_load_ps(src1+k+4),_mm_load_ps(src2+k+4) );
                __m128 c = _mm_add_ps(_mm_load_ps(src1+k+8),_mm_load_ps(src2+k+8) );
                __m128 d = _mm_add_ps(_mm_load_ps(src1+k+12),_mm_load_ps(src2+k+12) );
                _mm_store_ps(dst+k, a);
                _mm_store_ps(dst+k+4,b);
                _mm_store_ps(dst+k+8, c);
                _mm_store_ps(dst+k+12, d);
            }
    # else
            // one simd element ( 4 floats) at a time
            for (; k+4<=n;k+=4)
                _mm_store_ps(dst+k,_mm_add_ps(_mm_load_ps(src1+k),_mm_load_ps(src2+k) ) );
    # endif
        }
        for (;k<n;++k)
            dst[k] = src1[k] + src2[k];
    }

test specifics
For you assembly gurus: the elementwise simd loop and Eigen MapAligned code both compile to code like this and the unrolled simd loop compiles to

    .L6:
    .L8:
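
The alignment test at the top of vector_add relies on a ptr2int helper that is not shown in the post. A minimal sketch of how that check could work, assuming ptr2int simply casts a pointer to an unsigned integer (the name all_aligned16 is hypothetical and only used here for illustration):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the ptr2int helper used in the post (assumption):
// converts a pointer to an unsigned integer so its low bits can be inspected.
inline std::uintptr_t ptr2int(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p);
}

// True when all three pointers are 16-byte aligned.
// OR-ing the addresses merges their low bits, so a single mask with 15
// (binary 1111) tests all three pointers at once.
inline bool all_aligned16(const float* dst, const float* src1, const float* src2) {
    return 0 == (15 & (ptr2int(dst) | ptr2int(src1) | ptr2int(src2)));
}
```

If any one of the three pointers is offset by a single float (4 bytes), the check fails and vector_add skips the aligned SIMD branch, leaving the whole range to the scalar loop at the end.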