2009/12/15 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>
>
> On Tue, Dec 15, 2009 at 12:52 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
> wrote:
>>
>> 2009/12/15 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>> >
>> >
>> > On Tue, Dec 15, 2009 at 5:25 AM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> To summarize recent commits: all this is now done in the development
>> >> branch, it only remains to consider backporting.
>> >>
>> >> The SSE code is 4.5x faster than my plain scalar path! I guess that's
>> >> explained not only by SSE intrinsics but also by better ordering of
>> >> instructions...
>> >>
>> >> There is one thing where I didn't follow Intel's code: they use a
>> >> RCPSS instruction to compute 1/det approximately, followed by a
>> >> Newton-Raphson iteration. This sacrifices up to 2 bits of precision in
>> >> the mantissa, which already is a bit nontrivial for us (4x4 matrix
>> >> inversion is a basic operation on which people will rely very
>> >> heavily). To help solve that dilemma (performance vs precision) I
>> >> benchmarked it, and it turns out that on my Core i7, DIVSS is
>> >> slightly faster! Intel's paper was written for the Pentium 3 so
>> >> that's perhaps not surprising, but I saw forum posts mentioning that
>> >> the RCPSS trick is still faster on the Core2. If you want to test, see
>> >> lines 128-130 in Inverse_SSE.h.
>> >>
>> >> I have a question. I currently get warnings in this code (taken
>> >> straight from Intel):
>> >>
>> >>    __m128 tmp1;
>> >>    tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)), (__m64*)(src+ 4));
>> >
>> > why don't you use tmp1 = ei_pload(src); since we know src will be
>> > aligned?
>>
>> You're right, I didn't look carefully at this line from Intel, and
>> just below, there are these lines:
>>
>>    tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src+ 2)), (__m64*)(src+ 6));
>>    row3 = _mm_loadh_pi(_mm_loadl_pi(row3, (__m64*)(src+10)), (__m64*)(src+14));
>>
>> which cannot use an ei_pload since they use non-multiple-of-16-bytes
>> offsets, and I was confused from there.
>>
>> By the way, is this trick (to load from 64-bit aligned addresses)
>> worth abstracting into a "ei_pload8" function? It's probably faster
>> than completely unaligned loads...
>
> this is more or less what we do in ei_ploadu, but using a pair of
> movsd/movhps and inline assembly to avoid GCC messing up. But currently for
> MSVC we use raw unaligned loads, so we probably should switch to a pair of
> intrinsics.
>
> I chose a pair of movsd/movhps because it appeared to be the fastest option
> for my CPU.
>
> Finally, to be clear I think you can replace these 2 lines with ei_ploadu and
> the perf. should be the same.

Actually, these lines were not equivalent to plain loads!

When you look at this,

   tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)), (__m64*)(src+ 4));

the second half is loaded from src+4, not src+2. What is being loaded
here is the top-left 2x2 corner of the matrix.

Benoit
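
To make the point about src+4 concrete, here is a minimal sketch of what
that pair of intrinsics fetches, assuming a 4x4 float matrix stored
contiguously as 16 floats. The names Block2x2 and load_2x2_block are
hypothetical, made up for this illustration; they are not Eigen or Intel
code. Zero-initialising the register and the scalar fallback for non-SSE
targets are also choices of this sketch only.

```cpp
#include <cassert>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#endif

// Hypothetical container for the four gathered floats.
struct Block2x2 {
  float v[4];
};

// Hypothetical helper: gathers src[0], src[1], src[4], src[5], i.e. the
// 2x2 block starting at src in a 4x4 matrix stored as 16 contiguous floats.
// Only 8-byte chunks are read, so src need not be 16-byte aligned.
inline Block2x2 load_2x2_block(const float* src) {
  assert(src != nullptr);
  Block2x2 out;
#ifdef SKETCH_HAVE_SSE
  // Start from a defined value instead of an uninitialised register.
  __m128 tmp = _mm_setzero_ps();
  // Low 64 bits <- src[0], src[1]
  tmp = _mm_loadl_pi(tmp, reinterpret_cast<const __m64*>(src));
  // High 64 bits <- src[4], src[5] (not src[2], src[3])
  tmp = _mm_loadh_pi(tmp, reinterpret_cast<const __m64*>(src + 4));
  _mm_storeu_ps(out.v, tmp);
#else
  // Scalar fallback with the same result, for targets without SSE.
  out.v[0] = src[0];
  out.v[1] = src[1];
  out.v[2] = src[4];
  out.v[3] = src[5];
#endif
  return out;
}
```

With a matrix filled with 0..15, load_2x2_block(m) gathers elements
0, 1, 4, 5, whereas a contiguous 4-float load from m would give 0, 1, 2, 3.
Calling it with m+2 or m+10 gathers 2, 3, 6, 7 and 10, 11, 14, 15, which
matches the offsets in the two other lines quoted above.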

