Re: [eigen] Re: 4x4 matrix inverse



2009/12/15 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>
>
> On Tue, Dec 15, 2009 at 12:52 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
> wrote:
>>
>> 2009/12/15 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>> >
>> >
>> > On Tue, Dec 15, 2009 at 5:25 AM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> To summarize recent commits: all this is now done in the development
>> >> branch; it only remains to consider backporting.
>> >>
>> >> The SSE code is 4.5x faster than my plain scalar path! I guess that's
>> >> explained not only by SSE intrinsics but also by better ordering of
>> >> instructions...
>> >>
>> >> There is one thing where I didn't follow Intel's code: they use an
>> >> RCPSS instruction to compute 1/det approximately, followed by a
>> >> Newton-Raphson iteration. This sacrifices up to 2 bits of precision in
>> >> the mantissa, which is already a bit nontrivial for us (4x4 matrix
>> >> inversion is a basic operation on which people will rely very
>> >> heavily). To help settle that dilemma (performance vs precision) I
>> >> benchmarked it, and it turns out that on my Core i7, DIVSS is
>> >> slightly faster!! Intel's paper was written for the Pentium 3, so
>> >> that's perhaps not surprising, but I saw forum posts mentioning that
>> >> the RCPSS trick is still faster on the Core2. If you want to test, see
>> >> lines 128-130 in Inverse_SSE.h.
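
For reference, the two alternatives compared above look roughly like this
(a sketch only, not the actual Inverse_SSE.h code; the function names are
made up for illustration):

   #include <xmmintrin.h>

   // Approximate 1/det with RCPSS, then refine it with one Newton-Raphson
   // step: x1 = x0 * (2 - det*x0), i.e. 2*x0 - det*x0*x0.
   static inline __m128 reciprocal_rcpss_nr(__m128 det)
   {
     __m128 x0 = _mm_rcp_ss(det);                 // ~12-bit approximation
     return _mm_sub_ss(_mm_add_ss(x0, x0),        // 2*x0
                       _mm_mul_ss(det, _mm_mul_ss(x0, x0)));
   }

   // The fully precise alternative: an actual division.
   static inline __m128 reciprocal_divss(__m128 det)
   {
     return _mm_div_ss(_mm_set_ss(1.0f), det);
   }

The second version is correctly rounded; the first one is what can lose up
to 2 bits in the mantissa.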
>> >>
>> >> I have a question. I currently get warnings in this code (taken
>> >> straight from Intel):
>> >>
>> >>    __m128 tmp1;
>> >>    tmp1  = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)), (__m64*)(src+ 4));
>> >
>> > why don't you use tmp1 = ei_pload(src); since we know src will be
>> > aligned?
>>
>> You're right, I didn't look carefully at this line from Intel, and
>> just below, there are these lines:
>>
>>    tmp1  = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src+ 2)), (__m64*)(src+ 6));
>>    row3  = _mm_loadh_pi(_mm_loadl_pi(row3, (__m64*)(src+10)), (__m64*)(src+14));
>>
>> which cannot use ei_pload since they use offsets that are not
>> multiples of 16 bytes, and that is where I got confused.
>>
>> By the way, is this trick (to load from 64-bit aligned addresses)
>> worth abstracting into a "ei_pload8" function? It's probably faster
>> than completely unaligned loads...
>
> this is more or less what we do in ei_ploadu, but using a pair of
> movsd/movhps and inline assembly to keep GCC from messing it up. But
> currently for MSVC we use raw unaligned loads, so we probably should
> switch to a pair of intrinsics.
>
> I chose a pair of movsd/movhps because it appeared to be the fastest
> option for my CPU.
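
(For illustration, a minimal sketch of what such a pair-of-intrinsics load
could look like; this is not the actual ei_ploadu code, and it leaves out
the inline-assembly workaround mentioned above:)

   #include <xmmintrin.h>
   #include <emmintrin.h>

   // Load 4 floats from an address that is only 8-byte aligned, using a
   // low-half load (movsd) followed by a high-half load (movhps), instead
   // of a single unaligned movups.
   static inline __m128 load_ps_8byte_aligned(const float* p)
   {
     __m128 lo = _mm_castpd_ps(_mm_load_sd(reinterpret_cast<const double*>(p)));
     return _mm_loadh_pi(lo, reinterpret_cast<const __m64*>(p + 2));
   }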
>
> Finally, to be clear, I think you can replace these 2 lines with ei_ploadu
> and the performance should be the same.

Actually, these lines were not equivalent to loads!

When you look at this,

   tmp1  = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)), (__m64*)(src+ 4));

The second half is loaded from src+4, not src+2.

What is being loaded here is the top-left 2x2 corner of the matrix.
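
A tiny standalone check (illustrative only, not part of Eigen or of Intel's
code) makes the pattern concrete:

   #include <cstdio>
   #include <xmmintrin.h>

   int main()
   {
     float src[16];
     for (int i = 0; i < 16; ++i) src[i] = float(i);   // src[k] = k

     // Initializing tmp1 here also avoids the kind of uninitialized-use
     // warning mentioned earlier in the thread.
     __m128 tmp1 = _mm_setzero_ps();
     tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (const __m64*)(src)),
                         (const __m64*)(src + 4));

     float out[4];
     _mm_storeu_ps(out, tmp1);
     std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
     // prints: 0 1 4 5  -- the top-left 2x2 block, not src[0..3]
     return 0;
   }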

Benoit


