Re: [eigen] Re: 4x4 matrix inverse



2009/12/15 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>
>
> On Tue, Dec 15, 2009 at 5:25 AM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
> wrote:
>>
>> Hi,
>>
>> To summarize recent commits: all this is now done in the development
>> branch, it only remains to consider backporting.
>>
>> The SSE code is 4.5x faster than my plain scalar path! I guess that's
>> explained not only by SSE intrinsics but also by better ordering of
>> instructions...
>>
>> There is one thing where I didn't follow Intel's code: they use an
>> RCPSS instruction to compute 1/det approximately, followed by a
>> Newton-Raphson iteration. This sacrifices up to 2 bits of precision in
>> the mantissa, which is already a bit nontrivial for us (4x4 matrix
>> inversion is a basic operation on which people will rely very
>> heavily). To help settle that dilemma (performance vs precision) I
>> benchmarked it, and it turns out that on my Core i7, DIVSS is
>> slightly faster! Intel's paper was written for the Pentium III, so
>> that's perhaps not surprising, but I saw forum posts mentioning that
>> the RCPSS trick is still faster on the Core 2. If you want to test, see
>> lines 128-130 in Inverse_SSE.h.
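
For reference, the trick in question looks roughly like this (a minimal
sketch of the RCPSS + Newton-Raphson idea, not Intel's exact code; 'd'
stands for the determinant and the function names are made up):

    #include <xmmintrin.h>

    // One Newton-Raphson step on the ~12-bit RCPSS estimate:
    // x1 = x0 * (2 - d*x0) = 2*x0 - d*x0*x0.
    static inline float approx_recip(float d)
    {
      __m128 det = _mm_set_ss(d);
      __m128 x0  = _mm_rcp_ss(det);
      __m128 x1  = _mm_sub_ss(_mm_add_ss(x0, x0),
                              _mm_mul_ss(det, _mm_mul_ss(x0, x0)));
      return _mm_cvtss_f32(x1);
    }

    // The exact alternative, which turned out to be slightly faster here:
    static inline float exact_recip(float d)
    {
      return _mm_cvtss_f32(_mm_div_ss(_mm_set_ss(1.0f), _mm_set_ss(d)));
    }
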
>>
>> I have a question. I currently get warnings in this code (taken
>> straight from Intel):
>>
>>    __m128 tmp1;
>>    tmp1  = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)), (__m64*)(src+ 4));
>
> why don't you use tmp1 = ei_pload(src); since we know src will be aligned?

You're right, I hadn't looked carefully at that line from Intel's code;
just below it, there are these lines:

    tmp1  = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src+ 2)), (__m64*)(src+ 6));
    row3  = _mm_loadh_pi(_mm_loadl_pi(row3, (__m64*)(src+10)), (__m64*)(src+14));

which cannot use an ei_pload since they load from offsets that are not
multiples of 16 bytes, and that is where my confusion came from.

By the way, is this trick (loading from 64-bit-aligned addresses) worth
abstracting into an "ei_pload8" function? It's probably faster than
completely unaligned loads...
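
Something like this is what I have in mind (just a sketch; "ei_pload8" is
only a tentative name, it doesn't exist anywhere yet):

    #include <xmmintrin.h>

    // Hypothetical ei_pload8: load 4 floats from an address that is only
    // guaranteed to be 8-byte aligned, as two 64-bit halves.
    static inline __m128 ei_pload8(const float* from)
    {
      __m128 r = _mm_setzero_ps(); // zero-init keeps the compiler quiet, at the cost of one xorps
      r = _mm_loadl_pi(r, reinterpret_cast<const __m64*>(from));
      r = _mm_loadh_pi(r, reinterpret_cast<const __m64*>(from + 2));
      return r;
    }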

Thanks for the tip,
Benoit

>
> gael.
>
>>
>> The warning claims that tmp1 is used uninitialized here. GCC doesn't
>> understand that tmp1 is only passed to _mm_loadl_pi, which writes into it
>> and does not read from it. How can I fix that warning? I tested
>> initializing tmp1; this had a not-totally-negligible impact on performance
>> (because there are 2 more variables that need this). There does not seem
>> to be an __attribute__ for this.
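
One possible way around it (a sketch only, not benchmarked; the function
name is made up): start from an SSE2 _mm_load_sd, which zeroes the upper
half as part of the load itself, so nothing uninitialized is ever read:

    #include <emmintrin.h> // SSE2, for _mm_load_sd / _mm_castpd_ps

    // Load src[0..1] into the low half (upper half zeroed by the load
    // itself) and src[4..5] into the high half, as in the snippet above.
    static inline __m128 load_row0(const float* src)
    {
      return _mm_loadh_pi(
          _mm_castpd_ps(_mm_load_sd(reinterpret_cast<const double*>(src))),
          reinterpret_cast<const __m64*>(src + 4));
    }
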
>>
>> Benoit
>>
>> 2009/12/4 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>> > Hi,
>> >
>> > Long ago I thought it would be a good idea to optimize the 4x4 matrix
>> > inverse using "Euler's trick", which greatly reduces the number of
>> > operations but relies on some 2x2 block inside the matrix being
>> > invertible.
>> >
>> > The problem is that this gives bad precision, and the best compromise
>> > that I could find between precision and performance is still:
>> >  - 10x more imprecise in the worst case
>> >  - only 25% faster.
>> >
>> > My last reason to cling to this approach was that it was supposedly
>> > more vectorizable, but reading this,
>> > ftp://download.intel.com/design/PentiumIII/sml/24504301.pdf
>> > I realized that Intel engineers actually figured out how to vectorize
>> > the plain old cofactors approach very efficiently.
>> >
>> > So I'll switch to cofactors in both branches, I think. I'll also
>> > implement SSE at least in the default branch.
>> >
>> > Question: do you think that Intel's code is provided free for use? Or
>> > should I avoid looking at it? Even if I can't look at it, they still
>> > provide good explanations.
>> >
>> > Benoit
>> >
>


