Re: [eigen] Re: 4x4 matrix inverse

On Tue, Dec 15, 2009 at 5:25 AM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:

Hi,

To summarize recent commits: all this is now done in the development
branch, it only remains to consider backporting.

The SSE code is 4.5x faster than my plain scalar path! I guess that's
explained not only by SSE intrinsics but also by better ordering of
instructions...

There is one thing where I didn't follow Intel's code: they use an
RCPSS instruction to compute 1/det approximately, followed by a
Newton-Raphson iteration. This sacrifices up to 2 bits of precision in
the mantissa, which is already nontrivial for us (4x4 matrix
inversion is a basic operation on which people will rely very
heavily). To help resolve that dilemma (performance vs. precision) I
benchmarked it, and it turns out that on my Core i7, DIVSS is
slightly faster!! Intel's paper was written for the Pentium III, so
that's perhaps not surprising, but I saw forum posts mentioning that
the RCPSS trick is still faster on the Core 2. If you want to test, see
lines 128-130 in Inverse_SSE.h.

I have a question. I currently get warnings in this code (taken
straight from Intel):

__m128 tmp1;
tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)), (__m64*)(src+ 4));

Why don't you use tmp1 = ei_pload(src); since we know src will be aligned?

gael.

The warning claims that tmp1 is used uninitialized here. GCC doesn't
understand that it is only passed to _mm_loadl_pi, which writes into it
and does not read from it. How to fix that warning? I tested initializing
tmp1, and this had a not-totally-negligible impact on performance (because
there are 2 more variables that need this). There does not seem to be
an __attribute__ for this.
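
One possible way to sidestep the warning, sketched here under the
assumption that src is a float* and that the goal is the same result as
the loadl/loadh pair (src[0], src[1], src[4], src[5]): do two full
4-float loads and combine their halves, so that no uninitialized
register is ever passed in. The helper names are hypothetical, and
whether this is faster or slower than initializing tmp1 has not been
measured here.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Gathers (src[0], src[1], src[4], src[5]) from two full loads.
// Unaligned loads are used so the sketch makes no alignment assumption;
// _mm_load_ps can replace them when src is known to be 16-byte aligned.
inline __m128 load_pair_low(const float* src) {
    const __m128 a = _mm_loadu_ps(src);      // src[0..3]
    const __m128 b = _mm_loadu_ps(src + 4);  // src[4..7]
    return _mm_movelh_ps(a, b);              // (a[0], a[1], b[0], b[1])
}

// Gathers (src[2], src[3], src[6], src[7]) the same way.
inline __m128 load_pair_high(const float* src) {
    const __m128 a = _mm_loadu_ps(src);      // src[0..3]
    const __m128 b = _mm_loadu_ps(src + 4);  // src[4..7]
    return _mm_movehl_ps(b, a);              // (a[2], a[3], b[2], b[3])
}
```

When both halves are needed, the two loads can of course be shared
between the two shuffles instead of being repeated.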

Benoit

2009/12/4 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:

> Hi,
>
> Long ago I thought it would be a good idea to optimize 4x4 matrix
> inverse using "Euler's trick", which greatly reduces the number of
> operations but relies on some 2x2 block inside the matrix being
> invertible.
>
> The problem is that this gives bad precision, and the best compromise
> that I could find between precision and performance is still:
> - 10x more imprecise in the worst case
> - only 25% faster.
>
> My last reason to cling to this approach was that it was supposedly
> more vectorizable, but reading this,
> ftp://download.intel.com/design/PentiumIII/sml/24504301..pdf
> I realized that Intel engineers actually figured out how to vectorize the
> plain old cofactors approach very efficiently.
>
> So I'll switch to cofactors in both branches, I think. I'll also
> implement SSE at least in the default branch.
>
> Question: do you think that Intel's code is free to use? Or
> should I avoid looking at it? Even if I can't look at it, they still
> provide good explanations.
>
> Benoit
>