Re: [eigen] Re: 4x4 matrix inverse

On Tue, Dec 15, 2009 at 1:01 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:

2009/12/15 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:

>
>
> On Tue, Dec 15, 2009 at 10:15 AM, Gael Guennebaud
> <gael.guennebaud@xxxxxxxxxx> wrote:
>>
>>
>> On Tue, Dec 15, 2009 at 9:09 AM, Hauke Heibel
>> <hauke.heibel@xxxxxxxxxxxxxx> wrote:
>>>
>>> On Tue, Dec 15, 2009 at 5:25 AM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
>>> wrote:
>>>>
>>>> There is one thing where I didn't follow Intel's code: they use a
>>>> RCPSS instruction to compute 1/det approximately, then followed by a
>>>> Newton-Raphson iteration. This sacrifices up to 2 bits of precision in
>>>> the mantissa, which already is a bit nontrivial for us (4x4 matrix
>>>> inversion is a basic operation on which people will rely very
>>>> heavily).
>>>
>>> Hi Benoit, I recognized that you wrote in your commit log "that elsewhere
>>> in Eigen we dont allow ourselves this approximation" (Newton-Raphson). I
>>> just recalled stumbling once over such an approximation hidden in Eigen. It
>>> is for the SSE computation of the inverse square root.
>>>
>>> The approximation takes place overe here in Packet4f ei_psqrt(Packet4f
>>> _x) at the very bottom.
>>
>> Indeed, a while ago I think we agreed that this optimized version should
>> be enabled only if EIGEN_FAST_MATH==1 but it seems we never did that change.
>> Let me also recall that -ffast-math => EIGEN_FAST_MATH.
>>
>> So still OK to use the SSE intrinsic _mm_sqrt_p* by default and use the
>> optimized version only when EIGEN_FAST_MATH==1 ?
>
>
> sorry what I said is wrong. So currently we have by default
> EIGEN_FAST_MATH==1, and defining EIGEN_FAST_MATH==0 disable the vectorized
> version of sin and cos. If we want to be safe I suggest the following rules:
>
> 1- set EIGEN_FAST_MATH==0 by default
>
> 2- if "-ffast-fast" and EIGEN_FAST_MATH is not defined then we set
> EIGEN_FAST_MATH==1
>
> 3- EIGEN_FAST_MATH==0 => no vectorized version of sin and cos and the use of
> _mm_sqrt_p* for ei_psqrt
>
> 4- EIGEN_FAST_MATH==1 => vectorized version of sin and cos + the use of an
> optimized version of ei_psqrt
>
> Also let me recall that the optimized version of ei_psqrt is about 3x faster
> than the IEEE compliant version for floats but it loses about 2 bits of
> precision.

I agree with your plan but have 2 remarks:

1) as I said, I found that on core i7, the "fast division" using
RCP+Newton was not faster than a plain DIV anymore. So, at least in
this case, we see that we should always be doing DIV if very modern
CPUs are targeted, a good test for which is EIGEN_VECTORIZE_SSE4.

2) I wonder: when an algo loses 2 bits of precision, is it true that
doing one more Newton iteration would give full precision? If yes,
that would be worth doing for heavy operations like sqrt!
Theoretically, each Newton iteration doubles the number of accurate
bits, but there may be bad effects of floating point arithmetic
preventing that from happening in practice...

we already discussed about that:

http://listengine.tuxfamily.org/lists.tuxfamily.org/eigen/2009/03/msg00097.html

and the conclusion was that more iterations did not help much because we are already too close to the precision limit.

gael