Re: [eigen] Re: 4x4 matrix inverse



This is probably a stupid question, but I'm going to ask it anyway:
are these approaches better than just using Mathematica to invert a
4x4 analytically, and coding up the resulting formula?
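
For concreteness, by "coding up the resulting formula" I mean plain
scalar code along these lines (an untested sketch of the usual
cofactor/adjugate expansion; row-major float[16] assumed, and no
handling of a singular matrix):

    /* Analytic 4x4 inverse via 2x2 sub-determinants of the top and
       bottom row pairs.  m and inv are row-major float[16]. */
    float invert4x4(const float* m, float* inv)
    {
      /* 2x2 minors built from rows 0-1 ... */
      float s0 = m[0]*m[5] - m[1]*m[4];
      float s1 = m[0]*m[6] - m[2]*m[4];
      float s2 = m[0]*m[7] - m[3]*m[4];
      float s3 = m[1]*m[6] - m[2]*m[5];
      float s4 = m[1]*m[7] - m[3]*m[5];
      float s5 = m[2]*m[7] - m[3]*m[6];

      /* ... and from rows 2-3. */
      float c0 = m[8]*m[13]  - m[9]*m[12];
      float c1 = m[8]*m[14]  - m[10]*m[12];
      float c2 = m[8]*m[15]  - m[11]*m[12];
      float c3 = m[9]*m[14]  - m[10]*m[13];
      float c4 = m[9]*m[15]  - m[11]*m[13];
      float c5 = m[10]*m[15] - m[11]*m[14];

      float det = s0*c5 - s1*c4 + s2*c3 + s3*c2 - s4*c1 + s5*c0;
      float r = 1.0f / det;   /* no det == 0 check in this sketch */

      /* Adjugate (transposed cofactors) scaled by 1/det. */
      inv[ 0] = ( m[5]*c5  - m[6]*c4  + m[7]*c3 ) * r;
      inv[ 1] = (-m[1]*c5  + m[2]*c4  - m[3]*c3 ) * r;
      inv[ 2] = ( m[13]*s5 - m[14]*s4 + m[15]*s3) * r;
      inv[ 3] = (-m[9]*s5  + m[10]*s4 - m[11]*s3) * r;
      inv[ 4] = (-m[4]*c5  + m[6]*c2  - m[7]*c1 ) * r;
      inv[ 5] = ( m[0]*c5  - m[2]*c2  + m[3]*c1 ) * r;
      inv[ 6] = (-m[12]*s5 + m[14]*s2 - m[15]*s1) * r;
      inv[ 7] = ( m[8]*s5  - m[10]*s2 + m[11]*s1) * r;
      inv[ 8] = ( m[4]*c4  - m[5]*c2  + m[7]*c0 ) * r;
      inv[ 9] = (-m[0]*c4  + m[1]*c2  - m[3]*c0 ) * r;
      inv[10] = ( m[12]*s4 - m[13]*s2 + m[15]*s0) * r;
      inv[11] = (-m[8]*s4  + m[9]*s2  - m[11]*s0) * r;
      inv[12] = (-m[4]*c3  + m[5]*c1  - m[6]*c0 ) * r;
      inv[13] = ( m[0]*c3  - m[1]*c1  + m[2]*c0 ) * r;
      inv[14] = (-m[12]*s3 + m[13]*s1 - m[14]*s0) * r;
      inv[15] = ( m[8]*s3  - m[9]*s1  + m[10]*s0) * r;

      return det;
    }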

On Tue, Jan 19, 2010 at 1:11 PM, Gael Guennebaud
<gael.guennebaud@xxxxxxxxx> wrote:
> On Tue, Jan 19, 2010 at 5:43 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>> 2010/1/19 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>>> B. Ober told us about a more recent version of Intel's SSE fast
>>> inversion code which can be found here:
>>>
>>> http://software.intel.com/en-us/articles/optimized-matrix-library-for-use-with-the-intel-pentiumr-4-processors-sse2-instructions/
>>>
>>> It takes advantage of SSE2 instructions, there is also a version
>>> for double, and all of it comes with a clearer license.
>>>
>>> Both versions (floats and doubles) are already in the devel branch. In
>>> case you were wondering, here are some results for 10,000,000 inversions:
>>>
>>> float, no SSE: 1.72s
>>> float, SSE (previous version): 0.29s
>>> float, SSE2 (new version): 0.26s
>>>
>>> double, no SSE: 1.72s
>>> double, SSE2: 0.45s
>>>
>>> (core2 2.66GHz, gcc 4.4, linux 64bits)
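
(Just to make the measurement concrete, a timing loop along these
lines should reproduce that kind of number; this is my own rough
sketch, not the benchmark actually used above:)

    #include <Eigen/Dense>
    #include <cstdio>
    #include <ctime>

    int main()
    {
      Eigen::Matrix4f m = Eigen::Matrix4f::Random();
      m.diagonal() += Eigen::Vector4f::Constant(4.f); // keep it well conditioned
      Eigen::Matrix4f acc = Eigen::Matrix4f::Zero();

      std::clock_t t0 = std::clock();
      for (int i = 0; i < 10000000; ++i) {
        acc += m.inverse();   // accumulate so the inversion is not optimized away
        m(0,0) += 1e-7f;      // perturb the input a little each iteration
      }
      double secs = double(std::clock() - t0) / CLOCKS_PER_SEC;
      std::printf("10,000,000 inversions: %.2fs (checksum %g)\n", secs, acc.sum());
      return 0;
    }
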
>>
>> Excellent!
>>
>> I have a question. Here, the benefit is like 7x for float and 3.5x for
>> double. Is that thanks to the CPU doing add+mul in 1 cycle?
>
> I think the main reason the gain is higher than 4x is that when you
> use packets the entire matrix fits into only 4 registers, which leaves
> 12 registers to store all the intermediate values, etc. The
> consequences are that 1) memory accesses are reduced to the minimum,
> and 2) pipelining improves (the extra registers allow instruction
> dependencies to be reduced).
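
(To make the register argument concrete: the whole 4x4 fits in four
xmm registers, so once it has been loaded everything can proceed
register-to-register. A trivial illustration below, using a transpose
rather than the actual inverse kernel, and assuming a 16-byte-aligned
float[16]:)

    #include <xmmintrin.h>

    void transpose4x4(const float* src, float* dst)
    {
      __m128 r0 = _mm_load_ps(src +  0);  /* 4 loads: the whole matrix now */
      __m128 r1 = _mm_load_ps(src +  4);  /* sits in 4 of the 16 xmm       */
      __m128 r2 = _mm_load_ps(src +  8);  /* registers available on x86-64 */
      __m128 r3 = _mm_load_ps(src + 12);

      _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  /* pure register-to-register work */

      _mm_store_ps(dst +  0, r0);         /* 4 stores write the result back */
      _mm_store_ps(dst +  4, r1);
      _mm_store_ps(dst +  8, r2);
      _mm_store_ps(dst + 12, r3);
    }
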
>
>> I then
>> have a second question: is this ability
>>  - only for addps/mulps instructions?
>>  - or also for addss/mulss instructions?
>
> yes, also for the scalar addss/mulss instructions
>
>>  - or also with x87 code?
>
> no
>
>
>>
>> If it works with scalar instructions, is the no-SSE code slow just
>> because it is not written in low-level code carefully ordering
>> instructions?
>
> That's also a factor, but it's not the main reason.
>
> gael
>
>
>>
>> Benoit
>>
>>>
>>> gael
>>>
>>>
>>> On Tue, Dec 15, 2009 at 2:38 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>>>> 2009/12/15 mmoll <Markus.Moll@xxxxxxxxxxxxxxxx>:
>>>>> Hi
>>>>>
>>>>> Quoting Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>>>>> Actually, these lines were not equivalent to loads!
>>>>>>
>>>>>> When you look at this,
>>>>>>
>>>>>>    tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)), (__m64*)(src + 4));
>>>>>>
>>>>>> The second half is loaded from src+4, not src+2.
>>>>>>
>>>>>> What is being loaded here is the top-left 2x2 corner of the matrix.
>>>>>
>>>>> Ah, I was wondering what the purpose was. But can't the same be achieved
>>>>> by a combination of
>>>>>
>>>>> 1. aligned loads of the matrix rows into say a, b, c, d (a=[a4,a3,a2,a1]
>>>>> and so on)
>>>>> 2. unpack quad word pairs: _mm_unpackhi_epi64(b, a) apparently yields
>>>>> [a4,a3,b4,b3] (upper left) and _mm_unpacklo_epi64(b, a) yields [a2, a1,
>>>>> b2, b1] (upper right)? (this is SSE2, though)
>>>>>
>>>>> I have no idea how the performance compares, though. (or whether it
>>>>> works at all)
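
(If I understand the suggestion, it would look roughly like the sketch
below; untested, and whether these four blocks come out in the order
the inverse kernel wants will depend on the storage order:)

    #include <emmintrin.h>   /* SSE2 */

    /* [a1,a2,a3,a4], [b1,b2,b3,b4] -> [a1,a2,b1,b2] */
    static inline __m128 low_halves(__m128 a, __m128 b)
    {
      return _mm_castsi128_ps(
          _mm_unpacklo_epi64(_mm_castps_si128(a), _mm_castps_si128(b)));
    }

    /* [a1,a2,a3,a4], [b1,b2,b3,b4] -> [a3,a4,b3,b4] */
    static inline __m128 high_halves(__m128 a, __m128 b)
    {
      return _mm_castsi128_ps(
          _mm_unpackhi_epi64(_mm_castps_si128(a), _mm_castps_si128(b)));
    }

    /* Split a 16-byte-aligned row-major 4x4 into its four 2x2 blocks. */
    void blocks2x2(const float* src, __m128 block[4])
    {
      __m128 a = _mm_load_ps(src +  0);   /* aligned loads of the four rows */
      __m128 b = _mm_load_ps(src +  4);
      __m128 c = _mm_load_ps(src +  8);
      __m128 d = _mm_load_ps(src + 12);

      block[0] = low_halves (a, b);       /* top-left     */
      block[1] = high_halves(a, b);       /* top-right    */
      block[2] = low_halves (c, d);       /* bottom-left  */
      block[3] = high_halves(c, d);       /* bottom-right */
    }
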
>>>>
>>>> You know this much better than me (honest), why don't you try it? If
>>>> it's faster, we'll use it. SSE2 is OK, we require it anyway for any
>>>> SSE code.
>>>>
>>>> Benoit


