Re: [eigen] Re: 4x4 matrix inverse



On Tue, Jan 19, 2010 at 5:43 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> 2010/1/19 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>> B. Ober told us about a more recent version of Intel's SSE fast
>> inversion code which can be found here:
>>
>> http://software.intel.com/en-us/articles/optimized-matrix-library-for-use-with-the-intel-pentiumr-4-processors-sse2-instructions/
>>
>> It takes advantage of SSE2 instructions, and there is also a version
>> for double, and all of this with a clearer license.
>>
>> Both versions (floats and doubles) are already in the devel branch. If
>> you wonder, here are some results for 10,000,000 inversions:
>>
>> float, no SSE : 1.72s
>> float, SSE (previous version): 0.29s
>> float, SSE2 (new version): 0.26s
>>
>> double, no SSE: 1.72s
>> double, SSE2: 0.45s
>>
>> (core2 2.66GHz, gcc 4.4, linux 64bits)
>
> Excellent!
>
> I have a question. Here, the benefit is like 7x for float and 3.5x for
> double. Is that thanks to the CPU doing add+mul in 1 cycle?

I think the main reason the gain is higher than 4x is that when you
use packets the entire matrix fits into only 4 registers, which leaves
12 registers free for all the intermediate values, etc. As a
consequence, 1) memory accesses are reduced to the minimum, and 2)
pipelining is improved (the extra registers reduce instruction
dependencies).

> I then
> have a second question: is this ability
>  - only for addps/mulps instructions ?
>  - or also for addss / mulss instructions ?

Yes, it also applies to the scalar addss/mulss instructions.

>  - or also with x87 code?

No, x87 code does not benefit from it.


>
> If it works with scalar instructions, is the no-SSE code slow just
> because it is not written in low-level code carefully ordering
> instructions?

That plays a role too, but it is not the main reason.

gael


>
> Benoit
>
>>
>> gael
>>
>>
>> On Tue, Dec 15, 2009 at 2:38 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>>> 2009/12/15 mmoll <Markus.Moll@xxxxxxxxxxxxxxxx>:
>>>> Hi
>>>>
>>>> Quoting Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>>>> Actually, these lines were not equivalent to loads !
>>>>>
>>>>> When you look at this,
>>>>>
>>>>>    tmp1  = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)),
>>>>> (__m64*)(src+ 4));
>>>>>
>>>>> The second half is loaded from src+4, not src+2.
>>>>>
>>>>> What is being loaded here is the top-left 2x2 corner of the matrix.
>>>>
>>>> Ah, I was wondering what the purpose was. But can't the same be achieved
>>>> by a combination of
>>>>
>>>> 1. aligned loads of the matrix rows into say a, b, c, d (a=[a4,a3,a2,a1]
>>>> and so on)
>>>> 2. unpack quad word pairs: _mm_unpackhi_epi64(b, a) apparently yields
>>>> [a4,a3,b4,b3] (upper right) and _mm_unpacklo_epi64(b, a) yields [a2, a1,
>>>> b2, b1] (upper left)? (this is SSE2, though)
>>>>
>>>> I have no idea how the performance compares, though. (or whether it
>>>> works at all)
>>>
>>> You know this much better than me (honest), why don't you try it? If
>>> it's faster, we'll use it. SSE2 is OK, we require it anyway for any
>>> SSE code.
>>>
>>> Benoit


