Re: [eigen] Re: 4x4 matrix inverse
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] Re: 4x4 matrix inverse
- From: Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
- Date: Tue, 19 Jan 2010 18:17:58 -0500
The "resulting formula" is the usual cofactors formula giving you the
matrix coefficients of the inverse.
That's what we do in the non-SSE case.
In the SSE case, it takes extra care to make optimal use of SSE
instructions, registers etc. That is what Intel's code is so good at.
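For reference, that formula is just the textbook adjugate/cofactor identity,
A^-1 = adj(A) / det(A) with adj(A)_ij = (-1)^(i+j) * minor(j, i); nothing
Eigen-specific. A plain-C sketch of it (helper names made up here, not the
code we actually ship):

static float minor3(const float m[16], int r, int c)
{
    /* 3x3 minor: determinant of the submatrix left after deleting row r and column c */
    int ri[3], ci[3];
    for (int i = 0, k = 0; i < 4; ++i) if (i != r) ri[k++] = i;
    for (int j = 0, k = 0; j < 4; ++j) if (j != c) ci[k++] = j;
#define E(a, b) m[ri[a] * 4 + ci[b]]
    return E(0,0) * (E(1,1) * E(2,2) - E(1,2) * E(2,1))
         - E(0,1) * (E(1,0) * E(2,2) - E(1,2) * E(2,0))
         + E(0,2) * (E(1,0) * E(2,1) - E(1,1) * E(2,0));
#undef E
}

static int invert4(const float m[16], float inv[16])  /* row-major */
{
    float det = 0.0f;
    for (int j = 0; j < 4; ++j)  /* expand the determinant along row 0 */
        det += ((j & 1) ? -1.0f : 1.0f) * m[j] * minor3(m, 0, j);
    if (det == 0.0f) return 0;   /* singular */
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)  /* note the transpose: adjugate, not the cofactor matrix */
            inv[j * 4 + i] = (((i + j) & 1) ? -1.0f : 1.0f) * minor3(m, i, j) / det;
    return 1;
}

Expanding the minors by hand for the 4x4 case gives exactly the closed-form
expression Mathematica would print; the hard part is not the math, it is
scheduling the resulting arithmetic well.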
Benoit
2010/1/19 Scott Stephens <stephens.js@xxxxxxxxx>:
> This is probably a stupid question, but I'm going to ask it anyway:
> are these approaches better than just using Mathematica to invert a
> 4x4 analytically, and coding up the resulting formula?
>
> On Tue, Jan 19, 2010 at 1:11 PM, Gael Guennebaud
> <gael.guennebaud@xxxxxxxxx> wrote:
>> On Tue, Jan 19, 2010 at 5:43 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>>> 2010/1/19 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>>>> B. Ober told us about a more recent version of Intel's SSE fast
>>>> inversion code which can be found here:
>>>>
>>>> http://software.intel.com/en-us/articles/optimized-matrix-library-for-use-with-the-intel-pentiumr-4-processors-sse2-instructions/
>>>>
>>>> It takes advantage of SSE2 instructions, there is also a version for
>>>> double, and all of this comes with a clearer license.
>>>>
>>>> Both versions (float and double) are already in the devel branch. In
>>>> case you are wondering, here are some results for 10,000,000 inversions:
>>>>
>>>> float, no SSE: 1.72s
>>>> float, SSE (previous version): 0.29s
>>>> float, SSE2 (new version): 0.26s
>>>>
>>>> double, no SSE: 1.72s
>>>> double, SSE2: 0.45s
>>>>
>>>> (core2 2.66GHz, gcc 4.4, linux 64bits)
>>>
>>> Excellent!
>>>
>>> I have a question. Here, the benefit is like 7x for float and 3.5x for
>>> double. Is that thanks to the CPU doing add+mul in 1 cycle?
>>
>> I think the main reason the gain is higher than 4x is that, when you use
>> packets, the entire matrix fits into only 4 registers, which leaves 12
>> registers to store all the intermediate values, etc. The consequences are
>> that 1) the memory accesses are reduced to the minimum, and 2) the
>> pipelining is optimized (the additional registers allow the instruction
>> dependencies to be reduced).
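To make the register point concrete: a rough sketch, assuming a
16-byte-aligned, row-major float matrix at src (not the actual kernel code):

#include <xmmintrin.h>

/* The whole 4x4 float matrix is pulled into four XMM registers up front;
   on x86-64 that leaves 12 registers for the cofactor intermediates, so
   there are almost no loads/stores in the middle of the computation. */
__m128 row0 = _mm_load_ps(src +  0);
__m128 row1 = _mm_load_ps(src +  4);
__m128 row2 = _mm_load_ps(src +  8);
__m128 row3 = _mm_load_ps(src + 12);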
>>
>>> I then
>>> have a second question: is this ability
>>> - only for addps/mulps instructions?
>>> - or also for addss/mulss instructions?
>>
>> yes
>>
>>> - or also with x87 code?
>>
>> no
>>
>>
>>>
>>> If it works with scalar instructions, is the no-SSE code slow just
>>> because it is not written as low-level code that carefully orders the
>>> instructions?
>>
>> That also plays a part, but it's not the main reason.
>>
>> gael
>>
>>
>>>
>>> Benoit
>>>
>>>>
>>>> gael
>>>>
>>>>
>>>> On Tue, Dec 15, 2009 at 2:38 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>>>>> 2009/12/15 mmoll <Markus.Moll@xxxxxxxxxxxxxxxx>:
>>>>>> Hi
>>>>>>
>>>>>> Quoting Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>>>>>> Actually, these lines were not equivalent to loads!
>>>>>>>
>>>>>>> When you look at this,
>>>>>>>
>>>>>>> tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)),
>>>>>>>                     (__m64*)(src + 4));
>>>>>>>
>>>>>>> The second half is loaded from src+4, not src+2.
>>>>>>>
>>>>>>> What is being loaded here is the top-left 2x2 corner of the matrix.
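Spelled out as a self-contained snippet (assuming src points at a row-major
4x4 float matrix; the zero-initialization of tmp1 is only there so the
snippet compiles on its own):

#include <xmmintrin.h>

/* _mm_loadl_pi fills the low  64 bits with src[0], src[1]  (row 0, columns 0-1),
   _mm_loadh_pi fills the high 64 bits with src[4], src[5]  (row 1, columns 0-1),
   so tmp1 ends up holding the top-left 2x2 block of the matrix. */
__m128 tmp1 = _mm_setzero_ps();
tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (const __m64*)(src)),
                    (const __m64*)(src + 4));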
>>>>>>
>>>>>> Ah, I was wondering what the purpose was. But can't the same be achieved
>>>>>> by a combination of
>>>>>>
>>>>>> 1. aligned loads of the matrix rows into say a, b, c, d (a=[a4,a3,a2,a1]
>>>>>> and so on)
>>>>>> 2. unpack quad word pairs: _mm_unpackhi_epi64(b, a) apparently yields
>>>>>> [a4,a3,b4,b3] (upper left) and _mm_unpacklo_epi64(b, a) yields [a2, a1,
>>>>>> b2, b1] (upper right)? (this is SSE2, though)
>>>>>>
>>>>>> I have no idea how the performance compares, though. (or whether it
>>>>>> works at all)
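For what it's worth, an untested sketch of that unpack idea (assuming
16-byte-aligned, row-major rows at src; the _epi64 unpacks need casts when
starting from __m128 values, and _mm_movelh_ps / _mm_movehl_ps would be the
SSE1 way to get the same shuffles):

#include <emmintrin.h>  /* SSE2 */

__m128 a = _mm_load_ps(src + 0);   /* row 0 */
__m128 b = _mm_load_ps(src + 4);   /* row 1 */

/* The 64-bit unpacks pair up the low and high halves of the two rows,
   i.e. the left and right 2x2 blocks of the top half of the matrix. */
__m128 left  = _mm_castsi128_ps(
    _mm_unpacklo_epi64(_mm_castps_si128(a), _mm_castps_si128(b)));
__m128 right = _mm_castsi128_ps(
    _mm_unpackhi_epi64(_mm_castps_si128(a), _mm_castps_si128(b)));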
>>>>>
>>>>> You know this much better than I do (honestly), so why don't you try it? If
>>>>> it's faster, we'll use it. SSE2 is OK, we require it anyway for any
>>>>> SSE code.
>>>>>
>>>>> Benoit