Re: [eigen] Re: 4x4 matrix inverse
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] Re: 4x4 matrix inverse
- From: Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
- Date: Tue, 19 Jan 2010 11:43:42 -0500
2010/1/19 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
> B. Ober told us about a more recent version of Intel's SSE fast
> inversion code which can be found here:
>
> http://software.intel.com/en-us/articles/optimized-matrix-library-for-use-with-the-intel-pentiumr-4-processors-sse2-instructions/
>
> It takes advantage of SSE2 instructions, and there is also a version
> for double, and all of this with a clearer license.
>
> Both versions (float and double) are already in the devel branch. In
> case you're wondering, here are some results for 10,000,000 inversions:
>
> float, no SSE : 1.72s
> float, SSE (previous version): 0.29s
> float, SSE2 (new version): 0.26s
>
> double, no SSE: 1.72s
> double, SSE2: 0.45s
>
> (Core 2 2.66GHz, gcc 4.4, 64-bit Linux)
Excellent!
I have a question. The speedup here is roughly 7x for float and almost
4x for double. Is that thanks to the CPU being able to issue an add and
a mul in the same cycle? And a second question: does that ability apply
- only to the packed addps/mulps instructions,
- also to the scalar addss/mulss instructions,
- or also to x87 code?
If it works with scalar instructions too, is the no-SSE code slow
simply because it is not written as low-level code that carefully
orders the instructions?
Benoit
>
> gael
>
>
> On Tue, Dec 15, 2009 at 2:38 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>> 2009/12/15 mmoll <Markus.Moll@xxxxxxxxxxxxxxxx>:
>>> Hi
>>>
>>> Quoting Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>>> Actually, these lines were not equivalent to loads!
>>>>
>>>> When you look at this,
>>>>
>>>> tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)),
>>>> (__m64*)(src+ 4));
>>>>
>>>> The second half is loaded from src+4, not src+2.
>>>>
>>>> What is being loaded here is the top-left 2x2 corner of the matrix.
>>>
>>> Ah, I was wondering what the purpose was. But can't the same be achieved
>>> by a combination of
>>>
>>> 1. aligned loads of the matrix rows into say a, b, c, d (a=[a4,a3,a2,a1]
>>> and so on)
>>> 2. unpack quad word pairs: _mm_unpackhi_epi64(b, a) apparently yields
>>> [a4,a3,b4,b3] (upper right) and _mm_unpacklo_epi64(b, a) yields [a2, a1,
>>> b2, b1] (upper left)? (this is SSE2, though)
>>>
>>> I have no idea how the performance compares, though. (or whether it
>>> works at all)
>>
>> You know this much better than I do (honestly), so why don't you try
>> it? If it's faster, we'll use it. SSE2 is fine; we require it anyway
>> for any SSE code.
>>
>> Benoit