Re: [eigen] Re: 4x4 matrix inverse
- To: eigen <eigen@xxxxxxxxxxxxxxxxxxx>
- Subject: Re: [eigen] Re: 4x4 matrix inverse
- From: Gael Guennebaud <gael.guennebaud@xxxxxxxxx>
- Date: Tue, 19 Jan 2010 16:20:11 +0100
B. Ober pointed us to a more recent version of Intel's fast SSE
inversion code, which can be found here:
http://software.intel.com/en-us/articles/optimized-matrix-library-for-use-with-the-intel-pentiumr-4-processors-sse2-instructions/
It takes advantage of SSE2 instructions, includes a version for
double precision, and comes with a clearer license.
Both versions (float and double) are already in the devel branch. In
case you are wondering, here are some timings for 10,000,000 inversions:
float, no SSE: 1.72s
float, SSE (previous version): 0.29s
float, SSE2 (new version): 0.26s
double, no SSE: 1.72s
double, SSE2: 0.45s
(Core 2 at 2.66 GHz, gcc 4.4, 64-bit Linux)
gael
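For comparison with the timings above, a plain scalar 4x4 inverse can be written with the adjugate (cofactor) formula, building every cofactor from twelve 2x2 sub-determinants. This is a minimal sketch assuming a row-major `float[16]` layout; `mat4_inverse` is a hypothetical name, and this is neither Eigen's code nor Intel's.

```c
#include <assert.h>

/* Hypothetical helper: invert a row-major 4x4 float matrix m into inv via
   inv = adj(m) / det(m). Returns 0 if det is exactly zero, 1 otherwise.
   All inputs are copied into locals first, so inv may alias m. */
static int mat4_inverse(const float m[16], float inv[16])
{
    assert(m && inv);

    /* a_rc = m[4*r + c] */
    const float a00 = m[0],  a01 = m[1],  a02 = m[2],  a03 = m[3];
    const float a10 = m[4],  a11 = m[5],  a12 = m[6],  a13 = m[7];
    const float a20 = m[8],  a21 = m[9],  a22 = m[10], a23 = m[11];
    const float a30 = m[12], a31 = m[13], a32 = m[14], a33 = m[15];

    /* 2x2 determinants taken from rows 0-1 (s*) and rows 2-3 (c*) */
    const float s0 = a00 * a11 - a10 * a01;
    const float s1 = a00 * a12 - a10 * a02;
    const float s2 = a00 * a13 - a10 * a03;
    const float s3 = a01 * a12 - a11 * a02;
    const float s4 = a01 * a13 - a11 * a03;
    const float s5 = a02 * a13 - a12 * a03;

    const float c0 = a20 * a31 - a30 * a21;
    const float c1 = a20 * a32 - a30 * a22;
    const float c2 = a20 * a33 - a30 * a23;
    const float c3 = a21 * a32 - a31 * a22;
    const float c4 = a21 * a33 - a31 * a23;
    const float c5 = a22 * a33 - a32 * a23;

    /* Laplace expansion of the determinant along the first two rows */
    const float det = s0 * c5 - s1 * c4 + s2 * c3 + s3 * c2 - s4 * c1 + s5 * c0;
    if (det == 0.0f)
        return 0;
    const float d = 1.0f / det;

    /* Each entry is a transposed cofactor scaled by 1/det */
    inv[0]  = ( a11 * c5 - a12 * c4 + a13 * c3) * d;
    inv[1]  = (-a01 * c5 + a02 * c4 - a03 * c3) * d;
    inv[2]  = ( a31 * s5 - a32 * s4 + a33 * s3) * d;
    inv[3]  = (-a21 * s5 + a22 * s4 - a23 * s3) * d;
    inv[4]  = (-a10 * c5 + a12 * c2 - a13 * c1) * d;
    inv[5]  = ( a00 * c5 - a02 * c2 + a03 * c1) * d;
    inv[6]  = (-a30 * s5 + a32 * s2 - a33 * s1) * d;
    inv[7]  = ( a20 * s5 - a22 * s2 + a23 * s1) * d;
    inv[8]  = ( a10 * c4 - a11 * c2 + a13 * c0) * d;
    inv[9]  = (-a00 * c4 + a01 * c2 - a03 * c0) * d;
    inv[10] = ( a30 * s4 - a31 * s2 + a33 * s0) * d;
    inv[11] = (-a20 * s4 + a21 * s2 - a23 * s0) * d;
    inv[12] = (-a10 * c3 + a11 * c1 - a12 * c0) * d;
    inv[13] = ( a00 * c3 - a01 * c1 + a02 * c0) * d;
    inv[14] = (-a30 * s3 + a31 * s1 - a32 * s0) * d;
    inv[15] = ( a20 * s3 - a21 * s1 + a22 * s0) * d;
    return 1;
}
```

Sharing the twelve 2x2 products between the determinant and all sixteen cofactors keeps the scalar version reasonably cheap; SIMD versions typically compute several of these products per instruction.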
On Tue, Dec 15, 2009 at 2:38 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> 2009/12/15 mmoll <Markus.Moll@xxxxxxxxxxxxxxxx>:
>> Hi
>>
>> Quoting Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>> Actually, these lines were not equivalent to loads!
>>>
>>> When you look at this,
>>>
>>> tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)),
>>> (__m64*)(src+ 4));
>>>
>>> The second half is loaded from src+4, not src+2.
>>>
>>> What is being loaded here is the top-left 2x2 corner of the matrix.
>>
>> Ah, I was wondering what the purpose was. But can't the same be achieved
>> by a combination of
>>
>> 1. aligned loads of the matrix rows into say a, b, c, d (a=[a4,a3,a2,a1]
>> and so on)
>> 2. unpack quad word pairs: _mm_unpackhi_epi64(b, a) apparently yields
>> [a4,a3,b4,b3] (upper left) and _mm_unpacklo_epi64(b, a) yields [a2, a1,
>> b2, b1] (upper right)? (this is SSE2, though)
>>
>> I have no idea how the performance compares, though. (or whether it
>> works at all)
>
> You know this much better than I do (honestly); why don't you try it? If
> it's faster, we'll use it. SSE2 is OK; we require it anyway for any
> SSE code.
>
> Benoit
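Benoit's point about the quoted `_mm_loadl_pi`/`_mm_loadh_pi` line can be checked directly. The sketch below wraps that pair in a hypothetical helper, `load_top_left_2x2`, assuming a row-major `float[16]` matrix, and starts from `_mm_setzero_ps()` instead of an uninitialized `tmp1`. The register ends up holding `src[0], src[1], src[4], src[5]` (low to high), i.e. the top-left 2x2 corner, whereas a single `_mm_loadu_ps(src)` would give `src[0..3]`.

```c
#include <assert.h>
#include <xmmintrin.h> /* SSE: __m128, __m64, _mm_loadl_pi, _mm_loadh_pi */

/* Hypothetical helper: gather the top-left 2x2 corner of a row-major
   4x4 float matrix into one register. Lanes, low to high:
   src[0], src[1] (row 0) and src[4], src[5] (row 1). */
static __m128 load_top_left_2x2(const float *src)
{
    assert(src);
    __m128 r = _mm_setzero_ps();                   /* defined starting value */
    r = _mm_loadl_pi(r, (const __m64 *)(src));     /* lanes 0-1 <- src[0], src[1] */
    r = _mm_loadh_pi(r, (const __m64 *)(src + 4)); /* lanes 2-3 <- src[4], src[5] */
    return r;
}
```

Unlike `_mm_load_ps`, these two 64-bit loads do not require the matrix to be 16-byte aligned.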
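Markus's alternative can be sketched the same way: two aligned row loads followed by SSE2 64-bit unpacks, with casts between `__m128` and `__m128i` because `_mm_unpacklo_epi64`/`_mm_unpackhi_epi64` operate on integer vectors. `split_top_2x2_blocks` is a hypothetical helper, and the sketch assumes a row-major matrix stored at a 16-byte-aligned address. With lane 0 holding `src[0]`, the top-left corner `src[0], src[1], src[4], src[5]` comes from the *lo* unpack with row 0 as the first operand; the *hi* unpack gives the top-right corner.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2: _mm_unpacklo_epi64, _mm_unpackhi_epi64, casts */

/* Hypothetical helper: split rows 0 and 1 of a 16-byte-aligned, row-major
   4x4 float matrix into its two upper 2x2 blocks. Lanes, low to high:
     *top_left  = src[0], src[1], src[4], src[5]
     *top_right = src[2], src[3], src[6], src[7] */
static void split_top_2x2_blocks(const float *src, __m128 *top_left, __m128 *top_right)
{
    /* _mm_load_ps requires a 16-byte-aligned address */
    assert(((uintptr_t)src & 15u) == 0);

    __m128i row0 = _mm_castps_si128(_mm_load_ps(src));     /* src[0..3] */
    __m128i row1 = _mm_castps_si128(_mm_load_ps(src + 4)); /* src[4..7] */

    /* unpacklo_epi64(x, y) = { low 64 bits of x, low 64 bits of y } */
    *top_left = _mm_castsi128_ps(_mm_unpacklo_epi64(row0, row1));
    /* unpackhi_epi64(x, y) = { high 64 bits of x, high 64 bits of y } */
    *top_right = _mm_castsi128_ps(_mm_unpackhi_epi64(row0, row1));
}
```

Applied to the `__m128` rows directly, the SSE1 intrinsics `_mm_movelh_ps(row0, row1)` and `_mm_movehl_ps(row1, row0)` produce the same two blocks without leaving the floating-point domain. On some Intel cores, passing values between integer and floating-point instructions can add bypass latency, so which variant is faster would need measuring.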