Re: [eigen] Re: 4x4 matrix inverse


- *To*: eigen@xxxxxxxxxxxxxxxxxxx
- *Subject*: Re: [eigen] Re: 4x4 matrix inverse
- *From*: Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
- *Date*: Tue, 15 Dec 2009 06:52:17 -0500

2009/12/15 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>
> On Tue, Dec 15, 2009 at 5:25 AM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
> wrote:
>>
>> Hi,
>>
>> To summarize recent commits: all this is now done in the development
>> branch; it only remains to consider backporting.
>>
>> The SSE code is 4.5x faster than my plain scalar path! I guess that's
>> explained not only by SSE intrinsics but also by better ordering of
>> instructions...
>>
>> There is one thing where I didn't follow Intel's code: they use a
>> RCPSS instruction to compute 1/det approximately, followed by a
>> Newton-Raphson iteration. This sacrifices up to 2 bits of precision in
>> the mantissa, which already is a bit nontrivial for us (4x4 matrix
>> inversion is a basic operation on which people will rely very
>> heavily). To help solve that dilemma (performance vs precision) I
>> benchmarked it, and it turns out that on my Core i7, DIVSS is
>> slightly faster!! Intel's paper was written for the Pentium III so
>> that's perhaps not surprising, but I saw forum posts mentioning that
>> the RCPSS trick is still faster on the Core2. If you want to test, see
>> lines 128-130 in Inverse_SSE.h.
>>
>> I have a question. I currently get warnings in this code (taken
>> straight from Intel):
>>
>>     __m128 tmp1;
>>     tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)), (__m64*)(src+ 4));
>
> why don't you use tmp1 = ei_pload(src); since we know src will be aligned ?

You're right, I didn't look carefully at this line from Intel, and just
below, there are these lines:

    tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src+ 2)), (__m64*)(src+ 6));
    row3 = _mm_loadh_pi(_mm_loadl_pi(row3, (__m64*)(src+10)), (__m64*)(src+14));

which cannot use ei_pload since their offsets are not multiples of
16 bytes, and I was confused from there.

By the way, is this trick (to load from 64-bit aligned addresses)
worth abstracting into an "ei_pload8" function? It's probably faster
than completely unaligned loads...

Thanks for the tip,
Benoit

>
> gael.
>
>>
>> The warning claims that tmp1 is used uninitialized here. GCC doesn't
>> understand that it is only passed to _mm_loadl_pi, which writes into it
>> and does not read from it. How to fix that warning? I tested initializing
>> tmp1; this had a not-totally-negligible impact on performance (because
>> there are 2 more variables that need this). There does not seem to be
>> an __attribute__ for this.
>>
>> Benoit
>>
>> 2009/12/4 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>> > Hi,
>> >
>> > Long ago I thought it would be a good idea to optimize 4x4 matrix
>> > inverse using "Euler's trick", which greatly reduces the number of
>> > operations but relies on some 2x2 block inside the matrix being
>> > invertible.
>> >
>> > The problem is that this gives bad precision, and the best compromise
>> > that I could find between precision and performance is still:
>> > - 10x more imprecise in the worst case
>> > - only 25% faster.
>> >
>> > My last reason to cling to this approach is that it was supposedly
>> > more vectorizable, but reading this,
>> > ftp://download.intel.com/design/PentiumIII/sml/24504301.pdf
>> > I realized that Intel engineers actually figured out how to vectorize
>> > the plain old cofactors approach very efficiently.
>> >
>> > So I'll switch to cofactors in both branches, I think. I'll also
>> > implement SSE at least in the default branch.
>> >
>> > Question: do you think that Intel's code is provided free of use? Or
>> > should I avoid looking at it? Even if I can't look at it, they still
>> > provide good explanations.
>> >
>> > Benoit
>> >
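
For readers who want to try the 1/det trade-off discussed in the message above, here is a minimal sketch of the two options: RCPSS refined by one Newton-Raphson step, versus a plain DIVSS. It is not the code from Inverse_SSE.h; the function names `recip_rcp_nr`, `recip_div` and `relative_error` are made up for illustration, and an x86 target with SSE is assumed.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_rcp_ss, _mm_div_ss, ...
#include <cassert>
#include <cmath>

// Hypothetical helper: approximate 1/x with RCPSS, then apply one
// Newton-Raphson step y1 = y0 * (2 - x * y0) to refine the estimate.
static inline float recip_rcp_nr(float x) {
    const __m128 vx  = _mm_set_ss(x);
    const __m128 y0  = _mm_rcp_ss(vx);   // hardware estimate, roughly 12 bits
    const __m128 two = _mm_set_ss(2.0f);
    const __m128 y1  = _mm_mul_ss(y0, _mm_sub_ss(two, _mm_mul_ss(vx, y0)));
    return _mm_cvtss_f32(y1);
}

// Hypothetical helper: correctly rounded 1/x using DIVSS.
static inline float recip_div(float x) {
    return _mm_cvtss_f32(_mm_div_ss(_mm_set_ss(1.0f), _mm_set_ss(x)));
}

// Relative difference between an approximation and a reference value.
static inline float relative_error(float approx, float exact) {
    assert(exact != 0.0f);
    return std::fabs((approx - exact) / exact);
}
```

Calling `relative_error(recip_rcp_nr(det), recip_div(det))` for a few determinants shows how far the refined estimate lands from the divided result; which variant is faster depends on the CPU, as the message notes, so it has to be benchmarked on the target machine.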

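One possible shape for the "ei_pload8"-style helper floated in the message, sketched under assumptions: `load_two_halves` and `first_load_of_thread` are hypothetical names, not Eigen functions. The sketch starts from a zeroed register so that GCC sees no read of an uninitialized variable. That is the initialization the message measured as having a small cost, so treat it as the simple option, not necessarily the fastest one.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_loadl_pi, _mm_loadh_pi
#include <cassert>

// Hypothetical helper (not an Eigen function): fill the two low lanes
// from `lo` and the two high lanes from `hi`. Each pointer must be valid
// for two floats; these loads do not require 16-byte alignment.
static inline __m128 load_two_halves(const float* lo, const float* hi) {
    assert(lo != nullptr && hi != nullptr);
    // Start from a defined value so no uninitialized variable is read.
    // Newer compilers also offer _mm_undefined_ps() for this purpose;
    // whether it helps performance here is an untested assumption.
    __m128 v = _mm_setzero_ps();
    v = _mm_loadl_pi(v, reinterpret_cast<const __m64*>(lo));
    v = _mm_loadh_pi(v, reinterpret_cast<const __m64*>(hi));
    return v;
}

// Gathers the same elements as the first load quoted in the thread,
// src[0], src[1], src[4], src[5], and stores them to `out`.
static inline void first_load_of_thread(const float* src, float out[4]) {
    _mm_storeu_ps(out, load_two_halves(src, src + 4));
}
```

The other two loads from the thread would be `load_two_halves(src + 2, src + 6)` and `load_two_halves(src + 10, src + 14)`. Since both halves are overwritten, the initial value never reaches the result.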
