|Re: [eigen] Re: 4x4 matrix inverse|
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] Re: 4x4 matrix inverse
- From: Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
- Date: Tue, 15 Dec 2009 07:45:26 -0500
2009/12/15 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
> On Tue, Dec 15, 2009 at 12:52 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>> 2009/12/15 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>> > On Tue, Dec 15, 2009 at 5:25 AM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
>> > wrote:
>> >> Hi,
>> >> To summarize recent commits: all this is now done in the development
>> >> branch, it only remains to consider backporting.
>> >> The SSE code is 4.5x faster than my plain scalar path! I guess that's
>> >> explained not only by SSE intrinsics but also by better ordering of
>> >> instructions...
>> >> There is one thing where I didn't follow Intel's code: they use a
>> >> RCPSS instruction to compute 1/det approximately, then followed by a
>> >> Newton-Raphson iteration. This sacrifices up to 2 bits of precision in
>> >> the mantissa, which already is a bit nontrivial for us (4x4 matrix
>> >> inversion is a basic operation on which people will rely very
>> >> heavily). To help solve that dilemma (performance vs precision) I
>> >> benchmarked it, and it turns out that on my Core i7, DIVSS is
>> >> slightly faster! Intel's paper was written for the Pentium 3, so
>> >> that's perhaps not surprising, but I saw forum posts mentioning that
>> >> the RCPSS trick is still faster on the Core2. If you want to test, see
>> >> lines 128-130 in Inverse_SSE.h.
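[Editor's note: a minimal sketch of the RCPSS + Newton-Raphson reciprocal discussed above. The function name is illustrative, not Eigen's; the iteration form x' = x*(2 - d*x) is one standard variant of Intel's recipe. RCPSS alone gives roughly 12 bits of precision; one refinement step roughly squares the relative error, which is still short of a full IEEE-correct division, hence the precision concern.]

```c
#include <xmmintrin.h>

/* Approximate 1/det via RCPSS plus one Newton-Raphson step.
 * Name and exact formulation are illustrative. */
static inline __m128 approx_recip(__m128 det)
{
    __m128 x = _mm_rcp_ss(det);                 /* ~12-bit approximation of 1/det */
    /* Newton-Raphson refinement for f(x) = 1/x - det:  x' = x * (2 - det*x) */
    __m128 two = _mm_set_ss(2.0f);
    x = _mm_mul_ss(x, _mm_sub_ss(two, _mm_mul_ss(det, x)));
    return x;
}
```

The alternative Benoit benchmarked is simply `_mm_div_ss(_mm_set_ss(1.0f), det)`, which is exact to single precision.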
>> >> I have a question. I currently get warnings in this code (taken
>> >> straight from Intel):
>> >> __m128 tmp1;
>> >> tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)), (__m64*)(src+ 4));
>> > why don't you use tmp1 = ei_pload(src); since we know src will be
>> > aligned?
>> You're right, I didn't look carefully at this line from Intel, and
>> just below, there are these lines:
>> tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src+ 2)), (__m64*)(src+ 6));
>> row3 = _mm_loadh_pi(_mm_loadl_pi(row3, (__m64*)(src+10)), (__m64*)(src+14));
>> which cannot use an ei_pload since they use non-multiple-of-16-byte
>> offsets, and that's where I got confused.
>> By the way, is this trick (loading from 64-bit aligned addresses)
>> worth abstracting into an "ei_pload8" function? It's probably faster
>> than completely unaligned loads...
> this is more or less what we do in ei_ploadu, but using a pair of
> movsd/movhps and inline assembly to avoid GCC messing it up. But currently
> for MSVC we use raw unaligned loads, so we should probably switch to a pair
> of movsd/movhps there too. I chose a pair of movsd/movhps because it
> appeared to be the fastest option for my CPU.
> Finally, to be clear, I think you can replace these 2 lines with ei_ploadu
> and the perf should be the same.
Actually, these lines were not equivalent to loads!
When you look at this,
tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)), (__m64*)(src+ 4));
The second half is loaded from src+4, not src+2.
What is being loaded here is the top-left 2x2 corner of the matrix.
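[Editor's note: to make the 2x2-corner gather concrete, here is the same pair of half loads in isolation, assuming a row-major 4x4 matrix. Loading the low half from src and the high half from src+4 picks up src[0], src[1], src[4], src[5], i.e. the top-left 2x2 block, not a contiguous row, so it cannot be replaced by a single load.]

```c
#include <xmmintrin.h>

/* Gather the top-left 2x2 corner of a row-major 4x4 float matrix.
 * Illustrative function name. */
static inline __m128 load_topleft_2x2(const float* src)
{
    __m128 v = _mm_setzero_ps();
    v = _mm_loadl_pi(v, (const __m64*)src);        /* src[0], src[1] : row 0, cols 0-1 */
    v = _mm_loadh_pi(v, (const __m64*)(src + 4));  /* src[4], src[5] : row 1, cols 0-1 */
    return v;
}
```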