Re: [eigen] SSE square root
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] SSE square root
- From: Gael Guennebaud <gael.guennebaud@xxxxxxxxx>
- Date: Fri, 27 Mar 2009 13:30:49 +0100
On Fri, Mar 27, 2009 at 1:23 PM, Gael Guennebaud
<gael.guennebaud@xxxxxxxxx> wrote:
> On Fri, Mar 27, 2009 at 1:12 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>> Wonderful job again, 2 questions:
>> 1) if you add a newton step to the last version, the error becomes 0,
>> right? as (2e-7)^2=4e-14 which is 0 for floats. Then how is
>> performance affected?
>
> the problem is that doing 2 iterations instead of one does not improve
> the accuracy (max error = 2 bits, perf = 410M / sec). Same for
> the version inspired by Rohit's code:
>
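> template<> Packet4f ei_psqrt(Packet4f _x)
> {
>   // initial estimate: sqrt(_x) ~= _x * rsqrt(_x)
>   Packet4f x = ei_pmul(_x,_mm_rsqrt_ps(_x));
>   // one Heron (Babylonian) step: x <- 0.5 * (x + _x / x)
>   x = ei_pmul(ei_pset1(.5f), ei_padd(x,ei_pdiv(_x,x)));
>   return x;
> }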
>
> doing 2 iters does not improve accuracy (max error = 1 bit)
hm, sorry that's not entirely true. I increased the number of samples,
and for the fastest version:
1 iter = 42% of failures with max error = 2.78308e-07
2 iter = 36% of failures with max error = 2.15188e-07
3 iter = 34% of failures with max error = 1.87297e-07
>
>
>> 2) maybe rename TranscendentalFunctions to just MathFunctions as sqrt
>> is not a transcendental function...
>
> sure.
>
>> cheers,
>> Benoit
>>
>> 2009/3/27 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>>> On Fri, Mar 27, 2009 at 12:59 PM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>>> The difference between the last two makes no sense to me. Why should
>>>> interchanging the order of iteration and multiplication result in a
>>>> ~1.7x perf difference? Instruction pipelining?
>>>
>>> the iterations are not the same. in the first case it involves an
>>> expensive div.
>>>
>>>> Which version of gcc did you use btw?
>>>
>>> upcoming gcc 4.4
>>>
>>>> On Fri, Mar 27, 2009 at 5:23 PM, Gael Guennebaud
>>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>>>> here is what I get:
>>>>>
>>>>> fpu: 66M scalar / sec
>>>>>
>>>>> _mm_sqrt_ps: 250M scalar sqrt / sec ; max error = 0
>>>>>
>>>>> _mm_rsqrt_ps followed by x* (1/sqrt(x)) and one iteration: 378M scalar
>>>>> sqrt / sec ; max error 1e-7
>>>>>
>>>>> _mm_rsqrt_ps followed by one iteration to get an accurate 1/sqrt(x)
>>>>> followed by one mul: 635M scalar sqrt /sec ; max error: 2e-7
>>>>> (using 2 iterations does not improve the accuracy)
>>>>>
>>>>> I'm testing in the range [0:1e5], and for the reference I convert the
>>>>> float values to double and call the libc sqrt function.
>>>>>
>>>>> I guess the last version is the winner. For information here it is:
>>>>>
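>>>>> template<> Packet4f ei_psqrt(Packet4f _x)
>>>>> {
>>>>>   Packet4f half = ei_pmul(_x, ei_pset1(.5f));
>>>>>   // approximate 1/sqrt(_x)
>>>>>   Packet4f x = _mm_rsqrt_ps(_x);
>>>>>   // one Newton step on 1/sqrt(_x): x <- x * (1.5 - 0.5 * _x * x * x)
>>>>>   x = ei_pmul(x, ei_psub(ei_pset1(1.5f), ei_pmul(half, ei_pmul(x,x))));
>>>>>   // sqrt(_x) = _x * (1/sqrt(_x))
>>>>>   x = ei_pmul(_x,x);
>>>>>   return x;
>>>>> }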
>>>>>
>>>>>
>>>>> On Fri, Mar 27, 2009 at 11:38 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>>>>> On Fri, Mar 27, 2009 at 3:58 PM, Gael Guennebaud
>>>>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>>>>>> On Fri, Mar 27, 2009 at 8:32 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>>>>>>> This file has my sse float implementation for square root. The SSE
>>>>>>>> square root instruction has only 12 bits of precision so extra
>>>>>>>
>>>>>>> where did you find sqrtss or sqrtps has only 12 bits of precision ?
>>>>>>
>>>>>> This info is from the CUDA classes. The lectures put up there say that
>>>>>> the precision for square root is only 12 bits. Now I need to confirm..
>>>>>> Your idea for an approximate reciprocal square root, a mul, and 1
>>>>>> iteration is a good one. Let me try that.
>>>>>>
>>>>>> --
>>>>>> Rohit Garg
>>>>>>
>>>>>> http://rpg-314.blogspot.com/
>>>>>>
>>>>>> Senior Undergraduate
>>>>>> Department of Physics
>>>>>> Indian Institute of Technology
>>>>>> Bombay