Re: [eigen] SSE square root



On Fri, Mar 27, 2009 at 1:12 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> Wonderful job again, 2 questions:
> 1) if you add a newton step to the last version, the error becomes 0,
> right? as (2e-7)^2=4e-14 which is 0 for floats. Then how is
> performance affected?

The problem is that doing 2 iterations instead of one does not improve
the accuracy (max error = 2 bits, perf = 410M / sec). The same holds for
the first version, inspired by Rohit's code:

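template<> Packet4f ei_psqrt(Packet4f _x)
{
  // initial estimate: sqrt(_x) ~= _x * (1/sqrt(_x))
  Packet4f x = ei_pmul(_x,_mm_rsqrt_ps(_x));
  // one Heron step: x <- 0.5 * (x + _x / x); note the division
  x = ei_pmul(ei_pset1(.5f), ei_padd(x,ei_pdiv(_x,x)));
  return x;
}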

Doing 2 iterations does not improve the accuracy here either (max error = 1 bit).
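
One plausible reading of why the second step buys nothing: a Newton step
roughly squares the relative error of the estimate, so a seed good to
about 2^-12 is already close to single-precision rounding error after one
step, and further steps only reshuffle rounding noise. The sketch below is
scalar C++, not SSE. rough_rsqrt is a hypothetical stand-in for
_mm_rsqrt_ps whose 2^-12 relative error is an assumption, not a hardware
measurement, and all function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// One Newton step on an estimate y of 1/sqrt(x): y <- y * (1.5 - 0.5 * x * y * y).
inline float rsqrt_newton_step(float x, float y) {
    return y * (1.5f - 0.5f * x * y * y);
}

// Hypothetical stand-in for _mm_rsqrt_ps: 1/sqrt(x) with an assumed
// relative error of 2^-12 injected on purpose.
inline float rough_rsqrt(float x) {
    assert(x > 0.0f);
    const double exact = 1.0 / std::sqrt(static_cast<double>(x));
    return static_cast<float>(exact * (1.0 + std::ldexp(1.0, -12)));
}

// Largest relative error of x * y against a double-precision sqrt after
// `steps` Newton steps, sampled at the integers 1..100000.
inline double max_rel_error(int steps) {
    double worst = 0.0;
    for (int i = 1; i <= 100000; ++i) {
        const float x = static_cast<float>(i);
        float y = rough_rsqrt(x);
        for (int s = 0; s < steps; ++s) y = rsqrt_newton_step(x, y);
        const float approx = x * y;  // sqrt(x) = x * (1/sqrt(x))
        const double ref = std::sqrt(static_cast<double>(x));
        const double err = std::fabs(static_cast<double>(approx) - ref) / ref;
        if (err > worst) worst = err;
    }
    return worst;
}
```

Comparing max_rel_error(0), max_rel_error(1) and max_rel_error(2) should
show a large drop after the first step and little change after the second.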


> 2) maybe rename TranscendentalFunctions to just MathFunctions as sqrt
> is not a transcendental function...

sure.

> cheers,
> Benoit
>
> 2009/3/27 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>> On Fri, Mar 27, 2009 at 12:59 PM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>> The difference in the last two makes no sense to me. Why should
>>> interchanging the order of iteration and multiplication result in a
>>> ~1.7x perf difference? Instruction pipelining?
>>
>> The iterations are not the same: in the first case each iteration
>> involves an expensive division (ei_pdiv), while in the second it uses
>> only multiplications and a subtraction.
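
To make the difference concrete, here are the two update rules written as
scalar C++. This is only an illustrative sketch: heron_step and rsqrt_step
are hypothetical names, and the packet versions in this thread apply the
same arithmetic to four floats at once.

```cpp
#include <cassert>

// Heron (Babylonian) step on an estimate s of sqrt(x).
// Costs one division per step.
inline float heron_step(float x, float s) {
    assert(s > 0.0f);  // avoid dividing by zero
    return 0.5f * (s + x / s);
}

// Newton step on an estimate y of 1/sqrt(x).
// Uses only multiplications and one subtraction; sqrt(x) is then x * y.
inline float rsqrt_step(float x, float y) {
    return y * (1.5f - 0.5f * x * y * y);
}
```

For example, heron_step(2.0f, 1.5f) moves the estimate toward sqrt(2), and
2.0f * rsqrt_step(2.0f, 0.7f) does the same without dividing.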
>>
>>> Which version of gcc did you use btw?
>>
>> upcoming gcc 4.4
>>
>>> On Fri, Mar 27, 2009 at 5:23 PM, Gael Guennebaud
>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>>> here is what I get:
>>>>
>>>> fpu: 66M scalar / sec
>>>>
>>>> _mm_sqrt_ps: 250M  scalar sqrt / sec ; max error = 0
>>>>
>>>> _mm_rsqrt_ps followed by x* (1/sqrt(x)) and one iteration: 378M scalar
>>>> sqrt / sec ; max error 1e-7
>>>>
>>>> _mm_rsqrt_ps followed by one iteration to get an accurate 1/sqrt(x)
>>>> followed by one mul: 635M scalar sqrt /sec ; max error: 2e-7
>>>> (using 2 iterations does not improve the accuracy)
>>>>
>>>> I'm testing in the range [0:1e5], and for the reference I convert the
>>>> float values to double and call the libc sqrt function.
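
A measurement along these lines could be sketched as follows. The thread
does not say how "bits" of error are counted, so counting the distance in
representable floats (ULPs) is an assumption here, and ulp_distance and
max_ulp_error are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Distance between two non-negative finite floats, counted in
// representable values. Such floats are ordered like their bit patterns.
inline std::uint32_t ulp_distance(float a, float b) {
    std::uint32_t ia, ib;
    std::memcpy(&ia, &a, sizeof ia);
    std::memcpy(&ib, &b, sizeof ib);
    return ia > ib ? ia - ib : ib - ia;
}

// Largest ULP distance between approx_sqrt(x) and a reference obtained by
// converting x to double, calling sqrt, and rounding back to float.
// x is sampled evenly over [lo, hi].
template <typename F>
std::uint32_t max_ulp_error(F approx_sqrt, float lo, float hi, int samples) {
    assert(samples >= 2);
    std::uint32_t worst = 0;
    for (int i = 0; i < samples; ++i) {
        const float t = static_cast<float>(i) / static_cast<float>(samples - 1);
        const float x = lo + (hi - lo) * t;
        const float ref = static_cast<float>(std::sqrt(static_cast<double>(x)));
        const std::uint32_t d = ulp_distance(approx_sqrt(x), ref);
        if (d > worst) worst = d;
    }
    return worst;
}
```

Any callable taking and returning a float can be passed as approx_sqrt,
for instance a lambda wrapping one lane of a packet implementation.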
>>>>
>>>> I guess the last version is the winner. For information here it is:
>>>>
>>>> template<> Packet4f ei_psqrt(Packet4f _x)
>>>> {
>>>>   Packet4f half = ei_pmul(_x, ei_pset1(.5f));
>>>>   // approximate 1/sqrt(_x)
>>>>   Packet4f x = _mm_rsqrt_ps(_x);
>>>>   // one Newton step: x <- x * (1.5 - 0.5 * _x * x * x)
>>>>   x = ei_pmul(x, ei_psub(ei_pset1(1.5f), ei_pmul(half, ei_pmul(x,x))));
>>>>   // sqrt(_x) = _x * (1/sqrt(_x))
>>>>   x = ei_pmul(_x,x);
>>>>   return x;
>>>> }
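
For trying this outside Eigen, the same steps can be written with raw SSE
intrinsics. This is a sketch, not the code that went into Eigen:
sqrt4_via_rsqrt is a hypothetical name, the scalar fallback exists only so
the example compiles on non-x86 targets, and the final masking of zero
inputs is an addition of this sketch (without it, an input of 0 gives
0 * inf = NaN).

```cpp
#include <cassert>
#include <cmath>
#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#endif

// sqrt of four floats via approximate 1/sqrt(x), one Newton step, one mul.
// `in` and `out` each point to 4 floats; inputs are assumed non-negative.
inline void sqrt4_via_rsqrt(const float* in, float* out) {
    assert(in != nullptr && out != nullptr);
#ifdef SKETCH_HAVE_SSE
    const __m128 x    = _mm_loadu_ps(in);
    const __m128 half = _mm_mul_ps(x, _mm_set1_ps(0.5f));
    __m128 y = _mm_rsqrt_ps(x);  // low-precision estimate of 1/sqrt(x)
    // Newton step: y <- y * (1.5 - 0.5 * x * y * y)
    y = _mm_mul_ps(y, _mm_sub_ps(_mm_set1_ps(1.5f),
                                 _mm_mul_ps(half, _mm_mul_ps(y, y))));
    __m128 r = _mm_mul_ps(x, y);  // sqrt(x) = x * (1/sqrt(x))
    // Lanes where x == 0 hold NaN at this point; force them to 0.
    r = _mm_andnot_ps(_mm_cmpeq_ps(x, _mm_setzero_ps()), r);
    _mm_storeu_ps(out, r);
#else
    // Scalar fallback with the same structure (seed is exact here).
    for (int i = 0; i < 4; ++i) {
        const float x = in[i];
        float y = 1.0f / std::sqrt(x);
        y = y * (1.5f - 0.5f * x * y * y);
        out[i] = (x == 0.0f) ? 0.0f : x * y;
    }
#endif
}
```

Unaligned loads and stores are used so that any float[4] can be passed in;
aligned variants would be the natural choice inside a packet library.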
>>>>
>>>>
>>>> On Fri, Mar 27, 2009 at 11:38 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>>>> On Fri, Mar 27, 2009 at 3:58 PM, Gael Guennebaud
>>>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>>>>> On Fri, Mar 27, 2009 at 8:32 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>>>>>> This file has my sse float implementation for square root. The SSE
>>>>>>> square root instruction has only 12 bits of precision so extra
>>>>>>
>>>>>> Where did you find that sqrtss or sqrtps has only 12 bits of precision?
>>>>>
>>>>> This info is from the CUDA classes. The lectures put up there say that
>>>>> the precision for square root is only 12 bits. Now I need to confirm.
>>>>> Your idea for an approximate reciprocal square root, a mul, and 1
>>>>> iteration is a good one. Let me try that.
>>>>>
>>>>> --
>>>>> Rohit Garg
>>>>>
>>>>> http://rpg-314.blogspot.com/
>>>>>
>>>>> Senior Undergraduate
>>>>> Department of Physics
>>>>> Indian Institute of Technology
>>>>> Bombay


