Re: [eigen] SSE square root

Here is what I get:

fpu: 66M scalar sqrt / sec

_mm_sqrt_ps: 250M  scalar sqrt / sec ; max error = 0

_mm_rsqrt_ps followed by x * (1/sqrt(x)) and one iteration: 378M scalar
sqrt / sec ; max error = 1e-7

_mm_rsqrt_ps followed by one iteration to get an accurate 1/sqrt(x)
followed by one mul: 635M scalar sqrt /sec ; max error: 2e-7
(using 2 iterations does not improve the accuracy)
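
One way to see why a single iteration is enough, as a sketch only: it assumes the documented bound of 1.5 * 2^-12 on the relative error of _mm_rsqrt_ps and ignores float rounding. Here y_n is the current estimate of 1/sqrt(x) and epsilon is its relative error:

```latex
\begin{align*}
y_{n+1} &= y_n\left(\tfrac{3}{2} - \tfrac{x}{2}\,y_n^2\right), \\
y_n = \frac{1+\varepsilon}{\sqrt{x}} \;&\Longrightarrow\; y_{n+1} = \frac{1 - \tfrac{3}{2}\varepsilon^2 - \tfrac{1}{2}\varepsilon^3}{\sqrt{x}}, \\
|\varepsilon| \le 1.5\cdot 2^{-12} \;&\Longrightarrow\; \tfrac{3}{2}\varepsilon^2 \approx 2\times 10^{-7}.
\end{align*}
```

That is the same order as the 2e-7 measured above and already close to single-precision epsilon (2^-23, about 1.2e-7), which would be consistent with a second iteration not helping.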

I'm testing in the range [0:1e5], and for the reference I convert the
float values to double and call the libc sqrt function.
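
A minimal sketch of this kind of accuracy check, not the actual benchmark code. It assumes the error is relative and that the double-precision reference is rounded back to float before comparing (neither is stated above). The name max_rel_error_sqrt_ps is hypothetical, and x = 0 is skipped to avoid dividing by zero:

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical harness: largest relative error of _mm_sqrt_ps over n
// evenly spaced samples in (0, hi], against libc sqrt evaluated in double.
float max_rel_error_sqrt_ps(float hi, int n) {
    assert(n >= 4);
    float worst = 0.0f;
    for (int i = 0; i + 4 <= n; i += 4) {
        float in[4], out[4];
        for (int k = 0; k < 4; ++k)
            in[k] = hi * float(i + k + 1) / float(n);  // strictly positive
        _mm_storeu_ps(out, _mm_sqrt_ps(_mm_loadu_ps(in)));
        for (int k = 0; k < 4; ++k) {
            // Reference: convert to double, call libc sqrt, round to float.
            float ref = float(std::sqrt(double(in[k])));
            worst = std::max(worst, std::fabs(out[k] - ref) / ref);
        }
    }
    return worst;
}
```

Calling max_rel_error_sqrt_ps(1e5f, 1 << 16) covers the same range as above. Swapping _mm_sqrt_ps for another kernel measures that kernel instead.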

I guess the last version is the winner. For reference, here it is:

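template<> Packet4f ei_psqrt(Packet4f _x)
{
  // half = 0.5 * _x, used by the Newton-Raphson step below
  Packet4f half = ei_pmul(_x, ei_pset1(.5f));
  // initial approximation of 1/sqrt(_x)
  Packet4f x = _mm_rsqrt_ps(_x);
  // one Newton-Raphson iteration: x = x * (1.5 - half * x * x)
  x = ei_pmul(x, ei_psub(ei_pset1(1.5f), ei_pmul(half, ei_pmul(x,x))));
  // sqrt(_x) = _x * (1/sqrt(_x))
  x = ei_pmul(_x,x);
  return x;
}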
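
For trying this outside Eigen, the same steps can be sketched with raw SSE intrinsics. This assumes an x86 target. The names fast_sqrt_ps, fast_sqrt and rel_error are hypothetical and not part of Eigen:

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)
#include <cmath>

// sqrt(v) computed as v * y, where y starts from _mm_rsqrt_ps(v) and is
// refined by one Newton-Raphson step: y <- y * (1.5 - 0.5 * v * y * y).
// Caveat: a lane holding 0 should come out as NaN rather than 0, since
// rsqrt(0) is +inf and 0 * inf is NaN; handle zero separately if needed.
inline __m128 fast_sqrt_ps(__m128 v) {
    const __m128 half = _mm_mul_ps(v, _mm_set1_ps(0.5f));
    __m128 y = _mm_rsqrt_ps(v);  // roughly 12-bit estimate of 1/sqrt(v)
    y = _mm_mul_ps(y, _mm_sub_ps(_mm_set1_ps(1.5f),
                                 _mm_mul_ps(half, _mm_mul_ps(y, y))));
    return _mm_mul_ps(v, y);     // v * (1/sqrt(v)) = sqrt(v)
}

// Hypothetical scalar wrapper: broadcast x, run the packet kernel,
// and return the lowest lane.
inline float fast_sqrt(float x) {
    return _mm_cvtss_f32(fast_sqrt_ps(_mm_set1_ps(x)));
}

// Relative error of fast_sqrt against libc sqrt evaluated in double.
inline double rel_error(float x) {
    const double ref = std::sqrt(double(x));
    return std::fabs(double(fast_sqrt(x)) - ref) / ref;
}
```

For example, rel_error(2.0f) should land well below 1e-5 on typical hardware. The exact value depends on the CPU's rsqrt approximation.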

On Fri, Mar 27, 2009 at 11:38 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
> On Fri, Mar 27, 2009 at 3:58 PM, Gael Guennebaud
> <gael.guennebaud@xxxxxxxxx> wrote:
>> On Fri, Mar 27, 2009 at 8:32 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>> This file has my sse float implementation for square root. The SSE
>>> square root instruction has only 12 bits of precision so extra
>> where did you find sqrtss or sqrtps has only 12 bits of precision ?
> This info is from the CUDA classes. The lectures put up there say that
> the precision for square root is only 12 bits. Now I need to confirm.
> Your idea for an approximate reciprocal square root, a mul, and 1
> iteration is a good one. Let me try that.
> --
> Rohit Garg
> Senior Undergraduate
> Department of Physics
> Indian Institute of Technology
> Bombay
