Re: [eigen] SSE square root


- *To*: eigen@xxxxxxxxxxxxxxxxxxx
- *Subject*: Re: [eigen] SSE square root
- *From*: Rohit Garg <rpg.314@xxxxxxxxx>
- *Date*: Fri, 27 Mar 2009 17:57:31 +0530

1e-7 is about 22 binary bits anyway, so perhaps another step won't do
any good, at least in single precision.

On Fri, Mar 27, 2009 at 5:53 PM, Gael Guennebaud
<gael.guennebaud@xxxxxxxxx> wrote:
> On Fri, Mar 27, 2009 at 1:12 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>> Wonderful job again, 2 questions:
>> 1) if you add a newton step to the last version, the error becomes 0,
>> right? as (2e-7)^2=4e-14 which is 0 for floats. Then how is
>> performance affected?
>
> the problem is that doing 2 iterations instead of one does not improve
> the accuracy.... (max error = 2 bits, perf = 410M / sec). Same for
> the first version inspired from Rohit code:
>
> template<> Packet4f ei_psqrt(Packet4f _x)
> {
>   Packet4f x = ei_pmul(_x,_mm_rsqrt_ps(_x));
>   x = ei_pmul(ei_pset1(.5f), ei_padd(x,ei_pdiv(_x,x)));
>   return x;
> }
>
> doing 2 iters does not improve accuracy (max error = 1 bit)
>
>> 2) maybe rename TranscendentalFunctions to just MathFunctions as sqrt
>> is not a transcendental function...
>
> sure.
>
>> cheers,
>> Benoit
>>
>> 2009/3/27 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>>> On Fri, Mar 27, 2009 at 12:59 PM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>>> The difference in the last two makes no sense to me. Why should
>>>> interchanging the order of iteration and multiplication result in
>>>> ~1.7x perf difference. instruction pipelining?
>>>
>>> the iterations are not the same. in the first case it involves an
>>> expensive div.
>>>
>>>> Which version of gcc did you use btw?
>>>
>>> upcoming gcc 4.4
>>>
>>>> On Fri, Mar 27, 2009 at 5:23 PM, Gael Guennebaud
>>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>>>> here is what I get:
>>>>>
>>>>> fpu: 66M scalar / sec
>>>>>
>>>>> _mm_sqrt_ps: 250M scalar sqrt / sec ; max error = 0
>>>>>
>>>>> _mm_rsqrt_ps followed by x* (1/sqrt(x)) and one iteration: 378M scalar
>>>>> sqrt / sec ; max error 1e-7
>>>>>
>>>>> _mm_rsqrt_ps followed by one iteration to get an accurate 1/sqrt(x)
>>>>> followed by one mul: 635M scalar sqrt /sec ; max error: 2e-7
>>>>> (using 2 iterations does not improve the accuracy)
>>>>>
>>>>> I'm testing in the range [0:1e5], and for the reference I convert the
>>>>> float values to double and call the libc sqrt function.
>>>>>
>>>>> I guess the last version is the winner. For information here it is:
>>>>>
>>>>> template<> Packet4f ei_psqrt(Packet4f _x)
>>>>> {
>>>>>   Packet4f half = ei_pmul(_x, ei_pset1(.5f));
>>>>>   Packet4f x = _mm_rsqrt_ps(_x);
>>>>>   x = ei_pmul(x, ei_psub(ei_pset1(1.5f), ei_pmul(half, ei_pmul(x,x))));
>>>>>   x = ei_pmul(_x,x);
>>>>>   return x;
>>>>> }
>>>>>
>>>>> On Fri, Mar 27, 2009 at 11:38 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>>>>> On Fri, Mar 27, 2009 at 3:58 PM, Gael Guennebaud
>>>>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>>>>>> On Fri, Mar 27, 2009 at 8:32 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>>>>>>> This file has my sse float implementation for square root. The SSE
>>>>>>>> square root instruction has only 12 bits of precision so extra
>>>>>>>
>>>>>>> where did you find sqrtss or sqrtps has only 12 bits of precision ?
>>>>>>
>>>>>> This info is from the CUDA classes. The lectures put up there say that
>>>>>> the precision for square root is only 12 bits. Now I need to confirm..
>>>>>> Your idea for an approximate reciprocal square root, a mul, and 1
>>>>>> iteration is a good one. Let me try that.

--
Rohit Garg

http://rpg-314.blogspot.com/

Senior Undergraduate
Department of Physics
Indian Institute of Technology
Bombay
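
For readers without the Eigen internals at hand, the two variants discussed in the thread can be sketched with raw SSE intrinsics. This is only a minimal sketch under stated assumptions, not Eigen's actual code. The names `sqrt_heron`, `sqrt_rsqrt_newton` and `max_rel_error` are hypothetical. It assumes an x86 target with SSE, and it assumes strictly positive inputs, since for 0 the `_mm_rsqrt_ps` estimate is infinite and the final product becomes NaN. The error helper follows the approach described in the thread (convert to double and compare against the libc `sqrt`), but measures relative error, which is an assumption about what "max error" meant.

```cpp
#include <xmmintrin.h>  // SSE: _mm_rsqrt_ps, _mm_mul_ps, _mm_div_ps, ...
#include <algorithm>
#include <cassert>
#include <cmath>

// Variant with a division: start from a * rsqrt(a) ~ sqrt(a), then apply one
// Newton (Heron) step on the square root itself: x' = 0.5 * (x + a / x).
inline __m128 sqrt_heron(__m128 a) {
    __m128 x = _mm_mul_ps(a, _mm_rsqrt_ps(a));
    return _mm_mul_ps(_mm_set1_ps(0.5f), _mm_add_ps(x, _mm_div_ps(a, x)));
}

// Division-free variant: refine y ~ 1/sqrt(a) with one Newton step,
// y' = y * (1.5 - 0.5 * a * y * y), then multiply by a to get sqrt(a).
inline __m128 sqrt_rsqrt_newton(__m128 a) {
    __m128 half = _mm_mul_ps(a, _mm_set1_ps(0.5f));
    __m128 y = _mm_rsqrt_ps(a);
    y = _mm_mul_ps(y, _mm_sub_ps(_mm_set1_ps(1.5f),
                                 _mm_mul_ps(half, _mm_mul_ps(y, y))));
    return _mm_mul_ps(a, y);
}

// Hypothetical helper: largest relative error of f over roughly n samples
// spread evenly across (lo, hi), using double-precision std::sqrt as reference.
inline double max_rel_error(__m128 (*f)(__m128), float lo, float hi, int n) {
    assert(n > 0 && lo > 0.0f && hi > lo);  // zero is excluded on purpose
    double worst = 0.0;
    for (int i = 0; i < n; i += 4) {
        float in[4], out[4];
        for (int k = 0; k < 4; ++k)
            in[k] = lo + (hi - lo) * (float(i + k) + 0.5f) / float(n);
        _mm_storeu_ps(out, f(_mm_loadu_ps(in)));
        for (int k = 0; k < 4; ++k) {
            double ref = std::sqrt(double(in[k]));
            worst = std::max(worst, std::fabs(double(out[k]) - ref) / ref);
        }
    }
    return worst;
}
```

Calling, for example, `max_rel_error(sqrt_rsqrt_newton, 1.0f, 1e5f, 100000)` and the same with `sqrt_heron` gives a rough way to compare the accuracy of the two variants on a given machine. Timing them is left out here, since throughput depends heavily on the compiler and CPU.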

