Re: [eigen] SSE square root


To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: Re: [eigen] SSE square root
From: Rohit Garg <rpg.314@xxxxxxxxx>
Date: Fri, 27 Mar 2009 17:29:42 +0530

The difference between the last two makes no sense to me. Why should
interchanging the order of the iteration and the multiplication result
in a ~1.7x performance difference? Instruction pipelining?

Which version of gcc did you use btw?
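
For reference, the refinement step in the quoted ei_psqrt below appears
to be one Newton-Raphson iteration on f(y) = 1/y^2 - a, whose root is
y = 1/sqrt(a). A short derivation follows, writing a for the input and
e_n for the relative error of the current estimate (both symbols are
mine, not from the thread). The 1.5 * 2^-12 bound is the one Intel
documents for RSQRTPS:

```latex
\begin{aligned}
y_{n+1} &= y_n - \frac{f(y_n)}{f'(y_n)}
         = y_n + \frac{y_n^{-2} - a}{2\,y_n^{-3}}
         = y_n\left(1.5 - 0.5\,a\,y_n^{2}\right), \\
y_n = \frac{1 + e_n}{\sqrt{a}}
  \;&\Rightarrow\;
  y_{n+1} = \frac{1 - 1.5\,e_n^{2} - 0.5\,e_n^{3}}{\sqrt{a}}, \\
|e_0| \le 1.5 \cdot 2^{-12} \approx 3.7 \times 10^{-4}
  \;&\Rightarrow\;
  |e_1| \approx 1.5\,e_0^{2} \approx 2 \times 10^{-7}.
\end{aligned}
```

That is the same order of magnitude as the 2e-7 reported below, though
the thread does not say whether that figure is relative or absolute,
and float rounding is ignored here.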

On Fri, Mar 27, 2009 at 5:23 PM, Gael Guennebaud
<gael.guennebaud@xxxxxxxxx> wrote:
> here is what I get:
>
> fpu: 66M scalar sqrt / sec
>
> _mm_sqrt_ps: 250M scalar sqrt / sec; max error = 0
>
> _mm_rsqrt_ps followed by x * (1/sqrt(x)) and one iteration: 378M scalar
> sqrt / sec; max error 1e-7
>
> _mm_rsqrt_ps followed by one iteration to get an accurate 1/sqrt(x),
> followed by one mul: 635M scalar sqrt / sec; max error 2e-7
> (using 2 iterations does not improve the accuracy)
>
> I'm testing in the range [0:1e5], and for the reference I convert the
> float values to double and call the libc sqrt function.
>
> I guess the last version is the winner. For information here it is:
>
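> template<> Packet4f ei_psqrt(Packet4f _x)
> {
>  // 0.5 * _x, reused by the Newton-Raphson step below
>  Packet4f half = ei_pmul(_x, ei_pset1(.5f));
>  // fast approximation of 1/sqrt(_x)
>  Packet4f x = _mm_rsqrt_ps(_x);
>  // one Newton-Raphson step: x = x * (1.5 - 0.5 * _x * x * x)
>  x = ei_pmul(x, ei_psub(ei_pset1(1.5f), ei_pmul(half, ei_pmul(x,x))));
>  // sqrt(_x) = _x * (1/sqrt(_x))
>  x = ei_pmul(_x,x);
>  return x;
> }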
>
>
> On Fri, Mar 27, 2009 at 11:38 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>> On Fri, Mar 27, 2009 at 3:58 PM, Gael Guennebaud
>> <gael.guennebaud@xxxxxxxxx> wrote:
>>> On Fri, Mar 27, 2009 at 8:32 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>>> This file has my sse float implementation for square root. The SSE
>>>> square root instruction has only 12 bits of precision so extra
>>>
>>> where did you find sqrtss or sqrtps has only 12 bits of precision ?
>>
>> This info is from the CUDA classes. The lectures put up there say that
>> the precision for square root is only 12 bits. Now I need to confirm.
>> Your idea for an approximate reciprocal square root, a mul, and 1
>> iteration is a good one. Let me try that.
>>
>> --
>> Rohit Garg
>>
>> http://rpg-314.blogspot.com/
>>
>> Senior Undergraduate
>> Department of Physics
>> Indian Institute of Technology
>> Bombay
>>
>>
>>
>
>
>
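
To make the comparison concrete, here is a minimal sketch of the two
orderings using raw SSE intrinsics instead of Eigen's ei_* wrappers.
sqrt_refine_rsqrt mirrors the quoted ei_psqrt. sqrt_refine_sqrt is only
a guess at the slower variant, since its code is not in the thread: it
assumes the iteration applied to the sqrt estimate is a Heron step,
which needs a divps. The function names and the max_rel_error helper
are hypothetical, and the code is x86-only.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 only)
#include <cmath>

// Variant A (mirrors the quoted ei_psqrt): refine 1/sqrt(a) with one
// Newton-Raphson step, then multiply by a. No division is involved.
static inline __m128 sqrt_refine_rsqrt(__m128 a)
{
  const __m128 half_a = _mm_mul_ps(a, _mm_set1_ps(0.5f));
  __m128 y = _mm_rsqrt_ps(a);  // rough estimate of 1/sqrt(a)
  // y <- y * (1.5 - 0.5 * a * y * y)
  y = _mm_mul_ps(y, _mm_sub_ps(_mm_set1_ps(1.5f),
                               _mm_mul_ps(half_a, _mm_mul_ps(y, y))));
  return _mm_mul_ps(a, y);     // sqrt(a) = a * (1/sqrt(a))
}

// Variant B (ASSUMPTION, not from the thread): multiply first, then
// refine the sqrt estimate with one Heron step s <- 0.5 * (s + a / s).
static inline __m128 sqrt_refine_sqrt(__m128 a)
{
  __m128 s = _mm_mul_ps(a, _mm_rsqrt_ps(a));  // rough estimate of sqrt(a)
  return _mm_mul_ps(_mm_set1_ps(0.5f), _mm_add_ps(s, _mm_div_ps(a, s)));
}

// Largest relative error over four positive inputs, measured against
// the double-precision libc sqrt.
static double max_rel_error(__m128 (*f)(__m128), const float in[4])
{
  float out[4];
  _mm_storeu_ps(out, f(_mm_loadu_ps(in)));
  double worst = 0.0;
  for (int i = 0; i < 4; ++i) {
    const double ref = std::sqrt(static_cast<double>(in[i]));
    const double err = std::fabs(out[i] - ref) / ref;
    if (err > worst) worst = err;
  }
  return worst;
}
```

Calling max_rel_error(sqrt_refine_rsqrt, in) on four positive floats
gives a quick accuracy check for either variant. Note that both return
NaN for an input of 0, because _mm_rsqrt_ps(0) is +inf and 0 * inf is
NaN, so a real implementation would need to handle that case.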



-- 
Rohit Garg

http://rpg-314.blogspot.com/

Senior Undergraduate
Department of Physics
Indian Institute of Technology
Bombay
