Re: [eigen] SSE square root
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] SSE square root
- From: Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
- Date: Fri, 27 Mar 2009 08:12:05 -0400
Wonderful job again. Two questions:
1) If you add a Newton step to the last version, the error becomes 0,
right? (2e-7)^2 = 4e-14, which is 0 for floats. How is
performance affected then?
2) Maybe rename TranscendentalFunctions to just MathFunctions, as sqrt
is not a transcendental function...
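On question 1, a minimal sketch of the float-resolution argument, assuming "error" means relative error. The helper names are hypothetical and only illustrate the arithmetic: a perturbation of 4e-14 does vanish next to 1.0f, but 2e-7 is already within about two float epsilons (2^-23, roughly 1.19e-7), so rounding inside the extra step may well dominate whatever the step gains.

```cpp
#include <cassert>
#include <limits>

// Machine epsilon for float: the gap between 1.0f and the next float (2^-23).
inline float float_epsilon() {
    return std::numeric_limits<float>::epsilon();
}

// True when adding rel_err to 1.0f leaves it unchanged, i.e. the
// perturbation is below float resolution near 1. (Hypothetical helper.)
inline bool vanishes_in_float(float rel_err) {
    assert(rel_err >= 0.0f);
    volatile float sum = 1.0f + rel_err;  // volatile: force rounding to float
    return sum == 1.0f;
}
```

With these, vanishes_in_float(4e-14f) holds while vanishes_in_float(2e-7f) does not, which is consistent with the remark quoted below that a second iteration did not improve the measured accuracy.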
cheers,
Benoit
2009/3/27 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
> On Fri, Mar 27, 2009 at 12:59 PM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>> The difference between the last two makes no sense to me. Why should
>> interchanging the order of iteration and multiplication result in a
>> ~1.7x perf difference? Instruction pipelining?
>
> The iterations are not the same: in the first case the iteration
> involves an expensive div.
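To make the difference concrete, here is a scalar sketch of the two refinement steps. It assumes the first variant refines sqrt(x) directly with a Heron-style Newton step, which would explain the div, while the second refines 1/sqrt(x), which needs only multiplies and a subtract. The function names are hypothetical and not taken from the Eigen sources.

```cpp
#include <cassert>

// One Heron (Newton) step on y ~ sqrt(x): y' = (y + x / y) / 2.
// The x / y term is the division that makes this variant slower.
inline float heron_step(float x, float y) {
    assert(y != 0.0f);
    return 0.5f * (y + x / y);
}

// One Newton step on r ~ 1/sqrt(x): r' = r * (1.5 - 0.5 * x * r * r).
// Only multiplies and one subtract; sqrt(x) is then obtained as x * r'.
inline float rsqrt_step(float x, float r) {
    return r * (1.5f - 0.5f * x * r * r);
}
```

For x = 2, starting from the rough guesses y = 1.4 and r = 0.7, one step of either formula already lands close to sqrt(2) and 1/sqrt(2) respectively; the two differ in instruction cost rather than in convergence order.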
>
>> Which version of gcc did you use btw?
>
> upcoming gcc 4.4
>
>> On Fri, Mar 27, 2009 at 5:23 PM, Gael Guennebaud
>> <gael.guennebaud@xxxxxxxxx> wrote:
>>> Here is what I get:
>>>
>>> fpu: 66M scalar sqrt / sec
>>>
>>> _mm_sqrt_ps: 250M scalar sqrt / sec ; max error = 0
>>>
>>> _mm_rsqrt_ps, then x * (1/sqrt(x)), then one iteration: 378M scalar
>>> sqrt / sec ; max error = 1e-7
>>>
>>> _mm_rsqrt_ps, then one iteration to get an accurate 1/sqrt(x),
>>> then one mul: 635M scalar sqrt / sec ; max error = 2e-7
>>> (using 2 iterations does not improve the accuracy)
>>>
>>> I'm testing in the range [0:1e5], and for the reference I convert the
>>> float values to double and call the libc sqrt function.
>>>
>>> I guess the last version is the winner. For information here it is:
>>>
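>>> template<> Packet4f ei_psqrt(Packet4f _x)
>>> {
>>>   // half = _x / 2, reused by the Newton step below
>>>   Packet4f half = ei_pmul(_x, ei_pset1(.5f));
>>>   // fast, low-precision estimate of 1/sqrt(_x)
>>>   Packet4f x = _mm_rsqrt_ps(_x);
>>>   // one Newton step refining x ~ 1/sqrt(_x): x = x * (1.5 - half * x * x)
>>>   x = ei_pmul(x, ei_psub(ei_pset1(1.5f), ei_pmul(half, ei_pmul(x,x))));
>>>   // sqrt(_x) = _x * (1/sqrt(_x))
>>>   x = ei_pmul(_x,x);
>>>   return x;
>>> }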
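For readers without the Eigen packet wrappers at hand, the same sequence can be sketched with raw SSE intrinsics, together with a small accuracy check in the spirit of the test described above (float inputs, double-precision sqrt as reference). This is only a sketch under assumptions: the function names are hypothetical, the error is measured as relative error (the thread does not say which kind was used), and inputs are taken as strictly positive, since for x == 0 the estimate is +inf and 0 * inf gives NaN.

```cpp
#include <cassert>
#include <cmath>
#include <xmmintrin.h>  // SSE intrinsics; x86 only

// Approximate sqrt of four floats: rsqrt estimate, one Newton step, one mul.
// Assumes strictly positive inputs (see the note on x == 0 above).
inline __m128 sqrt_via_rsqrt(__m128 a) {
    const __m128 half = _mm_mul_ps(a, _mm_set1_ps(0.5f));
    __m128 r = _mm_rsqrt_ps(a);  // low-precision estimate of 1/sqrt(a)
    // Newton step on 1/sqrt(a): r = r * (1.5 - half * r * r)
    r = _mm_mul_ps(r, _mm_sub_ps(_mm_set1_ps(1.5f),
                                 _mm_mul_ps(half, _mm_mul_ps(r, r))));
    return _mm_mul_ps(a, r);     // a * (1/sqrt(a)) = sqrt(a)
}

// Max relative error against double-precision std::sqrt over n evenly
// spaced samples in (0, hi]. (Hypothetical helper, not the original test.)
inline double max_rel_error(float hi, int n) {
    assert(hi > 0.0f && n > 0);
    double worst = 0.0;
    for (int i = 1; i <= n; ++i) {
        const float x = hi * static_cast<float>(i) / static_cast<float>(n);
        float out[4];
        _mm_storeu_ps(out, sqrt_via_rsqrt(_mm_set1_ps(x)));
        const double ref = std::sqrt(static_cast<double>(x));
        worst = std::fmax(worst, std::fabs(out[0] - ref) / ref);
    }
    return worst;
}
```

Calling max_rel_error(1e5f, 100000) gives a figure that can be compared with the 2e-7 quoted above, though the exact value may differ with the CPU's rsqrt estimate and the sampling used.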
>>>
>>>
>>> On Fri, Mar 27, 2009 at 11:38 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>>> On Fri, Mar 27, 2009 at 3:58 PM, Gael Guennebaud
>>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>>>> On Fri, Mar 27, 2009 at 8:32 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>>>>> This file has my SSE float implementation of square root. The SSE
>>>>>> square root instruction has only 12 bits of precision so extra
>>>>>
>>>>> Where did you find that sqrtss or sqrtps has only 12 bits of precision?
>>>>
>>>> This info is from the CUDA classes. The lectures put up there say that
>>>> the precision for square root is only 12 bits. I still need to confirm this.
>>>> Your idea of an approximate reciprocal square root, a mul, and 1
>>>> iteration is a good one. Let me try that.
>>>>
>>>> --
>>>> Rohit Garg
>>>>
>>>> http://rpg-314.blogspot.com/
>>>>
>>>> Senior Undergraduate
>>>> Department of Physics
>>>> Indian Institute of Technology
>>>> Bombay