Re: [eigen] SSE square root
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] SSE square root
- From: Gael Guennebaud <gael.guennebaud@xxxxxxxxx>
- Date: Fri, 27 Mar 2009 13:05:14 +0100
On Fri, Mar 27, 2009 at 12:59 PM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
> The difference between the last two makes no sense to me. Why should
> interchanging the order of iteration and multiplication result in
> a ~1.7x perf difference? Instruction pipelining?
The iterations are not the same: in the first case the iteration
involves an expensive div.
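
To make the difference concrete, here is a minimal sketch of the two
variants written with raw SSE intrinsics instead of the Eigen wrappers.
The function names are made up for illustration, and the exact form of
the first variant (a Newton step on sqrt itself, y = 0.5 * (y + x / y))
is an assumption about what was benchmarked, not the code that was
actually timed.

```cpp
#include <cassert>
#include <cmath>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Variant 1 (assumed form): y0 = x * rsqrt(x) ~ sqrt(x), then one Newton
// step on f(y) = y*y - x, i.e. y1 = 0.5 * (y0 + x / y0). Needs a division.
inline __m128 sqrt_refine_with_div(__m128 x) {
    __m128 y = _mm_mul_ps(x, _mm_rsqrt_ps(x));
    return _mm_mul_ps(_mm_set1_ps(0.5f), _mm_add_ps(y, _mm_div_ps(x, y)));
}

// Variant 2: one Newton step on f(r) = 1/(r*r) - x, i.e.
// r1 = r0 * (1.5 - 0.5 * x * r0 * r0), then sqrt(x) = x * r1.
// Only multiplications and one subtraction.
inline __m128 sqrt_refine_rsqrt(__m128 x) {
    __m128 half_x = _mm_mul_ps(x, _mm_set1_ps(0.5f));
    __m128 r = _mm_rsqrt_ps(x);
    r = _mm_mul_ps(r, _mm_sub_ps(_mm_set1_ps(1.5f),
                                 _mm_mul_ps(half_x, _mm_mul_ps(r, r))));
    return _mm_mul_ps(x, r);
}

// Hypothetical helper: largest relative error over the four lanes,
// using the double-precision libc sqrt as the reference.
inline double lane_max_rel_error(__m128 x, __m128 approx) {
    float xs[4], ys[4];
    _mm_storeu_ps(xs, x);
    _mm_storeu_ps(ys, approx);
    double worst = 0.0;
    for (int i = 0; i < 4; ++i) {
        assert(xs[i] > 0.0f);  // both variants break down at x == 0
        double ref = std::sqrt(static_cast<double>(xs[i]));
        double err = std::fabs(static_cast<double>(ys[i]) - ref) / ref;
        if (err > worst) worst = err;
    }
    return worst;
}
```

Both variants start from the same rsqrt estimate; the only structural
difference is the _mm_div_ps in the first one, which is the kind of
instruction the remark above refers to.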
> Which version of gcc did you use btw?
The upcoming gcc 4.4.
> On Fri, Mar 27, 2009 at 5:23 PM, Gael Guennebaud
> <gael.guennebaud@xxxxxxxxx> wrote:
>> Here is what I get:
>>
>> fpu: 66M scalar sqrt / sec
>>
>> _mm_sqrt_ps: 250M scalar sqrt / sec; max error: 0
>>
>> _mm_rsqrt_ps followed by x * (1/sqrt(x)) and one iteration: 378M scalar
>> sqrt / sec; max error: 1e-7
>>
>> _mm_rsqrt_ps followed by one iteration to get an accurate 1/sqrt(x),
>> followed by one mul: 635M scalar sqrt / sec; max error: 2e-7
>> (using 2 iterations does not improve the accuracy)
>>
>> I'm testing in the range [0:1e5], and for the reference I convert the
>> float values to double and call the libc sqrt function.
>>
>> I guess the last version is the winner. For reference, here it is:
>>
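>> template<> Packet4f ei_psqrt(Packet4f _x)
>> {
>>   // Note: for _x == 0, _mm_rsqrt_ps returns +inf, so the result is NaN, not 0.
>>   Packet4f half = ei_pmul(_x, ei_pset1(.5f));  // 0.5 * _x
>>   Packet4f x = _mm_rsqrt_ps(_x);               // approximate 1/sqrt(_x)
>>   // One Newton-Raphson step: x = x * (1.5 - 0.5 * _x * x * x)
>>   x = ei_pmul(x, ei_psub(ei_pset1(1.5f), ei_pmul(half, ei_pmul(x,x))));
>>   x = ei_pmul(_x,x);                           // sqrt(_x) = _x * (1/sqrt(_x))
>>   return x;
>> }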
>>
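
The measurement described above (sweep a range, use the double-precision
libc sqrt as reference) could be sketched roughly as follows. This is
not the benchmark that produced the numbers quoted here: the names
psqrt_rsqrt, psqrt_exact and sweep_max_rel_error are hypothetical, the
Eigen wrappers are replaced by raw intrinsics, the error is assumed to
be a relative error, and the sweep starts just above zero because the
rsqrt-based version returns NaN for an input of exactly 0.

```cpp
#include <cassert>
#include <cmath>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Raw-intrinsic stand-in for the rsqrt-based routine (hypothetical name).
inline __m128 psqrt_rsqrt(__m128 a) {
    __m128 half = _mm_mul_ps(a, _mm_set1_ps(0.5f));
    __m128 x = _mm_rsqrt_ps(a);
    x = _mm_mul_ps(x, _mm_sub_ps(_mm_set1_ps(1.5f),
                                 _mm_mul_ps(half, _mm_mul_ps(x, x))));
    return _mm_mul_ps(a, x);
}

// Baseline: the full-precision SSE square root.
inline __m128 psqrt_exact(__m128 a) { return _mm_sqrt_ps(a); }

// Evaluate f on `count` evenly spaced points in (0, hi], four at a time,
// and return the largest relative error against sqrt computed in double.
template <typename F>
double sweep_max_rel_error(F f, float hi, int count) {
    assert(hi > 0.0f && count >= 4);
    double worst = 0.0;
    for (int i = 0; i + 4 <= count; i += 4) {
        float in[4], out[4];
        for (int k = 0; k < 4; ++k)
            in[k] = hi * static_cast<float>(i + k + 1) / static_cast<float>(count);
        _mm_storeu_ps(out, f(_mm_loadu_ps(in)));
        for (int k = 0; k < 4; ++k) {
            double ref = std::sqrt(static_cast<double>(in[k]));
            double err = std::fabs(static_cast<double>(out[k]) - ref) / ref;
            if (err > worst) worst = err;
        }
    }
    return worst;
}
```

Calling sweep_max_rel_error(psqrt_rsqrt, 1e5f, 100000) and the same with
psqrt_exact gives the two error figures to compare. The absolute values
depend on the CPU's rsqrt approximation and on how the error is defined,
so they need not match the figures in this thread exactly.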
>>
>> On Fri, Mar 27, 2009 at 11:38 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>> On Fri, Mar 27, 2009 at 3:58 PM, Gael Guennebaud
>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>>> On Fri, Mar 27, 2009 at 8:32 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>>>> This file has my sse float implementation for square root. The SSE
>>>>> square root instruction has only 12 bits of precision so extra
>>>>
>>>> Where did you find that sqrtss or sqrtps has only 12 bits of precision?
>>>
>>> This info is from the CUDA classes. The lectures put up there say that
>>> the precision for square root is only 12 bits. Now I need to confirm.
>>> Your idea of an approximate reciprocal square root, a mul, and 1
>>> iteration is a good one. Let me try that.
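
On the 12-bit question: on x86 the roughly-12-bit figure applies to the
approximate instructions such as rsqrtps (Intel documents a maximum
relative error of 1.5 * 2^-12 for it), whereas sqrtps returns a
correctly rounded single-precision result. A small sketch to check this
on a given machine is below; all helper names are hypothetical and the
sampling is deliberately simple.

```cpp
#include <cassert>
#include <cmath>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

typedef __m128 (*PacketFn)(__m128);
typedef double (*ScalarFn)(double);

inline __m128 raw_rsqrt(__m128 a) { return _mm_rsqrt_ps(a); }  // approximate
inline __m128 raw_sqrt(__m128 a) { return _mm_sqrt_ps(a); }    // full precision
inline double ref_rsqrt(double a) { return 1.0 / std::sqrt(a); }
inline double ref_sqrt(double a) { return std::sqrt(a); }

// Largest relative error of f against the double-precision reference,
// over `count` evenly spaced samples in (0, hi].
inline double worst_rel_error(PacketFn f, ScalarFn ref, float hi, int count) {
    assert(hi > 0.0f && count >= 4);
    double worst = 0.0;
    for (int i = 0; i + 4 <= count; i += 4) {
        float in[4], out[4];
        for (int k = 0; k < 4; ++k)
            in[k] = hi * static_cast<float>(i + k + 1) / static_cast<float>(count);
        _mm_storeu_ps(out, f(_mm_loadu_ps(in)));
        for (int k = 0; k < 4; ++k) {
            double r = ref(static_cast<double>(in[k]));
            double err = std::fabs(static_cast<double>(out[k]) - r) / r;
            if (err > worst) worst = err;
        }
    }
    return worst;
}

// Convert a relative error into an approximate number of correct bits.
inline double bits_of_precision(double rel_err) { return -std::log2(rel_err); }
```

For example, bits_of_precision(worst_rel_error(raw_rsqrt, ref_rsqrt,
1e5f, 4096)) should come out at around 11 to 12 bits or more depending
on the CPU, while the same call with raw_sqrt and ref_sqrt should report
about 24 bits.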
>>>
>>> --
>>> Rohit Garg
>>>
>>> http://rpg-314.blogspot.com/
>>>
>>> Senior Undergraduate
>>> Department of Physics
>>> Indian Institute of Technology
>>> Bombay