[ Thread Index |
Date Index
| More lists.tuxfamily.org/eigen Archives
]
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: [eigen] SSE square root
- From: Rohit Garg <rpg.314@xxxxxxxxx>
- Date: Fri, 27 Mar 2009 13:02:05 +0530
- Dkim-signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:date:message-id:subject :from:to:content-type; bh=uO4pVLo2O9uxv6KWcwkufAuGtH8slN1iq//m41jDE+Y=; b=KIB+qb+yLJyOSvCJYv88Ux4y8l0EqshuPQlfqcIsRnOHYM7Se0hwB/IZXTnUhbMbxE z/5tjeGtM6YAFXddc2MXxAhZ62+PfWoPcvuW9vr7UVnb2GNioz8MyrGT4FgD4bvGG5Bz NozUkYN8O9noTV+GHpbIHbNbjD3eVXH4K4G3I=
- Domainkey-signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:date:message-id:subject:from:to:content-type; b=kcB2jMqP2gnwZtwP+2gMZ138kov6odsFZGOyArClCMIS4WboEjq9MXeXV9Uk92OCYP JOdMs2zulhYAx3eBfJxHFu1QQ37fnFmhcY9fgIPlkiHRZ1NJuUwXB+ostqpwpfilnFg8 kEm1pFBu3/FBGb7nPul0F9ygHBrKWST50wJLo=
This file has my sse float implementation for square root. The SSE
square root instruction has only 12 bits of precision so extra
iterations of Newton Raphson may be neccessary. How many of the are
neccessary, I don't know. the max error I was getting was O(1e-8) in
[0,1]. The cephes implementation has square root only for limited
range. They do some other hacks to take care of range. I'll look into
implementing those later. For now, this should be an acceptable for
the fast implementations of square root atleast.
Regards,
--
Rohit Garg
http://rpg-314.blogspot.com/
Senior Undergraduate
Department of Physics
Indian Institute of Technology
Bombay
#include<iostream>
#include<cmath>
#include<pmmintrin.h>
using namespace std;
const __m128 half=_mm_set1_ps(0.5);
__m128 ei_psqrt(const __m128 x)
{
__m128 x1=_mm_sqrt_ps(x);
//now 3 Newton_Raphson iterations
__m128 x2=_mm_mul_ps(half, _mm_add_ps(x1,_mm_div_ps(x,x1)));
__m128 x3=_mm_mul_ps(half, _mm_add_ps(x2,_mm_div_ps(x,x2)));
return _mm_mul_ps(half, _mm_add_ps(x3,_mm_div_ps(x,x3)));
}
const int MAX=100000;
float __attribute__((aligned(16))) in[MAX];
float __attribute__((aligned(16))) out[MAX];
int main()
{
int i;
__m128 temp;
float err, maxerr;
maxerr=0;
for(i=0; i<MAX; i++)
in[i]=float(i)/MAX;
for(i=0; i< (MAX>>2); i++)
{
temp=ei_psqrt(_mm_load_ps(in+4*i));
_mm_store_ps(out+4*i,temp);
//verify
err=(out[i]-sqrtf(in[i]));
maxerr=max(err,maxerr);
err=(out[i+1]-sqrtf(in[i+1]));
maxerr=max(err,maxerr);
err=(out[i+2]-sqrtf(in[i+2]));
maxerr=max(err,maxerr);
err=(out[i+3]-sqrtf(in[i+3]));
maxerr=max(err,maxerr);
}
cout<<maxerr;
return 0;
}