[eigen] SSE square root

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]


This file has my sse float implementation for square root. The SSE
square root instruction has only 12 bits of precision so extra
iterations of Newton Raphson may be neccessary. How many of the are
neccessary, I don't know. the max error I was getting was O(1e-8) in
[0,1]. The cephes implementation has square root only for limited
range. They do some other hacks to take care of range. I'll look into
implementing those later. For now, this should be an acceptable for
the fast implementations of square root atleast.

Regards,

-- 
Rohit Garg

http://rpg-314.blogspot.com/

Senior Undergraduate
Department of Physics
Indian Institute of Technology
Bombay
#include<iostream>
#include<cmath>
#include<pmmintrin.h>
using namespace std;

const __m128 half=_mm_set1_ps(0.5);

__m128 ei_psqrt(const __m128 x)
    {
    __m128 x1=_mm_sqrt_ps(x);
//now 3 Newton_Raphson iterations
    __m128 x2=_mm_mul_ps(half, _mm_add_ps(x1,_mm_div_ps(x,x1)));
    __m128 x3=_mm_mul_ps(half, _mm_add_ps(x2,_mm_div_ps(x,x2)));
    return    _mm_mul_ps(half, _mm_add_ps(x3,_mm_div_ps(x,x3)));
    }

const int MAX=100000;
float __attribute__((aligned(16))) in[MAX];
float __attribute__((aligned(16))) out[MAX];

int main()
    {
    int i;
    __m128 temp;
    float err, maxerr;
    maxerr=0;
    for(i=0; i<MAX; i++)
	in[i]=float(i)/MAX;
    for(i=0; i< (MAX>>2); i++)
	{
	temp=ei_psqrt(_mm_load_ps(in+4*i));
	_mm_store_ps(out+4*i,temp);
//verify
	err=(out[i]-sqrtf(in[i]));
	maxerr=max(err,maxerr);
	err=(out[i+1]-sqrtf(in[i+1]));
	maxerr=max(err,maxerr);	
	err=(out[i+2]-sqrtf(in[i+2]));
	maxerr=max(err,maxerr);
	err=(out[i+3]-sqrtf(in[i+3]));
	maxerr=max(err,maxerr);
	}
    cout<<maxerr;
    return 0;
    }


Mail converted by MHonArc 2.6.19+ http://listengine.tuxfamily.org/