[AD] fixed point `fsqrt' and `fhypot' |
[ Thread Index |
Date Index
| More lists.liballeg.org/allegro-developers Archives
]
When I tried to optimize the particle effect engine of a freeware game that
I'm currently working on, I found out that the `fsqrt' routine is extremely
slow. Here's my version which does excactly the same and works 15 times faster
at my P133 (it uses a `bsr' instruction instead of a series of comparisons
to reduce the range of the input number):
==============================================================================
inline fixed fsqrt (const fixed x)
{
/* This routine uses the following idea:
* sqrt (x) = sqrt (x/d) * sqrt(d)
* d = 2^(2n)
* <=> sqrt (x) = sqrt (x / 2^(2n)) * 2^n
* i386 ASM `bsr' instruction is used for getting `2n' so that (x/d) falls
* in the range 0..255. The square root is then calculated using the lookup
* table `fsqrt_table'.
*/
/* Making the if-block this way reduces the number of conditional branches
* to be taken and should run more efficient...
*/
if (x > 0) {
#if (defined ALLEGRO_GCC) && (defined ALLEGRO_I386)
short cx __attribute__ ((__unused__));
fixed result;
asm (
" movl %2, %0;" /* get shift-count `2n' by scanning `x' */
" shrl $7, %0;"
" xorb %%cl, %%cl;" /* if no bit found: default %cl = 2n = 0 */
" bsrl %0, %%ecx;"
" incb %%cl;" /* make it even --> %cl = 2n */
" andb $0xFE, %%cl;"
" movl %2, %0;"
" shrl %%cl, %0;" /* shift x to fall into range 0..256 */
" movzwl %3(,%0,2), %0;" /* table lookup... */
" shrb $1, %%cl;" /* %cl = n */
" shll %%cl, %0;" /* multiply `sqrt(x/2^(2n))' by `2^n' */
" shrl $4, %0;" /* adjust the result */
: "=&r" (result), /* %0 = result */
"=&c" (cx) /* reliably reserve cx (!?) */
: "rm" (x), /* %2 = x */
"m" (sqrt_table[0]) /* %3 = address of lookup table */
: "%cc" /* clobbers flags */
); return (result);
#else
return (ftofix (sqrt (fixtof (x))));
#endif
}
if (x < 0)
*allegro_errno = EDOM;
return 0;
}
==============================================================================
It does all the necessary error checking etc., so you just have to replace
your square root routine by this one. This routine is also much smaller,
couldn't it be implemented inline in `al386gcc.h'? In that case make sure you
export the `sqrt_table' lookup table, which is currently defined static.
My game uses the square root for calculating `sqrt(fmul(x,x)+fmul(y,y))', the
Pythagoras formula. Unfortunately `fmul(x,x)' will overflow if x > 256. So I
always had to use the formula `sqrt(fmul(x>>4,y>>4),fmul(y>>4,y>>4))<<4' to
be able to work with numbers till 4096. Now I implemented an ASM routine which
does this for fixed point numbers in any range by working with 64 bit
intermediate results. In analogy to the libc routine `hypot' I named it
`fhypot':
==============================================================================
/* fhypot:
* Return fixed point `sqrt (x*x+y*y)', which is the length of the
* hypotenuse of a right triangle with sides of length x and y, or the
* distance of point (x|y) from the origin. This routine is faster and more
* accurate than using the direct formula `fsqrt (fmul (x,x), fmul(y,y))'.
* It will also return correct results for x>=256 or y>=256 where `fmul(x)'
* or `fmul(y)' would overflow.
*/
inline fixed fhypot (const fixed x, const fixed y)
{
/* The idea of this routine is:
* sqrt (x^2+y^2) = sqrt ((x/d)^2+(y/d)^2) * d
* d = 2^n
* `d' has to be chosen so that (x/d)^2 doesn't overflow. Since `x' and `y'
* are fixed point numbers, they are multiplied in the following way:
* x^2 = (x*x)/2^16
* so we come to the formula:
* sqrt(x^2+y^2) = sqrt((x*x + y*y)/2^(16+2n)) * 2^n
* and this is almost the same problem as calculating the square root in
* `sqrt': To find the `n' that results in `(x*x+y*y)/2^(16+2n)' being
* in the range 0..255 so that we can use the lookup table.
*/
#if (defined ALLEGRO_GCC) && (defined ALLEGRO_I386)
fixed ebx __attribute__ ((__unused__));
fixed ecx __attribute__ ((__unused__));
fixed edx __attribute__ ((__unused__));
fixed result;
asm (
" movl %4, %%eax;" /* edx:eax = x*x */
" imull %%eax;"
" movl %%eax, %%ebx;" /* ecx:ebx = x*x */
" movl %%edx, %%ecx;"
" movl %5, %%eax;" /* edx:eax = y*y */
" imull %%eax;"
" addl %%ebx, %%eax;" /* edx:eax = x*x + y*y */
" adcl %%ecx, %%edx;"
" cmpl $0x3FFFFFFF, %%edx;" /* check for overflow */
" jbe 0f;"
" movl %8, %0 ; " /* on overflow, set errno */
" movl %7, (%0) ; "
" movl $0x7FFFFFFF, %0 ; " /* and return MAXINT */
" jmp 1f;"
"0: "
/* And now we're doing a bit-scan to make (x*x+y*y) fall in the range
* 0..255. Since the intermediate result is 48 bit and we cannot scan
* more than 32 bit we'll first have to make to reduce the range to
* 0..65535 and than reduce to 0..255 in a second step. So the
* corresponding formula would be:
* sqrt(x^2+y^2) = sqrt((x*x + y*y)/2^(16+2*n1+2*n2)) * 2^(n1+n2)
*/
" movb $-1, %%cl;" /* %cl = -1 (default value if no bit set) */
" bsr %%edx, %%ecx;" /* %cl = 2*n1-1; if bit0 is set we'll have to
* shift by 1 but `bsr' returns 0 */
" incb %%cl;" /* -> adjust %cl */
" incb %%cl;" /* %cl = 2*n1 -- make sure it's even */
" andb $0xFE, %%cl;"
" shrdl %%cl, %%edx, %%eax;" /* eax = (x*x+y*y)/2^(2*n1) */
" shrl $16, %%eax;" /* eax = (x*x+y*y)/2^(16+2*n1) */
" movb %%cl, %%bl;" /* bl = 2n1 */
" movl %%eax, %%edx;" /* edx = (x*x+y*y)/2^(16+2*n1) */
" shrl $7, %%edx;" /* do another scan on edx to get n2 */
" xorb %%cl, %%cl;"
" bsrw %%dx, %%cx;"
" incb %%cl;" /* %cl = 2*n2 -- make sure it's even */
" andb $0xFE, %%cl;"
" shrl %%cl, %%eax;" /* eax = (x*x+y*y)/2^(16+2*n1+2*n2) */
" movzwl %6(,%%eax,2), %%eax;" /* eax = table lookup square root */
" addb %%bl, %%cl;" /* %cl = 2*n1+2*n2 */
" xorl %%edx, %%edx;"
" shrb $1, %%cl;" /* %cl = n1+n2 */
" shldl %%cl, %%eax, %%edx;" /* multiply result with 2^(n1+n2)... */
" shll %%cl, %%eax;"
" shrdl $4, %%edx, %%eax;" /* adjust lookup table value */
"1: "
: "=&a" (result), /* result=%0 in eax */
"=&d" (edx), /* reliably reserve edx, ecx and ebx (!?) */
"=&c" (ecx),
"=&b" (ebx)
: "rm" (x), /* %4 = x (in memory or register) */
"rm" (y), /* %5 = y (in memory or register) */
"m" (sqrt_table[0]), /* %6 = address of lookup table */
"i" (ERANGE), /* %7 = ERANGE */
"m" (allegro_errno) /* %8 = allegro_errno */
: "%cc", "memory" /* clobbers flags and errno */
); return (result);
#else
return (hypot (ftofix (fixtof (x), fixtof (y))));
#endif
}
==============================================================================
This routine is 5 times faster and more accurate than using the direct formula
with the already optimized `fsqrt' routine. (It takes about 1.7 times the time
of my `fsqrt'). I think the `fhypot' routine would be very useful for 2D
vectors in 2D games. For my particle effects it is of extreme importance.
Using it improved my game's total speed by 15%. So I would like very much
if you could add that routine to the allegro library (but maybe you shouldn't
make it inline). Both routines were tested with Linux-EGCS and DJGPP. They
should be completely bugfree.
If you answer me, please note that I'm not suscribed to that mailing list. I've
an email account at my school which I'm sometimes able to read:
dkuehlin@xxxxxxxxxx
I hope your font makes a difference between `l' (L) and `1' (one), else you'll
answer to the wrong address ;-). Please do not reply that mail to where it came
from since it was sent by my brother (I won't have access to my email account
for 2 weeks).
Bye,
David Kühling
PS: Allegro is great!
PS: The "from" fieldr of this mail is not dkuehlin@... for he temporarily
doesn't have access to his e-mail accout. Regard me as a kind of a
"forwarder" ;-). Consider cc'ing any answer to this message to me.
Felix Kuehling