Hello,

My name is Jose Antonio Luque (SASK). I use Allegro for my games, and I am very happy to contribute this work to the Allegro library.

I have written a new linear_blit16 with MMX support, and a masked_blit16 that works internally on 32 bits at a time.
I also added a new clear_to_color16 with MMX support.

The new code is much faster and I think it is stable, but I have only tested it for a few hours. All the Allegro examples work correctly with it. The profile results below are for DJGPP.
I added MMX code to the blitting routines, plus some 32-bit alignment tricks to skip masked pixels quickly. I only rewrote the 16-bit versions because 16 bit is my favourite color depth: it is faster than 32 bit and has a good range of 65536 colors, not just a poor 256.
If the code is added to Allegro, I will be able to work on an 8-bit version with alignment and MMX. I think alignment and MMX will give the biggest gains at that depth.
I tested the newly compiled library by building all the examples and tests, and everything works fine. The test program reports big gains for blits and masked_blits. The DJGPP version is faster than the MSVC version (why?), but in both versions the new code is faster.
I also tested the new code in a double-buffering program, and it is only 1 frame per second faster than the old version, although according to the profile the new code should be able to blit about 5 more frames per second.
The double buffer is 640x480x16 (= 5 Mbit, about 600 KB, per frame).
With the old code my test reports 100-101 frames/second, and with the new code 101-102.
My test is built for both MSVC and DJGPP, and both builds report 100-101 frames/second.
* Why is the DJGPP Allegro test faster than the MSVC Allegro test?
* Then why does my double-buffer test report the same values for both compilers?
* Why do I only gain one frame/second?

Please, can somebody answer these questions?
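
As a rough sanity check on the double-buffer numbers, the sketch below works out how many bytes per second the frame rate implies. The helper names are hypothetical and not part of Allegro. At 640x480x16 and 100 frames/second the result is about 61.4 million bytes per second, which is already close to the 68-71 Mbytes/second the profile below reports for an aligned blit() from memory to the screen. That may mean the transfer to video memory, rather than the copy loop itself, is what limits the frame rate, but this is only a guess.

```c
#include <assert.h>

/* Hypothetical helper: bytes in one frame at the given size and depth. */
static long frame_bytes(int width, int height, int bits_per_pixel)
{
    assert(width > 0 && height > 0 && bits_per_pixel % 8 == 0);
    return (long)width * height * (bits_per_pixel / 8);
}

/* Hypothetical helper: millions of bytes per second needed for `fps` frames. */
static double frame_bandwidth_mb(int width, int height, int bits_per_pixel,
                                 double fps)
{
    return frame_bytes(width, height, bits_per_pixel) * fps / 1e6;
}
```

For example, frame_bytes(640, 480, 16) gives 614400 bytes, and frame_bandwidth_mb(640, 480, 16, 100.0) gives 61.44.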

I have only tested the new code with DJGPP and MSVC, and it seems stable. If anybody finds a bug, please report it to me and I will fix it.

Thanks.

Allegro 3.9.30 (WIP), djgpp profile results
Graphics driver: VESA 3.0
Description: VESA 3.0 (Universal VESA VBE 6.53 (VxD)), linear
Screen size: 640x480 Virtual screen size: 1024x1024 Color depth: 16 bpp
PC: Intel Pentium 166 MMX , 32MB EDO, S3 VIRGE 325.

Hardware acceleration: <none>

Profile                                  SCREEN                     MEMORY
Times per second                    old      new      new      old      new      new
                                          no mmx      mmx            no mmx      mmx

clear_to_color()                   8588     8588     8904    10624    10624    18539
vram->vram blit()                   963      962     1348      N/A      N/A      N/A
aligned vram->vram blit()           993      992     1376      N/A      N/A      N/A
blit() from memory                 6963     7001     9337    13434    13398    20117
aligned blit() from memory        16696    16734    17448    18770    18742    26304
vram->vram masked_blit()            N/A      N/A      N/A      N/A      N/A      N/A
masked_blit() from memory          6363     9515     9515     6384     9545     9545



The clear bitmap is 64x64, which is 8192 bytes.
The bitmap for the other tests is 64x32, which is 4096 bytes.
SPEED                               SCREEN            MEMORY
in Mbytes/second                    old      new      old      new
                                 no mmx      mmx   no mmx      mmx

clear_to_color()                  70.35    72.94    87.03   151.87
vram->vram blit()                  3.94     5.52        -        -
aligned vram->vram blit()          4.06     5.64        -        -
blit() from memory                26.09    38.24    55.03    82.40
aligned blit() from memory        68.39    71.47    76.88   107.74
vram->vram masked_blit()              -        -        -        -
masked_blit() from memory         26.06    38.97    26.15    39.10
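
The speed figures appear to come straight from the times-per-second counts multiplied by the bitmap size in bytes, with one Mbyte taken as 1,000,000 bytes. A minimal sketch of that conversion follows; the function name is hypothetical and is not taken from the Allegro test program.

```c
#include <assert.h>

/* Hypothetical helper: millions of bytes moved per second, given how many
   times per second a bitmap of width x height pixels was processed. */
static double mbytes_per_second(long calls_per_second, int width, int height,
                                int bytes_per_pixel)
{
    assert(calls_per_second >= 0 && width > 0 && height > 0);
    assert(bytes_per_pixel > 0);
    return (double)calls_per_second * width * height * bytes_per_pixel / 1e6;
}
```

For example, 8588 clears per second of the 64x64 16-bit bitmap gives mbytes_per_second(8588, 64, 64, 2), about 70.35, and 9337 blits per second of the 64x32 bitmap gives about 38.24.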



- clear_to_color() is 3.5% faster for vram and 74.5% faster for memory.
- blit() for vram->vram is 40% faster.
- blit() for memory->vram is 46% faster. Aligned 4.5% faster.
- blit() for memory->memory is 50% faster. Aligned 40% faster.

- masked_blit() is 50% faster.

With the old code, the MSVC version reports worse values than the DJGPP version, for example:

DJGPP blit() from memory; old: 6963 new: 9337

MSVC blit() from memory; old: 6172 new: 8026

but with MSVC the new code is faster than the old code too.

I think these are great results.





What are the changes?



OK, first linear_blit16:
I added MMX code for fast memory copies (reads and writes).

The MMX version moves each bitmap row in 64-bit units (4 pixels each) and then moves the remaining pixels (0-3) at the end. It is fastest when the bitmap width is 64-bit aligned (width = number*4). If the width is not 64-bit aligned, the code has to handle the leftover pixels separately.

SECRET: Use bitmaps whose width is 64-bit aligned (4, 8, 12, 16, ...).
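
The shape of that copy loop can be sketched in portable C. This is only an illustration and not the actual assembler routine: a uint64_t stands in for an MMX register, memcpy stands in for the 64-bit moves, and the function name copy_row16 is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy one row of 16-bit pixels: 4 pixels (64 bits) per step,
   then the 0-3 pixels left over at the end of the row. */
static void copy_row16(uint16_t *dst, const uint16_t *src, int width)
{
    int x = 0;
    assert(dst != NULL && src != NULL && width >= 0);

    for (; x + 4 <= width; x += 4) {
        uint64_t quad;                        /* stand-in for an MMX register */
        memcpy(&quad, src + x, sizeof quad);  /* one 64-bit read  */
        memcpy(dst + x, &quad, sizeof quad);  /* one 64-bit write */
    }
    for (; x < width; x++)                    /* remaining 0-3 pixels */
        dst[x] = src[x];
}
```

When the width is a multiple of 4 the second loop never runs, which is why such widths are the best case.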



The masked_blit16:

I added 32-bit processing: the code reads 32 bits (two pixels) at a time and works on those.
This code is only used if the bitmap width is 32-bit aligned (an even width, not odd: 2, 4, 6, 8, 10, ...).
Why is 'read 2, work, work' faster than 'read 1, work, read 1, work'?
The work itself is not faster. But the code can skip two masked pixels at once, which is faster than skipping them one at a time, and a 32-bit read costs the same as a 16-bit read.
Why not an MMX version?
I asked myself the same thing: why not? But my MMX version was slower than my 32-bit version.

SECRET: Bitmaps with an odd width are slower.
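
The two-pixels-at-a-time idea can be sketched in portable C as below. This is an illustration under assumptions, not the actual assembler routine: the function name masked_row16 is hypothetical, and the mask value 0xF81F is assumed to match Allegro's 16-bit mask color (bright pink).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MASK16      0xF81Fu                     /* assumed 16-bit mask color */
#define MASK16_PAIR ((MASK16 << 16) | MASK16)   /* two mask pixels side by side */

/* Masked copy of one row with an even width: read two pixels per step
   and skip both at once when both are the mask color. */
static void masked_row16(uint16_t *dst, const uint16_t *src, int width)
{
    int x;
    assert(dst != NULL && src != NULL && width >= 0 && width % 2 == 0);

    for (x = 0; x < width; x += 2) {
        uint32_t pair;
        memcpy(&pair, src + x, sizeof pair);    /* one 32-bit read = two pixels */
        if (pair == MASK16_PAIR)
            continue;                           /* both transparent: skip both */
        if (src[x] != MASK16)
            dst[x] = src[x];
        if (src[x + 1] != MASK16)
            dst[x + 1] = src[x + 1];
    }
}
```

Sprites usually contain long runs of transparent pixels, so the single comparison against MASK16_PAIR is where the time is saved. An odd width would leave one pixel unpaired, which is why this path is only taken for even widths.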