Hello,
My name is Jose Antonio Luque (SASK). I use Allegro for my games and I am very happy to contribute this code to the Allegro library.
I have written a new linear_blit16 with MMX support, and a masked_blit16 that works internally on 32 bits at a time.
I have also added a new clear_to_color16 with MMX support.
The new code is much faster and I think it is stable, but I have only tested it for a few hours. All the Allegro examples
work correctly with it. The profile results below are for DJGPP.
I added MMX code to the blitting routines, plus some 32-bit alignment so that the mask can be skipped quickly. I only rewrote
the 16-bit versions because 16 bit is my favourite color depth: it is faster than 32 bit and it has a good range of 65536
colors, not just a poor 256.
If the code is added to Allegro, I will be able to work on an 8-bit version with alignment and MMX. I think alignment and
MMX will give the biggest gain at that depth.
I tested the newly compiled library by building all the examples and tests, and everything works fine. The test program
reports large gains on blits and masked blits. The DJGPP version is faster than the MSVC version (why?), but the new code
is faster in both.
I also tested the new code in a double-buffered program. It is only 1 frame/second faster than with the old version, although
going by the profile the new code should be able to blit about 5 more frames per second.
The double buffer is 640x480x16 (614,400 bytes, roughly 5 Mbit per frame).
With the old code my test reports 100-101 frames/second and with the new code 101-102.
I built my test with both MSVC and DJGPP, and both report 100-101 frames/second.
* Why is the DJGPP Allegro test faster than the MSVC Allegro test?
* Then why does my double buffer test report the same values for both compilers?
* Why do I gain only one frame/second?
Can somebody please answer these questions?
I have only tested the new code with DJGPP and MSVC, and it seems stable. If anybody finds a bug, please report it to me and I will fix it.
Thanks.
Allegro 3.9.30 (WIP), djgpp profile results
Graphics driver: VESA 3.0
Description: VESA 3.0 (Universal VESA VBE 6.53 (VxD)), linear
Screen size: 640x480 Virtual screen size: 1024x1024 Color depth: 16 bpp
PC: Intel Pentium 166 MMX , 32MB EDO, S3 VIRGE 325.
Hardware acceleration: <none>
Profile, times per second | SCREEN old | SCREEN new (no mmx) | SCREEN new (mmx) | MEMORY old | MEMORY new (no mmx) | MEMORY new (mmx) |
clear_to_color() | 8588 | 8588 | 8904 | 10624 | 10624 | 18539 |
vram->vram blit() | 963 | 962 | 1348 | N/A | N/A | N/A |
aligned vram->vram blit() | 993 | 992 | 1376 | N/A | N/A | N/A |
blit() from memory | 6963 | 7001 | 9337 | 13434 | 13398 | 20117 |
aligned blit() from memory | 16696 | 16734 | 17448 | 18770 | 18742 | 26304 |
vram->vram masked_blit() | N/A | N/A | N/A | N/A | N/A | N/A |
masked_blit() from memory | 6363 | 9515 | 9515 | 6384 | 9545 | 9545 |
The bitmap for the clear test is 64x64, which is 8192 bytes.
The bitmap for the other tests is 64x32, which is 4096 bytes.
Speed in Mbytes/second | SCREEN old (no mmx) | SCREEN new (mmx) | MEMORY old (no mmx) | MEMORY new (mmx) |
clear_to_color() | 70.35 | 72.94 | 87.03 | 151.87 |
vram->vram blit() | 3.94 | 5.52 | - | - |
aligned vram->vram blit() | 4.06 | 5.64 | - | - |
blit() from memory | 26.09 | 38.24 | 55.03 | 82.40 |
aligned blit() from memory | 68.39 | 71.47 | 76.88 | 107.74 |
vram->vram masked_blit() | - | - | - | - |
masked_blit() from memory | 26.06 | 38.97 | 26.15 | 39.10 |
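The speed figures follow from the profile counts and the bitmap sizes given above. A minimal sketch of that arithmetic, assuming 1 Mbyte = 1,000,000 bytes (the function name is made up for illustration and is not part of Allegro):

```c
#include <assert.h>
#include <stddef.h>

/* Throughput in Mbytes/second for a test that processes one bitmap of
   width x height pixels, `times_per_second` times each second.
   Assumes 1 Mbyte = 1,000,000 bytes. */
double mbytes_per_second(double times_per_second,
                         size_t width, size_t height,
                         size_t bytes_per_pixel)
{
    size_t bitmap_bytes = width * height * bytes_per_pixel;

    assert(times_per_second >= 0.0);
    return times_per_second * (double)bitmap_bytes / 1e6;
}
```

For example, mbytes_per_second(8588, 64, 64, 2) gives about 70.35, the old clear_to_color() value for the screen.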
- clear_to_color() is 3.5% faster for vram and 74.5% faster for memory.
- blit() for vram->vram is 40% faster.
- blit() for memory->vram is 46% faster. Aligned 4.5% faster.
- blit() for memory->memory is 50% faster. Aligned 40% faster.
- masked_blit() is 50% faster.
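The percentages above compare the new mmx figure with the old one. A small sketch of the calculation (the helper name is hypothetical):

```c
#include <assert.h>

/* How much faster the new code is, in percent, given the old and new
   "times per second" counts (or Mbytes/second values) for the same test. */
double percent_faster(double old_rate, double new_rate)
{
    assert(old_rate > 0.0);
    return (new_rate / old_rate - 1.0) * 100.0;
}
```

With the vram->vram blit() counts, percent_faster(963, 1348) is very close to 40.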
With the old code, the MSVC build reports worse values than the DJGPP build, for example:
DJGPP blit() from memory; old: 6963 new: 9337
MSVC blit() from memory; old: 6172 new: 8026
but with MSVC the new code is also faster than the old.
I think these are great results.
What are the changes?
OK, first linear_blit16:
I added MMX code for fast memory copies.
The MMX version moves each row in 64-bit units (four 16-bit pixels at a time) and then moves the remaining pixels (0-3) at the end. It works fastest when the bitmap width is 64-bit aligned (width = number*4). If the width is not 64-bit aligned, the code has to handle the leftover pixels separately.
SECRET: Use bitmaps whose width is 64-bit aligned (4, 8, 12, 16, ...).
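The copy strategy described above can be sketched in plain C. This is not the actual MMX assembly: each 8-byte memcpy only stands in for one 64-bit MMX load or store, and the function name is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy one row of 16-bit pixels, four pixels (64 bits) at a time,
   then copy the 0-3 pixels that are left over.
   Plain C stand-in for the MMX loop, not the real routine. */
void copy_row16(uint16_t *dst, const uint16_t *src, size_t width)
{
    size_t quads = width / 4;   /* whole 64-bit groups      */
    size_t rest  = width % 4;   /* leftover pixels, 0 to 3  */
    size_t i;

    for (i = 0; i < quads; i++) {
        uint64_t chunk;
        memcpy(&chunk, src, sizeof chunk);   /* 64-bit read  */
        memcpy(dst, &chunk, sizeof chunk);   /* 64-bit write */
        src += 4;
        dst += 4;
    }

    for (i = 0; i < rest; i++)               /* tail, one pixel at a time */
        *dst++ = *src++;
}
```

When the width is a multiple of 4, the tail loop never runs, which is why such widths are the best case.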
The masked_blit16:
I added 32-bit processing: the code reads 32 bits (two pixels) at a time and works on those.
This code is only used if the bitmap width is 32-bit aligned, that is, even rather than odd (2, 4, 6, 8, 10, ...).
Why is 'read 2, work, work' faster than 'read 1, work, read 1, work'?
By itself it is not. But the code can skip two pixels at a time, which is faster than skipping one and then another, and a 32-bit read costs the same as a 16-bit read.
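A rough C sketch of the two-pixel skip, for even widths only. The mask value 0xF81F is assumed here to match Allegro's 16-bit mask color (bright pink), and the function name is hypothetical; the real routine is written differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MASK16 0xF81Fu   /* assumed 16-bit mask (transparent) color */

/* Masked copy of one row: pixels equal to MASK16 are not drawn.
   Reads two pixels per 32-bit load and skips both at once when
   both are transparent. The width must be even. */
void masked_row16(uint16_t *dst, const uint16_t *src, size_t width)
{
    /* Both halves are the same value, so byte order does not matter. */
    const uint32_t both_masked = ((uint32_t)MASK16 << 16) | MASK16;
    size_t x;

    assert(width % 2 == 0);

    for (x = 0; x < width; x += 2) {
        uint32_t pair;
        memcpy(&pair, src + x, sizeof pair);  /* one 32-bit read = two pixels */

        if (pair == both_masked)
            continue;                         /* skip two pixels in one test */

        if (src[x] != MASK16)
            dst[x] = src[x];
        if (src[x + 1] != MASK16)
            dst[x + 1] = src[x + 1];
    }
}
```

The gain comes from the single comparison that rejects a fully transparent pair; sprites with large transparent areas benefit most.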
Why not an MMX version?
I asked myself the same thing, but my MMX version was slower than my 32-bit version.
SECRET: Bitmaps with odd widths are worse.