Re: [hatari-devel] How many colors when rendering in Hatari ?



Hi,

On Wednesday 17 April 2013, Nicolas Pomarède wrote:
> On 16/04/2013 21:09, Eero Tamminen wrote:
> > Currently these checks & updates are done in a very fine-grained way,
> > a long at a time I think.  The proposal was to do this check a line
> > at a time, as this can be done in a much more generic way, just using
> > memcmp().
> 
> I never measured this, but I'm surprised doing a memcmp (so doing 1 read
> of source and 1 read of destination, before a possible write) is faster
> than doing a "brute force" memcpy (1 read of source and 1 write to dest).
>
> Of course the memcmp can avoid doing the ST memory -> SDL conversion
> when not much changes on screen.
> 
> At least from my 68000 experience on the ST/Amiga, the direct memcpy
> would be faster.

You're forgetting that there are bitdepth (and planar -> chunky format)
conversions and that the graphics pipeline can require multiple copies.


memcmp() is done on the ST screen content lines, i.e. without borders it
would be 32kB x2 reads, with a C-library routine that is often hand-written
(MMX, NEON etc) ASM that utilizes the cache better than compiled C-code.
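
A minimal sketch of such a line-based check, written in Python for
brevity rather than in Hatari's C.  Each ST screen line is compared with
the copy kept from the previous frame, and only lines that differ are
reported for conversion.  The names `changed_lines` and `LINE_BYTES` are
made up for this illustration and are not Hatari's actual code; the slice
comparison stands in for memcmp().

```python
LINE_BYTES = 160  # assumed: 320 pixels * 4 bitplanes / 8 bits, ST low-res without borders


def changed_lines(current: bytes, previous: bytes, line_bytes: int = LINE_BYTES) -> list[int]:
    """Return the indices of screen lines that differ from the previous frame."""
    if len(current) != len(previous):
        raise ValueError("frame buffers must have the same size")
    dirty = []
    for offset in range(0, len(current), line_bytes):
        # The slice comparison plays the role of memcmp() on one screen line.
        if current[offset:offset + line_bytes] != previous[offset:offset + line_bytes]:
            dirty.append(offset // line_bytes)
    return dirty


# Example: a 200-line frame in which only one byte on line 42 changed.
old_frame = bytes(LINE_BYTES * 200)
new_frame = bytearray(old_frame)
new_frame[42 * LINE_BYTES + 5] = 0xFF
print(changed_lines(bytes(new_frame), old_frame))  # [42]
```

Only the lines returned here would then need the planar -> chunky
conversion; an unchanged frame returns an empty list.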

However, the rest of the display pipeline would, even on a 16-bit screen,
need 32kB reads & 125kB of writes for the format conversion into
Hatari's SDL surface, and then copying the data to graphics card
memory (125kB reads + writes).

I.e. it's 64kB of (optimized) reads vs. at least 157kB of reads + 250kB
writes, in the case when there would be no screen updates.  With 32-bit,
the difference is much larger.  With borders enabled it's even larger.
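
The figures above can be reproduced with a few lines of Python.  This is
only a sketch of the byte counting, assuming 1 kB = 1024 bytes and a
pipeline of exactly one conversion plus one copy to the graphics card;
the function names are made up for the illustration.  Note that the exact
size of a 320x200@4 screen is 31.25 kB, which is rounded to 32 kB in the
text, so the script gives 62.5 and 156.25 where the text says 64 and 157.

```python
def kib(n_bytes: int) -> float:
    """Convert bytes to kB (1 kB = 1024 bytes, as assumed here)."""
    return n_bytes / 1024


def unchanged_frame_traffic(width: int, height: int, sdl_bpp: int, planes: int = 4):
    """Return (compare_reads, pipeline_reads, pipeline_writes) in kB for one frame."""
    st_bytes = width * height * planes // 8     # planar ST screen data
    sdl_bytes = width * height * sdl_bpp // 8   # converted SDL surface
    compare_reads = 2 * st_bytes                # memcmp(): current frame + previous copy
    pipeline_reads = st_bytes + sdl_bytes       # conversion reads ST data, copy reads surface
    pipeline_writes = 2 * sdl_bytes             # conversion writes surface, copy writes card memory
    return kib(compare_reads), kib(pipeline_reads), kib(pipeline_writes)


for bpp in (16, 32):
    print(bpp, unchanged_frame_traffic(320, 200, bpp))
```

For 16-bit this gives 62.5 kB of compare reads against 156.25 kB of reads
plus 250 kB of writes; for 32-bit the writes double to 500 kB.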


> But what is the balance point ? If the screen is updated each vbl (case
> of scrolling in a game or most demos), doing a memcmp is clearly a
> loss of time, doing the copy immediately would be faster.

If the user has borders enabled, but a demo (or game) updating the screen
at every VBL doesn't write anything to the borders and doesn't use spec512
mode, there's a speed improvement even with a 16-bit SDL surface, just for
the screen conversion:
- 416x288@4 = 58.5 kB, at 16-bit, 234 kB
- 320x200@4 = 32 kB, at 16-bit, 128 kB

Converting whole thing:
 = 58.5 reads + 234 writes = 292.5 kB

Whereas checking 416x288 and converting just 320x200 is less:
 = 58.5 * 2 reads + 32 reads + 128 writes = 277 kB


If the check were line based, or the demo had rasters in the normal
screen part, but not in the top or bottom border:
- 416x200@4 = 40.625 kB, at 16-bit, 162.5 kB, at 32-bit, 325 kB

Checking & converting would not help in 16-bit:
 = 58.5 * 2 reads + 40.625 reads + 162.5 writes = 320.125 kB

But it would still help in (most common) 32-bit case,
as converting whole thing:
 = 58.5 reads + 468 writes = 526.5 kB

Whereas check & conversion would be less:
 = 58.5 * 2 reads + 40.625 reads + 325 writes = 482.625 kB
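
The bordered cases above follow one simple model, sketched here in
Python.  It assumes the check costs two reads of the full planar frame
and that only the changed region is then read and converted; the function
names are made up for the illustration and 1 kB is taken as 1024 bytes.

```python
def frame_kib(width: int, height: int, bits_per_pixel: int) -> float:
    """Size of a width x height frame at the given depth, in kB (1024 bytes)."""
    return width * height * bits_per_pixel / 8 / 1024


def convert_all(full, sdl_bpp: int, planes: int = 4) -> float:
    """Traffic for converting the whole frame: planar reads + surface writes."""
    w, h = full
    return frame_kib(w, h, planes) + frame_kib(w, h, sdl_bpp)


def check_then_convert(full, changed, sdl_bpp: int, planes: int = 4) -> float:
    """Traffic for comparing the whole frame, then converting only `changed`."""
    fw, fh = full
    cw, ch = changed
    check = 2 * frame_kib(fw, fh, planes)  # current frame + previous copy
    convert = frame_kib(cw, ch, planes) + frame_kib(cw, ch, sdl_bpp)
    return check + convert


FULL = (416, 288)
for bpp in (16, 32):
    print(bpp, convert_all(FULL, bpp), check_then_convert(FULL, (416, 200), bpp))
```

This prints 292.5 vs 320.125 kB for 16-bit and 526.5 vs 482.625 kB for
32-bit.  For the 320x200 case at 16-bit it gives 273.25 kB rather than
277 kB, because 32 kB and 128 kB are rounded figures (exactly 31.25 kB
and 125 kB).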


This doesn't take into account any potential savings later
in the graphics pipeline.  I think many composited desktops
don't do partial updates at all, they've been optimized either
to update the whole screen or nothing at all.  However, if they
do support partial updates, savings can be even larger.


> In the case of a GEM desktop, it's quite possible refreshes are less
> frequent and not on the whole screen.
>
> So, at what percentage of screen change is it faster to do a direct
> copy/conversion instead of first comparing buffers ?

See above.
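
To put a number on it under the same byte-count model (an assumption, it
ignores cache effects, the optimized memcmp() and any savings later in
the pipeline): with S kB of planar ST data and W kB of converted data,
checking first costs 2*S + f*(S + W) when a fraction f of the screen has
changed, against S + W for converting everything.  Checking wins while
f < (W - S) / (W + S).  A small Python sketch, with a made-up function
name:

```python
from fractions import Fraction


def break_even_fraction(sdl_bpp: int, planes: int = 4) -> Fraction:
    """Changed-screen fraction below which check-then-convert moves fewer bytes.

    Solves 2*S + f*(S + W) = S + W for f, with W = S * sdl_bpp / planes.
    """
    ratio = Fraction(sdl_bpp, planes)  # W / S
    return (ratio - 1) / (ratio + 1)


for bpp in (16, 32):
    f = break_even_fraction(bpp)
    print(f"{bpp}-bit: check first pays off below {float(f):.1%} changed")
```

That gives 60% for a 16-bit surface and about 77.8% for 32-bit, which
matches the examples: 416x200 out of 416x288 is about 69% of the frame,
so it loses at 16-bit but still wins at 32-bit.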


> I think this would need to be measured in a reproducible way to see
> what the gain of memcmp is in different cases/bpp.

Videl emulation used with Falcon & TT emulation doesn't do any of these
optimizations.  You can easily see even from "top" that it's slower.

To make things comparable to ST emulation, you could use either
EmuTOS or TOS v2, and do:
	--machine falcon --dsp none --cpulevel 0 --cpuclock 8

Then try your favorite demo that is compatible with those
settings.


	- Eero


