Re: [hatari-devel] Very slow emulation when enabling Cycle Exact



Hi,

On 02/13/2018 07:10 PM, Jerome Vernet wrote:
On 13/02/2018 at 17:58, Nicolas Pomarède wrote:

It is required for most accurate emulation, especially for some demos
that require perfect sync between cpu and dsp. But at the moment,
cycle exact mode is not perfect either, some instructions don't have
the right timing yet.
But it's still more accurate than when using "prefetch mode" for example.


In fact, disabling MMU emulation while cycle exact (and prefetch mode)
are enabled keeps it usable. It's at about 90% CPU just in TOS.

On my old (2010) 3 GHz i3, Hatari with cycle exact mode enabled...

TOS v4 idling in desktop takes:
- 80-85% CPU with DSP enabled, regardless of MMU & CPU exact settings
- 70% CPU with DSP emu disabled

This was according to "top", which doesn't take into account
the frequency at which the CPU is running.

DSP can take a *lot* more CPU when it's heavily used.
In the TOS desktop it's just running an idle loop, which is why
there's only a 10-15% difference in CPU usage.


During bootup, CPU utilization was somewhat higher.

I profiled it with Valgrind's Callgrind tool.  Attached is
a callgraph of where most of the PC CPU *instructions* are spent
according to it.  Cache prefill emulation seems to cost a lot.

NOTE: While callgrind gives a good indication of where a program might
be spending its time, it's quite inaccurate and maybe more
interesting for finding out how functions get called in a given
use-case.  This is because the instruction count can differ *a lot*
from what actually takes time (CPU cycles), as it doesn't consider
the impact of caches & instruction pipelining.

(E.g. disabling tracing which is visible in the callgraph, didn't
reduce Hatari CPU usage reported by "top" noticeably.)
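
For anybody wanting to repeat the profiling, a rough sketch of such
a run is below.  The function name, the output file name and starting
"hatari" without arguments are just assumptions for illustration, not
necessarily how the attached graph was produced.  Expect the program
to run many times slower under Valgrind.

```shell
# Sketch: run a program under Callgrind, writing the profile to a given file.
# Usage: profile_callgrind <output-file> <program> [args...]
profile_callgrind() {
    out="$1"
    shift
    # --callgrind-out sets the profile file name (default: callgrind.out.<pid>)
    valgrind --tool=callgrind --callgrind-out="$out" "$@"
}

# Example (assumed program name; add your usual Hatari options):
#   profile_callgrind callgrind.out.hatari hatari
# List the functions with the highest instruction counts:
#   callgrind_annotate callgrind.out.hatari | head -n 40
```

Callgrind also has a --cache-sim=yes option, but that is a cache
simulation, so it still isn't the same as measuring real CPU cycles.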


EmuTOS v0.9.9.1 512k version idling in desktop takes:
- 40% CPU regardless of DSP / MMU setting


There are 2 cores used, both about 50%, so things can be improved.

Hatari is single threaded, but SDL audio handling uses an extra thread.

I.e. 2 core usage is probably from your OS ping-ponging the process
between two cores, which can be part of the problem.

In Linux you can bind a process to a single core with:
	taskset 0x1 <program>
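
A slightly longer sketch of the same idea, for checking whether the
pinning took effect and which thread uses the CPU.  The helper names
and the "hatari" process name are assumptions for illustration, and
this is Linux only.

```shell
# Start a program bound to CPU 0 (same as mask 0x1, given as a CPU list).
# Usage: run_pinned <program> [args...]
run_pinned() {
    taskset -c 0 "$@"
}

# Print the CPU affinity list of an already running process.
# Usage: show_affinity <pid>
show_affinity() {
    taskset -cp "$1"
}

# Example (assumed process name):
#   run_pinned hatari &
#   show_affinity "$(pidof hatari)"
# Per-thread CPU usage, to tell the emulation thread from the audio thread:
#   top -H -p "$(pidof hatari)"
```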


On MacOS, SDL has been traditionally a performance hog.
While that improved with v2, maybe it's still somewhat an issue.


	- Eero

Attachment: hatari-startup-callgraph.png
Description: PNG image


