Re: [hatari-devel] Better cycle accurate mode for 68030 / Falcon
Hi Nicolas,
Great job, I'll test this as soon as I'm back home.
A few remarks :
In Hatari 1.8, I didn't use the Motorola 32-bit table; I recomputed every instruction timing for a 16-bit bus with a 4-cycle memory access time.
So the timings were related to the Falcon bus, not the Amiga one ;)
I worked on this with Mikro, who explained to me how to convert the 32-bit bus cycle values to 16-bit bus cycle values.
You can find a doc about this on his site: https://mikro.naprvyraz.sk/ in the "docs" section (68030, ST RAM and things around it).
Another point: Rodolphe made many measurements of the cycles used by the Videl; you can find them on his site:
http://rodolphe.czuba.free.fr/CT2/english/technic.htm
The general idea is that the Videl reads the ST memory in bursts of long words.
This can take from 4% up to 32% of the bus bandwidth.
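To give a feel for what a 4% to 32% Videl share means for the CPU, here is a rough Python sketch. Only the 4% and 32% bounds come from the figures above; the function name and the 8.0 MB/s raw bandwidth are hypothetical values chosen for illustration, not measurements.

```python
def cpu_bandwidth(raw_mb_per_s: float, videl_share: float) -> float:
    """Bandwidth left for the CPU when the Videl uses `videl_share` of the bus."""
    if not 0.0 <= videl_share <= 1.0:
        raise ValueError("videl_share must be between 0 and 1")
    return raw_mb_per_s * (1.0 - videl_share)


RAW = 8.0  # hypothetical raw bus bandwidth in MB/s, for illustration only

# The two bounds quoted above: 4% and 32% of the bus bandwidth.
for share in (0.04, 0.32):
    left = cpu_bandwidth(RAW, share)
    print(f"Videl share {share:.0%}: {left:.2f} MB/s left for the CPU")
```

This treats the Videl share as a flat average; it says nothing about when in the frame the bursts happen.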
Regards
Laurent
----- Original message -----
From: "Nicolas Pomarède" <npomarede@xxxxxxxxxxxx>
To: hatari-devel@xxxxxxxxxxxxxxxxxxx
Sent: Wednesday 8 February 2017 15:19:27
Subject: [hatari-devel] Better cycle accurate mode for 68030 / Falcon
Hi
I committed the changes I made to better take into account the memory
access times in the Falcon.
This is not a straightforward task, because there's not a lot of
documentation on how the bus is shared between cpu and videl ; we know
it's roughly similar to how the ST works (half of the time for cpu and
half of the time for videl), but I never saw any clear documentation
from Atari on this, nor any documentation that describes the more
complex cases when videl uses higher color depth modes and can block the
cpu even further from accessing the bus.
We can try to solve this by running benchmarks in different video modes
to see how many % we lose due to video and use this as a first
approximation (but in fact the cpu would be slowed down only when pixels
are displayed, not during borders for example).
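The benchmark-based approximation described above could be sketched like this in Python. The function names, the scores and the displayed fraction of the frame are all hypothetical, chosen only to show the arithmetic; no real measurements are implied.

```python
def video_loss_percent(score_reference: float, score_mode: float) -> float:
    """Percentage of speed lost in a video mode, relative to a reference mode."""
    return 100.0 * (1.0 - score_mode / score_reference)


def loss_while_displaying(average_loss_percent: float, display_fraction: float) -> float:
    """Loss during the displayed part of the frame only.

    If the cpu is slowed down only while pixels are displayed, an average
    loss measured over the whole frame corresponds to a larger loss during
    the `display_fraction` of the time where pixels are fetched.
    """
    if not 0.0 < display_fraction <= 1.0:
        raise ValueError("display_fraction must be in (0, 1]")
    return average_loss_percent / display_fraction


# Hypothetical benchmark scores (same test, two video modes).
average = video_loss_percent(score_reference=8.0, score_mode=7.2)
# Hypothetical share of the frame time spent displaying pixels.
during_display = loss_while_displaying(average, display_fraction=0.6)
print(f"average loss {average:.1f}%, loss while displaying {during_display:.1f}%")
```

The second function is why a single per-mode percentage is only a first approximation: the same average hides a larger slowdown on displayed lines and none in the borders.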
Anyway, for now the changes I made assume a video mode where videl
doesn't take any extra cycles from the cpu, for example 640x200 in 4
colors (ST compatible med res).
I used nembench and nimbench (by Doug) to compare various cases.
nembench does only a few tests, but it can be useful as a 1st approach
to "calibrate" the memory access time. All nembench tests run with
data+instr caches on, so this doesn't cover all the cases.
nimbench by Doug is much more precise (thanks again to Doug for all the
programs he wrote for cpu/fpu to improve emulation accuracy !) and covers
more cases with caches on or off.
Here are the test results with latest dev sources :
Nembench :
----------
Integer multiply (16bit)     ->  0.640 Mips      (~104%)
Integer divide (16bit)       ->  0.296 Mips      (~81%)
Linear (stalled) integer     ->  8.007 Mips      (~100%)
Interleaved (piped) integer  ->  8.007 Mips      (~100%)
16bit read (100% hit)        ->  7.902 MByte/sec (~100%)
16bit write (100% hit)       ->  8.143 MByte/sec (~135%)
32bit read (100% hit)        -> 15.797 MByte/sec (~100%)
32bit write (100% hit)       ->  8.210 MByte/sec (~123%)
Linear 32bit read (ST-Ram)   ->  5.251 MByte/sec (~98%)
Linear 32bit write (ST-Ram)  ->  7.943 MByte/sec (~123%)
Linear 32bit copy (ST-Ram)   ->  3.172 MByte/sec (~98%)
Only "slow" 16 bit ram was tested ; results are quite accurate except in
some "write" cases ; for those cases, the problem is not the bus access
time, but the opcode itself, which is not accurate yet regarding some
pipeline / parallel processing inside the 68030 (this is more visible in
the Nimbench measurements below).
Nimbench :
----------
See the attached text file for detailed results per test, as well as
the pdf where colors were added depending on the accuracy for better
readability.
I compare 4 cases :
- real Falcon : those are the results posted by Doug on his Falcon
- Hatari 1.8 : Laurent added some tables in this version with the
i-cached / not cached cycles values for each opcode. This improved
the results, but it should be noted that the cycles are those from
Motorola doc for a 32 bit bus with no wait state, which is not the
case on the Falcon (16 bit bus, shared with videl). So the results
sometimes differ a lot from a real Falcon
- Hatari 2.0 : this used the latest WinUAE core, which gave good instr
and data caches behaviour, but 68030 memory access times were not
changed to reflect a 16 bit bus. The results were worse than Hatari
1.8, some opcodes are sometimes twice as fast as they should be
- Hatari dev : it uses a better model for the 16 bit bus shared with
video (not complete yet as higher video modes are not taken into
account)
Overall, Hatari dev has a much better accuracy ; some individual opcodes
are 20% wrong sometimes, but if you look at the colors in the pdf, the
dev version has a lot more green than the previous Hatari versions.
The big differences are due to how the 68030 can sequence memory
accesses, having the possibility to queue an access in parallel with
internal computation ; so a read/write can be delayed internally and add
some extra cycles later (or not, depending on how it can overlap with the
next instruction). In some cache cases too, we can also see some
differences (for example NOP).
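As a toy illustration of why a queued access sometimes costs extra cycles and sometimes not, here is a small Python sketch. This is not how Hatari or the WinUAE core models the 68030 ; the function name and the cycle counts are hypothetical and only show the overlap idea.

```python
def extra_stall(pending_write_cycles: int, internal_cycles_before_bus: int) -> int:
    """Cycles the next instruction waits for a queued write to finish.

    The write runs on the bus while the next instruction does internal
    work ; only the part of the write that is not hidden behind that
    internal work shows up as extra cycles.
    """
    return max(0, pending_write_cycles - internal_cycles_before_bus)


WRITE_CYCLES = 4  # hypothetical bus time of the queued write

# Next instruction computes for 6 cycles before touching the bus:
# the write is fully hidden.
print(extra_stall(WRITE_CYCLES, 6))

# Next instruction needs the bus after 1 cycle: it has to wait for
# the rest of the write.
print(extra_stall(WRITE_CYCLES, 1))
```

With a per-opcode cycle table alone, both cases would get the same cost, which is one way such differences can appear.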
In the end, the Falcon has 161.32 MIPS and dev version reaches 160.06
MIPS, with an accuracy of 97.83 % (meaning Hatari is globally 2.17%
slower than a real Falcon). I think that's a fairly good score ;-)
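For anyone who wants to redo this kind of comparison from the per-test results, here is a minimal Python sketch of two possible ways to aggregate. How the 97.83 % figure was actually computed is not stated here, so both functions are assumptions ; the per-test pairs are hypothetical, only the two MIPS totals come from the figures above.

```python
def percent_of_reference(emulated: float, real: float) -> float:
    """Emulated score as a percentage of the real machine's score."""
    return 100.0 * emulated / real


def mean_per_test_accuracy(pairs: list[tuple[float, float]]) -> float:
    """Average of per-test accuracies, each computed as 100 * min / max.

    Being too fast and being too slow both lower the result, so this
    can differ from the ratio of the totals.
    """
    accuracies = [100.0 * min(real, emu) / max(real, emu) for real, emu in pairs]
    return sum(accuracies) / len(accuracies)


# Ratio of the two MIPS totals quoted above.
print(f"{percent_of_reference(160.06, 161.32):.2f}% of the real Falcon total")

# Hypothetical (real, emulated) scores for two tests.
print(f"{mean_per_test_accuracy([(10.0, 10.0), (5.0, 4.0)]):.2f}% mean per-test accuracy")
```

The ratio of totals lets tests that run too fast compensate for tests that run too slow, while a per-test average does not, so the two numbers need not match.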
Of course, for specialised code using a lot of move.w, results might
vary, as some "move" forms are still 13% off (or even 29% for "move
dn,(an)", which is similar to what nembench showed)
For the moment, I think it's the best we can do. I discussed with Toni
some of those cases where parallel actions are made inside the 68030 but
the model to emulate this is not correct yet. Maybe this can be improved
later.
Nicolas