Re: [hatari-devel] Better cycle accurate mode for 68030 / Falcon


Hi Nicolas, 

Great job, I'll test this as soon as I'm back home.


A few remarks:


In Hatari 1.8, I didn't use the 32-bit Motorola table; I recomputed every instruction timing for a 16-bit bus with a 4-cycle memory access time.
So the timings were related to the Falcon bus, not the Amiga one ;)

I worked on this with Mikro, who explained to me how to convert the 32-bit bus cycle values to 16-bit bus values.
You can find a doc about this on his site, https://mikro.naprvyraz.sk/, in the "docs" section (68030, ST RAM and things around it).
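
As a rough illustration of why the 32-bit figures do not carry over directly, here is a minimal sketch that assumes only what is stated above: a 16-bit data bus and 4 cycles per memory access. The helper name bus_cycles and the way misaligned accesses are counted are my own assumptions; this is not the conversion described in Mikro's document, only the basic arithmetic behind it.

```python
BUS_WIDTH_BYTES = 2      # 16-bit data bus, as stated above
CYCLES_PER_ACCESS = 4    # memory access time quoted above


def bus_cycles(size_bytes, address=0):
    """Cycles needed to transfer size_bytes starting at address.

    Hypothetical helper: counts how many 16-bit bus words the transfer
    touches and charges CYCLES_PER_ACCESS for each of them.
    """
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    first_word = address // BUS_WIDTH_BYTES
    last_word = (address + size_bytes - 1) // BUS_WIDTH_BYTES
    accesses = last_word - first_word + 1
    return accesses * CYCLES_PER_ACCESS


if __name__ == "__main__":
    for name, size in (("byte", 1), ("word", 2), ("long", 4)):
        print(name, bus_cycles(size))
```

With these assumptions an aligned long costs two bus accesses instead of one, which is the main reason a table built for a 32-bit bus underestimates Falcon timings.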



Another point: Rodolphe made many measurements of the cycles used by the Videl. You can find them on his site:

http://rodolphe.czuba.free.fr/CT2/english/technic.htm

The general idea is that the Videl reads ST RAM in bursts of longs.
This can take from 4% up to 32% of the bus bandwidth.
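
To give an idea of what these percentages could mean for the CPU, here is a small sketch. It assumes, as a simplification of mine, that the code is entirely bus-bound and that whatever the Videl takes is removed from the bandwidth the CPU would otherwise get; the function name cpu_slowdown is hypothetical.

```python
def cpu_slowdown(videl_share):
    """Slowdown factor for fully bus-bound code.

    videl_share is the fraction (0 <= share < 1) of bus bandwidth taken
    by the Videl. Assumption: the CPU gets all the remaining bandwidth.
    """
    if not 0.0 <= videl_share < 1.0:
        raise ValueError("videl_share must be in [0, 1)")
    return 1.0 / (1.0 - videl_share)


if __name__ == "__main__":
    # The two bounds quoted above: 4% and 32% of the bus bandwidth.
    for share in (0.04, 0.32):
        print(f"{share:.0%} -> x{cpu_slowdown(share):.3f}")
```

Under this simplified model, bus-bound code would take roughly 4% longer in the lightest case and roughly 47% longer in the heaviest one; code running from the caches would be affected much less.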


Regards

Laurent



----- Original message -----
From: "Nicolas Pomarède" <npomarede@xxxxxxxxxxxx>
To: hatari-devel@xxxxxxxxxxxxxxxxxxx
Sent: Wednesday 8 February 2017 15:19:27
Subject: [hatari-devel] Better cycle accurate mode for 68030 / Falcon

Hi

I committed the changes I made to better take into account the memory
access times in the Falcon.

This is not a straightforward task, because there's not a lot of
documentation on how the bus is shared between the CPU and the Videl. We
know it's roughly similar to how the ST works (half of the time for the
CPU and half for the Videl), but I never saw any clear documentation
from Atari on this, nor any documentation describing the more complex
cases where the Videl uses higher color depth modes and can block the
CPU from accessing the bus even further.

We can try to solve this by running benchmarks in different video modes
to see what percentage we lose due to video, and use this as a first
approximation (although in fact the CPU would be slowed down only while
pixels are displayed, not during borders for example).
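
The remark about borders can be turned into a small correction. This is only a sketch under a simplified model of mine: the CPU runs at full speed during borders and loses a constant fraction of its throughput while pixels are displayed, so a benchmark running over whole frames measures the average of the two. The function name and the sample numbers are hypothetical, not measurements.

```python
def loss_during_display(measured_loss, active_fraction):
    """Loss while pixels are displayed, from a whole-frame measurement.

    measured_loss:   fraction of speed lost over whole frames (0..1)
    active_fraction: fraction of frame time with pixels displayed (0..1]

    Model (assumption): measured_loss = active_fraction * display_loss.
    """
    if not 0.0 < active_fraction <= 1.0:
        raise ValueError("active_fraction must be in (0, 1]")
    if not 0.0 <= measured_loss <= active_fraction:
        raise ValueError("measured_loss must be in [0, active_fraction]")
    return measured_loss / active_fraction


if __name__ == "__main__":
    # Hypothetical numbers: 10% lost overall, pixels shown half the time.
    print(loss_during_display(0.10, 0.5))
```

In this made-up example a 10% loss measured over whole frames would correspond to a 20% loss while pixels are displayed and none during borders.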

Anyway, for now the changes I made assume a video mode where the Videl
doesn't take any extra cycles from the CPU, for example 640x200 in 4
colors (ST-compatible medium resolution).

I used nembench and nimbench (by Doug) to compare various cases.

nembench runs only a few tests, but it is useful as a first approach
to "calibrate" the memory access time. All nembench tests run with the
data and instruction caches on, so it doesn't cover all the cases.

nimbench by Doug is much more precise (thanks again to Doug for all the
programs he wrote for cpu/fpu to improve emulation accuracy!) and covers
more cases, with caches on or off.


Here are the test results with the latest dev sources:


Nembench :
----------

Integer multiply (16bit)     -> 0.640 Mips (~104%)
Integer divide (16bit)       -> 0.296 Mips (~81%)
Linear (stalled) integer     -> 8.007 Mips (~100%)
Interleaved (piped) integer  -> 8.007 Mips (~100%)

16bit read (100% hit)        -> 7.902 MByte/sec (~100%)
16bit write (100% hit)       -> 8.143 MByte/sec (~135%)
32bit read (100% hit)        -> 15.797 MByte/sec (~100%)
32bit write (100% hit)       -> 8.210 MByte/sec (~123%)

Linear 32bit read (ST-Ram)   -> 5.251 MByte/sec (~98%)
Linear 32bit write (ST-Ram)  -> 7.943 MByte/sec (~123%)
Linear 32bit copy (ST-Ram)   -> 3.172 MByte/sec (~98%)


Only "slow" 16-bit RAM was tested. The results are quite accurate except
in some "write" cases; for those, the problem is not the bus access
time but the opcode itself, whose timing is not accurate yet with regard
to some pipeline / parallel processing inside the 68030 (this is more
visible in the Nimbench measurements below).



Nimbench :
----------

See the attached text file for detailed results per test, as well as
the pdf where colors were added depending on the accuracy, for better
readability.

I compare 4 cases:
  - real Falcon: the results posted by Doug from his Falcon

  - Hatari 1.8: Laurent added tables in this version with the
    i-cached / not cached cycle values for each opcode. This improved
    the results, but it should be noted that the cycles are those from
    the Motorola doc for a 32-bit bus with no wait state, which is not
    the case on the Falcon (16-bit bus, shared with the Videl). So the
    results sometimes differ a lot from a real Falcon

  - Hatari 2.0: this used the latest WinUAE core, which gave good
    instruction and data cache behaviour, but the 68030 memory access
    times were not changed to reflect a 16-bit bus. The results were
    worse than Hatari 1.8; some opcodes are sometimes twice as fast as
    they should be

  - Hatari dev: it uses a better model for the 16-bit bus shared with
    video (not complete yet, as higher video modes are not taken into
    account)

Overall, Hatari dev has much better accuracy; some individual opcodes
are sometimes 20% off, but if you look at the colors in the pdf, the
dev version has a lot more green than the previous Hatari versions.

The big differences are due to how the 68030 can sequence memory
accesses, having the possibility to queue an access in parallel with
internal computation; so a read/write can be delayed internally and add
some extra cycles later (or not, depending on how it overlaps with the
next instruction). In some cache cases we can also see some
differences (for example NOP).

In the end, the Falcon has 161.32 MIPS and the dev version reaches 160.06
MIPS, with an accuracy of 97.83% (meaning Hatari is globally 2.17%
slower than a real Falcon). I think that's a fairly good score ;-)
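
The way the overall accuracy figure is aggregated is not detailed here. As a sketch only, these are two common ways to summarize per-test results, the ratio of the totals and the mean of the per-test ratios; they can give noticeably different numbers. The function names and the three-test sample values are hypothetical and are not taken from the nimbench output.

```python
def ratio_of_totals(emulated, real):
    """Sum of emulated scores divided by sum of real scores."""
    if len(emulated) != len(real) or not real:
        raise ValueError("need two non-empty lists of equal length")
    return sum(emulated) / sum(real)


def mean_of_ratios(emulated, real):
    """Average of the per-test emulated/real ratios."""
    if len(emulated) != len(real) or not real:
        raise ValueError("need two non-empty lists of equal length")
    return sum(e / r for e, r in zip(emulated, real)) / len(real)


if __name__ == "__main__":
    # Hypothetical per-test MIPS: two exact tests, one at half speed.
    real = [8.0, 4.0, 1.0]
    emulated = [8.0, 4.0, 0.5]
    print(f"ratio of totals: {ratio_of_totals(emulated, real):.1%}")
    print(f"mean of ratios:  {mean_of_ratios(emulated, real):.1%}")
```

A slow test with a small MIPS value barely moves the ratio of totals but weighs as much as any other test in the mean of ratios, so which summary is used matters when comparing versions.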

Of course, for specialised code using a lot of move.w, results might 
vary, as some "move" forms are still 13% off (or even 29% for "move 
dn,(an)", which is similar to what nembench showed)


For the moment, I think it's the best we can do. I discussed with Toni
some of those cases where parallel actions are performed inside the
68030, but the model to emulate this is not correct yet. Maybe this can
be improved later.


Nicolas





