Re: [hatari-devel] WinUAE CPU core CPU/FPU/DSP performance according to Centurbo benchmark



Nimbench already shuts off Videl for F30 benchmarking - I'll add some control to shift the video depth up through TC mode if it helps make comparisons.

I recently fixed some stuff for TT and will get that tested first, then post a new version to try with Hatari.

D

On 2 January 2015 at 17:05, Laurent Sallafranque <laurent.sallafranque@xxxxxxx> wrote:
I agree with Nicolas, using div for a CPU benchmark gives us no useful information.

If someone has a "generic" program that we could use to do some benchmarks, I'm OK to use it and bench my real Falcon and Hatari to compare and get some values.
But first, we should have a "protocol" to run the test with as little noise as possible (Videl, DMA, ...)

Maybe running the bench in 320x200 4 colors would remove most of the Videl cycles.
It's just an idea, but if we want to bench the CPU, it should not be "disturbed" by the other components of the Falcon.

Laurent



On 02/01/2015 15:51, Nicolas Pomarède wrote:
On 02/01/2015 15:12, Eero Tamminen wrote:
Hi,

I was mainly wondering how CPU speed can be
off by >10x, whereas e.g. DSP is only off by <2x...

Numbers were the same for the 020; these are for the 16 MHz 040:
   - CPU 294 MHz
   - FPU 926 MHz
   - DSP  32 MHz

No difference in FPU speed depending on the FPU type,
regardless of Wikipedia stating 040 FPU to be a lot
faster:
http://en.wikipedia.org/wiki/Motorola_68881#Selected_statistics


Attached are 030 Falcon results, also from Gembench 4.03.
Integer division seems to be off quite a lot (5x).


Also attached is a profile of what the CPU & DSP sides of the Centurbo
benchmark do, with ROM calls removed (I think they're for GUI updates).

The FPU test seems to be just a bunch of these:
    fcos.x    fp0

And the CPU test a bunch of these:
    divs.l    #$a,d7

DSP test seems slightly larger.



Hi,

The problem is that those instructions are mostly the ones that are not cycle-exact at the moment :(

div.l is always counted as 8 cycles, which is wrong: div/mul take a different number of cycles depending on their operands' values, and the exact timings are not really known for CPUs above the 68000.

Same for the FPU: its cycle counts are not correct.

So the problem with this benchmark is that it mixes results from memory copies (move) with arithmetic operations (add/div/...) and FPU operations.
You could have very good results for move, but if div or FPU timings differ too much from real HW, your global benchmark score will be off by a very large factor.
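
To make the point above concrete, here is a minimal Python sketch of how one mis-timed instruction class can skew a combined score. The cycle counts are made-up illustrative values, not measurements from Hatari or a real Falcon, and `speed_ratio` / `composite_ratio` are hypothetical helper names.

```python
# Illustrative only: the cycle counts below are invented, not measured.

def speed_ratio(real_cycles, emu_cycles):
    """How many times faster the emulator appears than real hardware."""
    return real_cycles / emu_cycles

def composite_ratio(tests):
    """Apparent speed-up when all sub-tests are summed into one score."""
    real_total = sum(real for real, _ in tests.values())
    emu_total = sum(emu for _, emu in tests.values())
    return real_total / emu_total

# name -> (cycles on real hardware, cycles counted by the emulator)
tests = {
    "move": (1000, 1000),  # memory copy timed correctly
    "add":  (1000, 1000),  # simple arithmetic timed correctly
    "div":  (8000, 800),   # division counted 10x too cheap
}

per_test = {name: speed_ratio(r, e) for name, (r, e) in tests.items()}
overall = composite_ratio(tests)

for name, ratio in per_test.items():
    print(f"{name:5s} appears {ratio:.2f}x faster than real HW")
print(f"combined score appears {overall:.2f}x faster")
```

With these invented inputs, two of the three sub-tests match real hardware exactly, yet the combined score is still off by a large factor because the single wrong sub-test dominates the total.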

For now, what we need is our own *very simple* benchmarks, each involving mainly 4-5 instructions if possible:

 - copying memory with .B .W .L variants, with or without cache: this is where we need to update the RAM access times and the fact that they are often rounded to 4 cycles. This would really be the reference test. As long as memory access times are not correct, we won't be good.

 - doing lots of arithmetic operations: apart from div.L/mul.L, I think all timings should be good already
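
As a rough illustration of why the rounding of memory accesses matters for the copy test, here is a small Python sketch. It assumes accesses are rounded *up* to the next multiple of 4 cycles, and the 5-cycle access cost is a hypothetical number chosen only for the example; neither is taken from the Hatari sources.

```python
def round_up_to_bus_cycle(cycles, granularity=4):
    """Round a cycle count up to the next multiple of `granularity`."""
    return (cycles + granularity - 1) // granularity * granularity

def copy_cost(accesses, cycles_per_access, rounded):
    """Total cycles for a copy loop, with or without per-access rounding."""
    per_access = (round_up_to_bus_cycle(cycles_per_access)
                  if rounded else cycles_per_access)
    return accesses * per_access

# Hypothetical: 1000 accesses that would really take 5 cycles each.
exact = copy_cost(1000, 5, rounded=False)
rounded = copy_cost(1000, 5, rounded=True)
print(exact, rounded, rounded / exact)
```

Under these assumptions a small per-access error is multiplied by every access in the loop, which is why the copy test would show the discrepancy so clearly.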

IIRC Doug posted some results from a test program he wrote some weeks ago; that would be a good way of comparing emulation and real HW when he has time to work on it.

Some of the Gembench results could be used (RAM/ROM access, int division, ...) but unless we disassemble it, we don't know what kind of operations are done. It would be better to start with our own, simpler tests to ensure our foundations are solid, then move on to more complex benchmarks made by others.
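
One possible way to compare such simple tests between real HW and emulation is sketched below in Python. The test names, the cycle counts and the 5% tolerance are all placeholders for illustration, and `compare` is a hypothetical helper, not part of any existing tool.

```python
def compare(real, emu, tolerance=0.05):
    """Return {test: (emu/real ratio, within_tolerance)} for shared tests."""
    report = {}
    for name in sorted(real):
        if name not in emu:
            continue  # skip tests that were only run on one side
        ratio = emu[name] / real[name]
        report[name] = (ratio, abs(ratio - 1.0) <= tolerance)
    return report

# Placeholder cycle counts per micro-test, not real measurements.
real_hw = {"move.l": 1000, "add.l": 500, "divs.l": 4000}
emulated = {"move.l": 1000, "add.l": 510, "divs.l": 800}

report = compare(real_hw, emulated)
for name, (ratio, ok) in report.items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name:8s} emu/real = {ratio:.2f}  {status}")
```

Keeping one line per micro-test makes it obvious which instruction class is off, instead of hiding it inside a single global score.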

In any case, I think the memory wait cycles, which are not yet correct, will make most tests fail for the moment, as they change prefetch and caching times at the lowest level.

Nicolas












