Re: [hatari-devel] DSP performance

2015-07-01 11:37 GMT+02:00 Nicolas Pomarède <npomarede@xxxxxxxxxxxx>:

On 01/07/2015 11:17, Adam Klobukowski wrote:

IMO, it can. You know in advance how many cycles it will take to
emulate an instruction, so you know when it 'happens'. Knowing the microcode
(or the internal CPU layout) would be perfect, but is not really necessary; you
just need to 'measure' when the instruction tries to access the bus. It
wouldn't be much slower, because there would be a lot of empty cycles,
and the computation of some instructions would 'smear' over many cycles. This
could also be a good base for the otherwise tricky emulation of border
removal, palette tricks, sync scrolling and so on.

The problem we actually have with the Falcon in Hatari is that we *don't know in advance* the instructions' cycle counts (at least, not all of them are correct).

Not knowing the perfect cycle count for an instruction is not the same as not knowing it in advance.

If you consider the case of STF + 68000 emulation, where instruction cycles are known, Hatari has for many years been able to run all the tricky shifter effects without a master clock that runs each component (CPU, DMA, FDC, ...) for 2 cycles on every bus slot.

So the problem is not choosing whether the CPU is the main clock or whether another clock masters everything; the problem is knowing exactly the number of cycles of each instruction. Knowing the microcode is a plus, because you then know exactly when accesses are made, and this can solve some tricky timing issues (this is also available in 68000 CE mode at the moment, but not for 68020/30).

However you turn the problem, not having the correct cycle count for each instruction implies a speed difference for the CPU, and between the CPU and the DSP. This is not a problem of choosing a reference clock.
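To put rough numbers on that point, here is a minimal sketch of how a miscounted CPU instruction turns into DSP drift. Everything in it is illustrative: the function names are hypothetical (not Hatari code), the cycle counts are made up, and the 2:1 ratio assumes the DSP is stepped in proportion to CPU cycles with a 32 MHz DSP against a 16 MHz CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the DSP is granted 2 cycles per emulated CPU cycle
 * (32 MHz DSP against a 16 MHz CPU). */
enum { DSP_CYCLES_PER_CPU_CYCLE = 2 };

/* DSP cycles handed out while the CPU core accounts for cpu_cycles. */
static uint64_t dsp_cycles_granted(uint64_t cpu_cycles)
{
    return cpu_cycles * DSP_CYCLES_PER_CPU_CYCLE;
}

/* Extra (positive) or missing (negative) DSP cycles after `count`
 * executions of an instruction that really takes real_cycles but is
 * emulated as emulated_cycles. Hypothetical helper, for illustration. */
static int64_t dsp_drift(uint64_t real_cycles, uint64_t emulated_cycles,
                         uint64_t count)
{
    assert(real_cycles > 0 && emulated_cycles > 0);
    return (int64_t)dsp_cycles_granted(emulated_cycles * count)
         - (int64_t)dsp_cycles_granted(real_cycles * count);
}
```

With these made-up figures, an instruction that really takes 6 cycles but is counted as 8 gives the DSP 4 extra cycles each time it runs, whichever component is treated as the reference clock.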

In the case of the STF + 68000, since we know the microcode, we could run each instruction by splitting it into 2-cycle steps, something like:

while ( 1 )
{
    run_cpu ( 2 );
    run_fdc ( 2 );
    run_shifter ( 2 );
    run_ste_dma_sound ( 2 );
    ...
    add_cycle_master_clock ( 2 );
}

But when you really look at the cases you would have to handle, you see it's inefficient to always split each CPU instruction into 2-cycle steps.
For example, splitting a DIVU or a MOVEM that takes 100 cycles would be a nightmare to handle, because you would need to save the internal context of each instruction to stop it and restore it 2 cycles later.

Not really. The CPU already has a context, and it would just need to remember which instruction is executing and actually 'execute' it when it finishes. It might be more complex if real bus access times were emulated.
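A minimal sketch of that idea, under stated assumptions: the Insn and Cpu types and cpu_run() are hypothetical, the cycle cost of each instruction is taken as known in advance, and the whole effect of an instruction is reduced to one addition applied on its last cycle. Bus access timing inside the instruction is deliberately not modelled, which is the part that would make it more complex.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical instruction record, for illustration only. */
typedef struct {
    int cycles;         /* total cycle cost, assumed known in advance */
    int32_t add_to_d0;  /* stand-in for the instruction's whole effect */
} Insn;

typedef struct {
    const Insn *pending; /* instruction currently in flight, or NULL */
    int cycles_left;     /* cycles remaining before it completes */
    int32_t d0;          /* one register standing in for CPU state */
    int pc;              /* index of the next instruction to start */
} Cpu;

/* Advance the CPU by `slice` cycles (e.g. 2). The only context kept
 * across slices is which instruction is pending and how many cycles it
 * still needs. Returns the number of instructions completed. */
static int cpu_run(Cpu *cpu, const Insn *prog, int prog_len, int slice)
{
    int completed = 0;

    while (slice > 0) {
        if (cpu->pending == NULL) {
            if (cpu->pc >= prog_len)
                break;                       /* nothing left to run */
            cpu->pending = &prog[cpu->pc++];
            assert(cpu->pending->cycles > 0);
            cpu->cycles_left = cpu->pending->cycles;
        }

        int step = slice < cpu->cycles_left ? slice : cpu->cycles_left;
        cpu->cycles_left -= step;
        slice -= step;

        if (cpu->cycles_left == 0) {
            /* The instruction is only 'executed' once it finishes. */
            cpu->d0 += cpu->pending->add_to_d0;
            cpu->pending = NULL;
            completed++;
        }
    }
    return completed;
}
```

Called with a slice of 2, a 100-cycle instruction simply occupies 50 consecutive slices and changes state on the last one.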

It's more efficient to run each instruction as a whole, but to update all other components each time the microcode of the instruction does a bus access, for example (this is how WinUAE works in CE mode for Amiga: during the emulation of one complete instruction, it will update the copper, blitter, FDC, ...).
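A simplified sketch of that scheme, not taken from WinUAE or Hatari: the Machine and Chip types and all function names are hypothetical, an instruction is described only by the gaps between its bus accesses plus a tail of internal cycles, and "emulating" a chip just means counting the cycles it has been advanced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical component that is lazily caught up to the CPU's clock. */
typedef struct {
    uint64_t synced_to;   /* clock value this chip was last updated to */
    uint64_t cycles_run;  /* stand-in for the chip's real emulation work */
} Chip;

typedef struct {
    uint64_t clock;       /* cycles elapsed, driven by the CPU core */
    Chip shifter;
    Chip fdc;
    unsigned sync_count;  /* how many times the chips were updated */
} Machine;

static void chip_catch_up(Chip *chip, uint64_t now)
{
    assert(now >= chip->synced_to);
    chip->cycles_run += now - chip->synced_to;
    chip->synced_to = now;
}

/* Called at each bus access made by the instruction's microcode. */
static void bus_access(Machine *m, unsigned cycles_since_last)
{
    m->clock += cycles_since_last;
    chip_catch_up(&m->shifter, m->clock);
    chip_catch_up(&m->fdc, m->clock);
    m->sync_count++;
}

/* Run one whole instruction: gaps[i] is the number of cycles before its
 * i-th bus access, tail the internal cycles after the last access. */
static void run_instruction(Machine *m, const unsigned *gaps, size_t n,
                            unsigned tail)
{
    for (size_t i = 0; i < n; i++)
        bus_access(m, gaps[i]);
    m->clock += tail;  /* no bus activity, so nothing else needs updating */
}
```

With made-up timings, an instruction with two bus accesses followed by 132 internal cycles triggers 2 updates of the other chips, where a fixed 2-cycle loop would step them 70 times over the same 140 cycles. The chips absorb the idle cycles at the next bus access.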

This is not limited to knowing exact 680x0 cycles; there is also the fact that other chips steal bus accesses too, especially the Videl. I've briefly checked the Videl code, and I don't see any cycles counted there. Without that, DSP/CPU sync will never be good enough, especially for demos that set up weird video modes.

Semper Fidelis

Adam Klobukowski
adamklobukowski@xxxxxxxxx