Re: [hatari-devel] DSP performance

I agree, NOP takes 4 cycles in non-prefetch mode according to the Motorola 68030 UM doc (rev 3).


To compute the cycle count of an instruction, there are 5 parameters to take into account:

- Are we using the instruction cache or not?
- Is there a cache miss or not?
- The instruction timing given by Motorola for a 32-bit bus, 2-cycle access, aligned instruction
- The .b, .w or .l size of the instruction
- The addressing mode used by this instruction


The Motorola documentation explains this well in chapter 11 (MC68030 UM/AD rev 3).
It also gives some examples of cycle analysis with bus accesses (have a look at figures 11.4 and 11.5).
 

On the Falcon, a bus access takes 4 cycles, but the Motorola doc gives all the timings for a 2-cycle bus access.
The doc also assumes a 32-bit bus, whereas the Falcon bus is only 16 bits wide.
It also assumes the instructions are aligned.


So we need to compute the internal cycles of an instruction and then recompute the total with 4-cycle bus accesses.

How to do this?


Let's take NOP as an example:


NOP 0  0  2(0/0/0)  2(0/1/0)


The 0 0 2(0/0/0)   part is the cache case (instruction fetched from the cache)
The 2(0/1/0)       part is the no-cache case


Let's have a look at the no-cache part first, 2(0/1/0):

The instruction takes 2 cycles (with a 2-cycle bus access), with 0 read accesses, 1 instruction access and 0 write accesses (the figures in parentheses).

According to the Motorola 68030 UM doc:

(total number of cycles) - (number of bus activity cycles) = (number of internal cycles)
Non-cache case 2(0/1/0):     2 - (0*2 + 1*2 + 0*2) = 0 internal cycles

So, on the Falcon, the total number of cycles is 0 (internal cycles) + 0*4 (read access) + 1*4 (instruction access) + 0*4 (write access) = 4 cycles.

The same method applies to all instructions.

For a long (.l) instruction, each access needs 2 bus accesses, so multiply by 8 instead of 4.
Example: an instruction listed as 6(0/1/1) has 6 - (0*2 + 1*2 + 1*2) = 2 internal cycles and takes
  2 + 0*4 + 1*4 + 1*4 = 10 cycles in word access
  2 + 0*8 + 1*8 + 1*8 = 18 cycles in long access
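
The conversion above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in this mail, not code from Hatari: the names `falcon_cycles`, `DOC_BUS` and `BUS_CYCLES` are made up for the example, and the values are the ones quoted above (2-cycle documented bus, 4 cycles per word access and 8 per long access on the Falcon).

```python
# Illustrative sketch only: these names are hypothetical, not Hatari identifiers.
DOC_BUS = 2                      # bus access length assumed by the Motorola tables
BUS_CYCLES = {"w": 4, "l": 8}    # Falcon figures used in this mail (word / long)

def falcon_cycles(total, reads, fetches, writes, size="w"):
    """Convert a documented timing total(reads/fetches/writes) to Falcon cycles."""
    accesses = reads + fetches + writes
    internal = total - accesses * DOC_BUS      # cycles without bus activity
    return internal + accesses * BUS_CYCLES[size]

nop = falcon_cycles(2, 0, 1, 0)                # NOP, no-cache case 2(0/1/0)
word = falcon_cycles(6, 0, 1, 1, "w")          # the 6(0/1/1) example, word access
long_ = falcon_cycles(6, 0, 1, 1, "l")         # same example, long access
print(nop, word, long_)                        # 4 10 18
```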
  

OK, so now the cache part:


Let's have a look at the following instruction: 1  4  10(2/1/1)

The 1 is called the head
The 4 is called the tail

10(2/1/1) means: 10 cycles (for a 2-cycle bus access) composed of 2 read accesses, 1 instruction access and 1 write access.

So the instruction has 10 - (2*2 + 1*2 + 1*2) = 2 internal cycles

On the Falcon, the .w instruction takes 2 + (2*4 + 1*4 + 1*4) = 18 cycles
On the Falcon, the .l instruction takes 2 + (2*8 + 1*8 + 1*8) = 34 cycles

But you have to subtract from this value the minimum of the head of this instruction and the tail of the previous one.

The current instruction (.w) takes 18 - min(1,3) = 18 - 1 = 17 cycles (1 is the head of this instruction and 3 is the tail of the previous one in this example)
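
Here is a small Python sketch of the head/tail adjustment, using the numbers of this example. The function names `rescaled` and `with_overlap` are hypothetical, and the previous instruction's tail of 3 is just the value assumed in the example.

```python
# Illustrative sketch only: function names are hypothetical.
def rescaled(total, reads, fetches, writes, bus, doc_bus=2):
    """Replace the documented 2-cycle bus accesses with `bus`-cycle accesses."""
    accesses = reads + fetches + writes
    return total - accesses * doc_bus + accesses * bus

def with_overlap(cycles, head, prev_tail):
    """Subtract the overlap between this head and the previous instruction's tail."""
    return cycles - min(head, prev_tail)

word = rescaled(10, 2, 1, 1, bus=4)              # 10(2/1/1) in word access: 18
long_ = rescaled(10, 2, 1, 1, bus=8)             # 10(2/1/1) in long access: 34
print(with_overlap(word, head=1, prev_tail=3))   # 17
```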


Last point: the addressing modes also consume cycles and must be recomputed for the Falcon bus.

For example, (An)+ takes 0  1  3(1/0/0)  3(1/0/0)

For a .w access in no-cache mode, it has 3 - 2*1 = 1 internal cycle, so 1 + 1*4 = 5 cycles must be added to the cycles of the instruction
For a .l access in no-cache mode, it has 3 - 2*1 = 1 internal cycle, so 1 + 1*8 = 9 cycles must be added to the cycles of the instruction
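
As a sketch, the addressing-mode cost can be converted the same way and then added to the instruction cost. The names below are hypothetical, and pairing (An)+ with the 6(0/1/1) timing is only an assumption made to show the addition, not an entry taken from the Motorola tables.

```python
# Illustrative sketch only: names and the instruction/EA pairing are hypothetical.
def rescaled(timing, bus, doc_bus=2):
    """timing is (total, reads, fetches, writes) as documented for a 2-cycle bus."""
    total, reads, fetches, writes = timing
    accesses = reads + fetches + writes
    return total + accesses * (bus - doc_bus)

EA_POSTINC = (3, 1, 0, 0)    # (An)+ no-cache entry quoted above: 3(1/0/0)
INSTR = (6, 0, 1, 1)         # assumed instruction timing, for illustration only

ea_word = rescaled(EA_POSTINC, bus=4)            # 5
ea_long = rescaled(EA_POSTINC, bus=8)            # 9
total_word = rescaled(INSTR, bus=4) + ea_word    # 10 + 5 = 15
print(ea_word, ea_long, total_word)              # 5 9 15
```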

Be careful with the addressing modes: there are 5 families with different timings (according to the 68030 UM doc):

  11.6.1 Fetch Effective Address (fea)
  11.6.2 Fetch Immediate Effective Address (fiea)
  11.6.3 Calculate Effective Address (cea)
  11.6.4 Calculate Immediate Effective Address (ciea)
  11.6.5 Jump Effective Address



That's the work I did for all the 1900+ instructions and addressing modes in the static table.
I've converted every addressing mode of every instruction to the Falcon values.

I think a good approach for a generic 68030 emulator would be to keep the internal cycles of each instruction and each addressing mode, and then compute the final cycle values according to the bus access cycles of the machine.

I.e. for the Falcon, 4 cycles per bus access (8 in long access modes); for the Amiga, 2 cycles per bus access.
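
A minimal sketch of that idea in Python, assuming a table entry stores the internal cycles plus the three access counts. The `Timing` class and the `BUS` table are made up for illustration (this is not how Hatari's static table is laid out), and the bus figures are the ones given in this mail.

```python
from dataclasses import dataclass

# Illustrative sketch only: Timing and BUS are hypothetical names.
@dataclass(frozen=True)
class Timing:
    internal: int   # cycles without bus activity
    reads: int      # read accesses
    fetches: int    # instruction accesses
    writes: int     # write accesses

    @classmethod
    def from_doc(cls, total, reads, fetches, writes, doc_bus=2):
        """Build an entry from a documented total(reads/fetches/writes) timing."""
        accesses = reads + fetches + writes
        return cls(total - accesses * doc_bus, reads, fetches, writes)

    def cycles(self, bus):
        """Final cycle count for a machine with `bus` cycles per bus access."""
        return self.internal + (self.reads + self.fetches + self.writes) * bus

# Bus access lengths quoted in this mail.
BUS = {"falcon_word": 4, "falcon_long": 8, "amiga": 2}

nop = Timing.from_doc(2, 0, 1, 0)
print(nop.cycles(BUS["falcon_word"]), nop.cycles(BUS["amiga"]))   # 4 2
```

With a 2-cycle bus the computed value is simply the documented total again, which is a quick sanity check on the table.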

Hope this makes sense; otherwise I can explain again.

Regards
Laurent




----- Original message -----
From: "Nicolas Pomarède" <npomarede@xxxxxxxxxxxx>
To: hatari-devel@xxxxxxxxxxxxxxxxxxx
Sent: Wednesday 1 July 2015 10:39:40
Subject: Re: [hatari-devel] DSP performance

On 01/07/2015 10:34, Miro Kropácek wrote:
>     I still must be missing something..
>
>     If this is true, then there is no way for NOP to take 4 cycles to
>     execute. Either it takes 2 cycles (cache or next opcode word already
>     in prefetch buffer) or it takes at least 8 cycles.
>
> What's wrong here? It does take two cycles on falcon (if cached) or 4
> (if not, execution time is absorbed in the word prefetch time). To be
> honest, I'm not 100% confident about the latter but it made sense back then.
>

I think it would be surprising for a 68030 at 16 MHz to take 8 cycles for 
a NOP when the old STF's 68000 at 8 MHz takes 4 cycles.
Or it means the memory access time is really the bottleneck (because 8 
cycles @ 16 MHz is the same time as 4 cycles @ 8 MHz)





