Re: [hatari-devel] Suspicious instruction & data cache hit/miss accounting


On 06/02/2018 at 12:11, Eero Tamminen wrote:

On 02/02/2018 11:37 PM, Nicolas Pomarède wrote:
On 02/02/2018 at 21:54, Eero Tamminen wrote:
Here's example disassembly from EmuTOS on Falcon emu.

Instructions which have either zero instruction cache hits & misses,
or zero data cache hits & misses, are marked with '*':
As you can see, they're the majority (as indicated by
the profiler cache hit/miss histogram).

If you want more output, I pushed a commit that shows the info
after you set "DEBUG" to 1 in profilecpu.c, re-build Hatari,
start Falcon or TT emulation, and enable profiling:

It's common enough that you see it immediately, regardless
of what you run and on what 030 TOS version.

Regarding the data cache, most instructions in these lines write data rather than read it. So it seems normal that there's no hit/miss when writing, only when reading.


(I'll add a reminder to the cache histogram info that data cache
events can happen only for instructions doing data reads.)

As for the instruction cache, do you have another example where a small piece of code is repeated in a loop but there is no hit/miss for the instruction cache? Such a case would indeed be strange, as instructions are likely to go into the cache during a small loop.

I changed the cache debugging code to include both hits & misses
for both instruction & data cache in the disassembly.

Attached is profile for beginning of New Beat's Falcon demo
called "Blue".  It has several short loops.

The items inside parenthesis are:
- instruction execution count for given address
- cycle count
- i-cache hits
- i-cache misses
- d-cache hits
- d-cache misses

The simplest loop (with code surrounding it) looks like this:
$0001f772: adda.l   d2,a2        0.00% (91, 0, 0, 0, 0, 0)
$0001f774: movea.l  $21c9a,a3    0.00% (91, 728, 182, 0, 0, 0)
$0001f77a: movea.l  (a3),a3      0.00% (91, 728, 0, 0, 0, 0)
$0001f77c: move.w   #$1c1f,d5    0.00% (91, 0, 91, 0, 0, 0)

$0001f780: move.l   (a3)+,(a2)+  8.05% (655200, 10483200, 0, 0, 0, 0)
$0001f782: dbra     d5,$1f780    8.05% (655200, 0, 1310400, 0, 0, 0)

$0001f786: rts                   0.00% (91, 819, 182, 0, 0, 0)

As can be seen from the disassembly stats for the loop,
i-cache data is there only for the branching instruction
(as I deduced from the Hatari code).

"dbra" gets 2x i-cache hits for each executed instruction, and
no cycles, whereas the other loop instruction gets all cycles.

Branching at "rts" also gets 2x i-cache hits, and cycles.
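
To make ratios such as the "2x" above easier to check, here is a minimal sketch that parses one line of the disassembly and divides i-cache hits by the execution count. The helper names (parse_line, i_hits_per_exec) and the regular expression are assumptions made for illustration only; they are not part of Hatari or its profiler scripts, and the field order is simply the one listed in this mail.

```python
import re

# Hypothetical helper, not part of Hatari: matches lines such as
#   $0001f782: dbra     d5,$1f780    8.05% (655200, 0, 1310400, 0, 0, 0)
# The '$' prefix and the ':' after the address are both optional.
LINE_RE = re.compile(
    r"^\$?(?P<addr>[0-9a-fA-F]+):?\s+(?P<insn>.*?)\s+"
    r"(?P<pct>\d+\.\d+)%\s+\((?P<stats>[^)]*)\)$"
)

# Field order as listed in the mail for the items inside the parentheses.
FIELDS = ("count", "cycles", "i_hits", "i_misses", "d_hits", "d_misses")


def parse_line(line):
    """Return a dict of counters for one disassembly line, or None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    values = [int(v) for v in m.group("stats").split(",")]
    if len(values) != len(FIELDS):
        return None
    entry = dict(zip(FIELDS, values))
    entry["addr"] = int(m.group("addr"), 16)
    # Collapse the column padding inside the instruction text.
    entry["insn"] = " ".join(m.group("insn").split())
    return entry


def i_hits_per_exec(entry):
    """I-cache hits per executed instruction (0.0 if never executed)."""
    return entry["i_hits"] / entry["count"] if entry["count"] else 0.0
```

With the "dbra" line from the loop above, i_hits_per_exec() gives exactly 2.0 (1310400 hits over 655200 executions), and the "rts" line gives 2.0 as well (182 over 91).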

Are the hits for the instructions leading to the loop due to
prefetch being done on them, with a hit naturally following
because the code flow doesn't diverge?

Then the other loop with 2+1 instructions:
$1f302 tst.b  $21cac     27.93% (2271792, 17040162, 71, 71, 0, 0)
$1f308 beq    $1f3e0     27.93% (2271791, 14767904, 4543233, 407, 0, 0)

$1f30c cmpi.w #1,$21c50   0.00% (189, 2268, 0, 189, 0, 0)
$1f3da clr.b  $21cac      0.00% (189, 756, 0, 0, 0, 0)

$1f3e0 bra    $1f302     27.93% (2271790, 18175267, 6815453, 95, 0, 0)

$1f3e4 move.b #1,$21cac   0.00% (189, 2271, 0, 189, 0, 0)
$1f3ec rte                0.00% (189, 5300, 466, 2, 0, 0)

Same thing here, except that the lone "bra" instruction
actually gets 3x hits for each executed instruction,
and none of the loop instructions are missing cycles.

(The few i-cache misses are likely due to some interrupt
handler(s) running in the background.)

So, the above corresponds somewhat to what I saw in the code,
where the (Hatari specific) CpuInstruction struct gets updated.

How often is instruction prefetch supposed to happen on the 030
when non-branching code is being executed?


I wanted to add details to my latest mail, but as you guessed, the differences you see are indeed mostly due to the prefetch / pipeline inside the 68020/30.

For the details, see section "11.2.2 Instruction Pipe" in the 68030 user's manual.

Basically, the CPU has an internal 32-bit register named the "cache holding register" (CAHR). This register is used to fill the internal stages A, B, C and D of the CPU.

One of the differences from the 68000 is that this register is 32 bits wide, while on the 68000 the prefetch register is 16 bits wide. So, on the 68000, you have at least one memory access during every instruction to keep this 16-bit prefetch register filled.

On the 68030, you need to refill only once both words of the cache holding register have been pushed to stage A.

So, if we take the example of a flow of instructions where each instruction is 2 bytes long (e.g. "adda.l d2,a2", "movea.l (a3),a3"), you can see that if the CAHR was filled just before, you can get 1 word without doing an external memory access, and without even doing an i-cache access.

Imagine a flow of 100 NOPs (1 word each): you will get 1 access to the i-cache every 2 instructions (it could be a hit or a miss). For the other instruction of each pair, you get "free" access to the opcode.
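
The NOP example can be sketched with a deliberately simplified model: a holding register that contains one aligned block of fetch_bytes bytes, refilled whenever the next opcode word lies outside that block. This is only an illustration under stated assumptions (straight-line code, no branches, no pipeline stages, no timing); the function name and parameters are hypothetical, and this is not how the CPU core used by Hatari emulates prefetch.

```python
def accesses_per_instruction(start_addr, sizes_in_words, fetch_bytes=4):
    """Per-instruction count of holding-register refills.

    Simplified model (an assumption, not the real 68030 pipeline):
    the register holds one aligned block of `fetch_bytes` bytes and is
    refilled when the next opcode word falls outside that block.
    Each refill stands for one i-cache access (hit or miss).
    """
    accesses = []
    held_block = None  # index of the aligned block currently held
    addr = start_addr
    for words in sizes_in_words:
        refills = 0
        for _ in range(words):
            block = addr // fetch_bytes
            if block != held_block:
                refills += 1
                held_block = block
            addr += 2  # one 68k word is 2 bytes
        accesses.append(refills)
    return accesses


nops = [1] * 100  # 100 NOPs, 1 word each

# 32-bit holding register (68030-like): one access per 2 instructions.
per_insn_030 = accesses_per_instruction(0x1000, nops, fetch_bytes=4)

# 16-bit prefetch register (68000-like): one access per instruction.
per_insn_000 = accesses_per_instruction(0x1000, nops, fetch_bytes=2)
```

In this model, sum(per_insn_030) is 50 and the per-instruction pattern alternates 1, 0, 1, 0, so every second NOP shows no i-cache access at all, while sum(per_insn_000) is 100. The start address 0x1000 is arbitrary; a start address that is not long-word aligned shifts the pattern by one instruction.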

Conversely, when you have an instruction involving a branch, the CAHR must be refilled at the new PC, and you will need to access the cache or external memory to do so (so the hit or miss counter will increase).

Note that in the end, this doesn't necessarily mean that the code will be faster (it depends on the RAM speed); it just explains the flow of memory accesses. If your RAM is not capable of 32-bit access (as so-called fast RAM is), refilling the CAHR will take 2 word accesses instead of 1 long-word access.

In the case of the i-cache counters in the profiler, maybe you can add a third case besides hit or miss, such as "prefetch", for when the hit and miss counters are both 0 for the current instruction.
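
As a minimal sketch of that idea, the classification could look like the following. The function name and the labels "not executed" and "mixed" are assumptions added for illustration (counters accumulated over many executions can contain both hits and misses); only "hit", "miss" and "prefetch" come from the suggestion above, and none of this is actual profiler code.

```python
def classify_icache(count, hits, misses):
    """Label the accumulated i-cache counters of one instruction.

    Hypothetical helper: "prefetch" means the instruction was executed
    but never caused an i-cache access of its own, i.e. its opcode was
    already available from an earlier fetch.
    """
    if count == 0:
        return "not executed"
    if hits == 0 and misses == 0:
        return "prefetch"
    if misses == 0:
        return "hit"
    if hits == 0:
        return "miss"
    return "mixed"
```

With the counters from the profile above, "move.l (a3)+,(a2)+" (655200 executions, 0 hits, 0 misses) would be labelled "prefetch", "dbra" (1310400 hits, 0 misses) "hit", "cmpi.w #1,$21c50" (0 hits, 189 misses) "miss", and "tst.b $21cac" (71 hits, 71 misses) "mixed".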

