Re: [hatari-devel] Suspicious instruction & data cache hit/miss accounting

To: hatari-devel@xxxxxxxxxxxxxxxxxxx
Subject: Re: [hatari-devel] Suspicious instruction & data cache hit/miss accounting
From: Nicolas Pomarède <npomarede@xxxxxxxxxxxx>
Date: Tue, 6 Feb 2018 14:29:59 +0100

On 06/02/2018 at 12:11, Eero Tamminen wrote:

Hi,

On 02/02/2018 11:37 PM, Nicolas Pomarède wrote:

On 02/02/2018 at 21:54, Eero Tamminen wrote:

Here's example disassembly from EmuTOS 0.9.9.1 on Falcon emu.

Instructions which have either zero instruction cache hits & misses,
or zero data cache hits & misses, are marked with '*':

[...]

As you can see, they're the majority (as indicated by
the profiler cache hit/miss histogram).

If you want more output, I pushed a commit that shows the info
after you set "DEBUG" to 1 in profilecpu.c, rebuild Hatari,
start Falcon or TT emulation, and enable profiling:
https://hg.tuxfamily.org/mercurialroot/hatari/hatari/rev/822222b90afb

It's common enough that you see it immediately, regardless
of what you run and on what 030 TOS version.

Regarding the data cache, most instructions in these lines are writing data, not reading it. So it seems normal that there's no hit/miss when writing, only when reading.


Ok.

(I'll add a reminder to cache histogram info that data cache
events can happen only for instructions doing data reads.)

As for the instruction cache, do you have another example where some small piece of code is repeated in a loop but there is no hit/miss for the instruction cache? Such a case would indeed be strange, as instructions are likely to go into the cache during a small loop.


I changed the cache debugging code to include both hits & misses
for both instruction & data cache in the disassembly.

Attached is profile for beginning of New Beat's Falcon demo
called "Blue".  It has several short loops.

The items inside parenthesis are:
- instruction execution count for given address
- cycle count
- i-cache hits
- i-cache misses
- d-cache hits
- d-cache misses
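
For post-processing such profiles, the six values listed above can be pulled out of each disassembly line with a small script. This is only a sketch: it assumes the line layout seen in the listings in this mail (address with an optional colon, instruction text, percentage, then six integers in parentheses), and the names `ProfileLine` and `parse_profile_line` are made up for illustration, not part of Hatari.

```python
import re
from typing import NamedTuple


class ProfileLine(NamedTuple):
    """One disassembly line with its profiler counters (hypothetical type)."""
    address: int
    instruction: str
    percent: float
    count: int
    cycles: int
    icache_hits: int
    icache_misses: int
    dcache_hits: int
    dcache_misses: int


# Assumed layout: "$addr[:] instruction  NN.NN% (six comma-separated integers)"
_LINE_RE = re.compile(
    r"^\$(?P<addr>[0-9a-fA-F]+):?\s+(?P<instr>.+?)\s+"
    r"(?P<pct>\d+\.\d+)%\s+\((?P<fields>[^)]*)\)\s*$"
)


def parse_profile_line(line: str) -> ProfileLine:
    """Parse one profile line, raising ValueError if it does not fit the layout."""
    m = _LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a profile line: {line!r}")
    fields = [int(f) for f in m.group("fields").split(",")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 counters, got {len(fields)}")
    # Collapse the column padding inside the instruction text.
    instruction = " ".join(m.group("instr").split())
    return ProfileLine(int(m.group("addr"), 16), instruction,
                       float(m.group("pct")), *fields)
```

Lines that do not fit the assumed layout (separators, "...") raise `ValueError`, so a caller can simply skip them.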

The simplest loop (with code surrounding it) looks like this:
------------------------------------------------------------------
$0001f772: adda.l   d2,a2        0.00% (91, 0, 0, 0, 0, 0)
$0001f774: movea.l  $21c9a,a3    0.00% (91, 728, 182, 0, 0, 0)
$0001f77a: movea.l  (a3),a3      0.00% (91, 728, 0, 0, 0, 0)
$0001f77c: move.w   #$1c1f,d5    0.00% (91, 0, 91, 0, 0, 0)

$0001f780: move.l   (a3)+,(a2)+  8.05% (655200, 10483200, 0, 0, 0, 0)
$0001f782: dbra     d5,$1f780    8.05% (655200, 0, 1310400, 0, 0, 0)

$0001f786: rts                   0.00% (91, 819, 182, 0, 0, 0)
------------------------------------------------------------------

As can be seen from the disassembly stats for the loop,
i-cache data is there only for the branching instruction
(as I deduced from Hatari code).

"dbra" gets 2x i-cache hits for each executed instruction, and
no cycles, whereas the other loop instruction gets all cycles.

Branching at "rts" also gets 2x i-cache hits, and cycles.

Are the hits for instructions leading to the loop, due to
there being prefetch done on them and there naturally being
a hit as there's no diverging code-flow?


Then the other loop with 2+1 instructions:
------------------------------------------------------------------
$1f302 tst.b  $21cac     27.93% (2271792, 17040162, 71, 71, 0, 0)
$1f308 beq    $1f3e0     27.93% (2271791, 14767904, 4543233, 407, 0, 0)

$1f30c cmpi.w #1,$21c50   0.00% (189, 2268, 0, 189, 0, 0)
...
$1f3da clr.b  $21cac      0.00% (189, 756, 0, 0, 0, 0)

$1f3e0 bra    $1f302     27.93% (2271790, 18175267, 6815453, 95, 0, 0)

$1f3e4 move.b #1,$21cac   0.00% (189, 2271, 0, 189, 0, 0)
$1f3ec rte                0.00% (189, 5300, 466, 2, 0, 0)
------------------------------------------------------------------

Same thing here, except that the "bra" instruction that's
alone gets actually 3x hits for each executed instruction,
and none of the loop instructions is missing cycles.
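
The "2x" and "3x" figures can be checked by dividing i-cache hits by the execution count. The snippet below does that for the branching instructions, with the counters copied from the two listings above; the function name is only illustrative.

```python
# (execution count, i-cache hits) copied from the two listings above
samples = {
    "dbra d5,$1f780": (655200, 1310400),
    "rts": (91, 182),
    "beq $1f3e0": (2271791, 4543233),
    "bra $1f302": (2271790, 6815453),
}


def hits_per_execution(count: int, hits: int) -> float:
    """Average number of i-cache hits recorded per executed instruction."""
    return hits / count


for name, (count, hits) in samples.items():
    print(f"{name:16s} {hits_per_execution(count, hits):.4f}")
```

"dbra" and "rts" come out at exactly 2, "beq" just under 2 and "bra" just over 3.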

(The few i-cache misses are likely due to some interrupt
handler(s) running in the background.)


So, the above corresponds somewhat to what I saw in the code,
where the (Hatari specific) CpuInstruction struct gets updated.

How often is instruction prefetch supposed to happen on the 030
when non-branching code is being executed?

Hi

I wanted to add details to my latest mail, but as you guessed, the differences you see are indeed mostly due to the prefetch / pipeline inside the 68020/30.


For the details, see "11.2.2 instruction pipe" in the 68030 user manual doc.

Basically, the CPU has an internal 32-bit register named the "cache holding register" (CAHR). This register is used to fill the internal stages A, B, C and D of the CPU.

One of the differences from the 68000 is that this register is 32 bits wide, while on the 68000 it's 16 bits. So, on the 68000, you have at least one memory access during every instruction to keep this 16-bit prefetch register filled.

On the 68030, you only need to refill once both words of the cache holding register have been pushed to stage A.

So, if we take the example of a flow of instructions where each instruction is 2 bytes (e.g. "adda.l d2,a2", "movea.l (a3),a3"), you can see that if the CAHR was filled just before, then you can get 1 word without doing an external memory access, and without even doing an i-cache access.

Imagine a flow of 100 NOPs (1 word each): you will get 1 access to the i-cache every 2 instructions (it could be a hit or a miss). For every second instruction, you get a "free" access to the opcode.
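
The NOP example can be reproduced with a toy model that only tracks which aligned block currently sits in the holding register. This is a deliberately simplified sketch: it ignores pipeline stages, branches and timing, and it does not reproduce how Hatari attributes hits to individual instructions. The function name and parameters are hypothetical.

```python
def refills_per_instruction(instr_sizes, start_addr=0, reg_bytes=4):
    """Return, for each instruction, how many holding-register refills it triggers.

    instr_sizes -- instruction lengths in bytes (multiples of 2)
    start_addr  -- address of the first instruction
    reg_bytes   -- width of the holding register (4 for a 68030-like CPU,
                   2 for a 68000-like 16-bit prefetch register)
    """
    held = None  # base address of the aligned block currently in the register
    addr = start_addr
    result = []
    for size in instr_sizes:
        refills = 0
        # Walk through every 16-bit word the instruction occupies.
        for word_addr in range(addr, addr + size, 2):
            block = word_addr - (word_addr % reg_bytes)
            if block != held:
                # Word is not in the register: one cache/memory access needed.
                held = block
                refills += 1
        result.append(refills)
        addr += size
    return result


nops = [2] * 100  # 100 NOPs, one word each
print(sum(refills_per_instruction(nops, reg_bytes=4)))  # 32-bit holding register
print(sum(refills_per_instruction(nops, reg_bytes=2)))  # 16-bit prefetch register
```

With the NOPs starting on a long-word boundary, the 4-byte register needs 50 refills (every second NOP gets its opcode for free), while the 2-byte case needs 100, one per instruction.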

On the contrary, when you have an instruction involving a branch, CAHR must be refilled at the new PC, and you will need to access the cache / external memory to do so (so the hit or miss counter will increase).

Note that in the end, this doesn't necessarily mean that the code will be faster (it depends on the RAM speed); it just explains the flow of memory accesses. If your RAM is not capable of 32-bit access (as so-called fast RAM is), refilling CAHR will take 2 word accesses instead of 1 long word access.

For the i-cache counters in the profiler, maybe you can add a third case besides hit and miss, like "prefetch", for when the hit and miss counters are both 0 for the current instruction.
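
As a rough illustration of that idea (this is not the code in profilecpu.c, and the names are invented), the accounting for one executed instruction could look like this:

```python
from collections import Counter


def account_icache(stats: Counter, hits: int, misses: int) -> None:
    """Add one executed instruction's i-cache events to the totals.

    An instruction with neither a hit nor a miss is assumed to have been
    served from the already filled holding register and is counted
    under "prefetch".
    """
    stats["hit"] += hits
    stats["miss"] += misses
    if hits == 0 and misses == 0:
        stats["prefetch"] += 1


stats = Counter()
# (hits, misses) for four executed instructions, made-up values
for hits, misses in [(0, 0), (2, 0), (0, 0), (1, 1)]:
    account_icache(stats, hits, misses)
print(dict(stats))
```

Here "hit" and "miss" count cache events, while "prefetch" counts instructions, so the three totals are not directly comparable without keeping that in mind.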


Nicolas
