Hi,
On 02/02/2018 11:37 PM, Nicolas Pomarède wrote:
Le 02/02/2018 à 21:54, Eero Tamminen a écrit :
Here's an example disassembly from EmuTOS 0.9.9.1 on Falcon emu.
Instructions which have either zero instruction cache hits & misses,
or zero data cache hits & misses, are marked with '*':
[...]
As you can see, they're the majority (as indicated by
the profiler cache hit/miss histogram).
If you want more output, I pushed a commit that shows the info
once you set "DEBUG" to 1 in profilecpu.c, rebuild Hatari,
start Falcon or TT emulation, and enable profiling:
https://hg.tuxfamily.org/mercurialroot/hatari/hatari/rev/822222b90afb
It's common enough that you see it immediately, regardless
of what you run and on what 030 TOS version.
Regarding the data cache, most instructions in these lines are writing
data, not reading it. So it seems normal that there's no hit/miss
when writing, only when reading.
Ok.
(I'll add a reminder to the cache histogram info that data cache
events can happen only for instructions doing data reads.)
As for the instruction cache, do you have another example where some
small piece of code is repeated in a loop but there is no
hit/miss for the instruction cache? Such a case would indeed be strange,
as instructions are likely to go into the cache during a small loop.
I changed the cache debugging code to include both hits & misses
for both instruction & data cache in the disassembly.
Attached is a profile for the beginning of New Beat's Falcon demo
called "Blue". It has several short loops.
The items inside parenthesis are:
- instruction execution count for given address
- cycle count
- i-cache hits
- i-cache misses
- d-cache hits
- d-cache misses
The simplest loop (with code surrounding it) looks like this:
------------------------------------------------------------------
$0001f772: adda.l d2,a2 0.00% (91, 0, 0, 0, 0, 0)
$0001f774: movea.l $21c9a,a3 0.00% (91, 728, 182, 0, 0, 0)
$0001f77a: movea.l (a3),a3 0.00% (91, 728, 0, 0, 0, 0)
$0001f77c: move.w #$1c1f,d5 0.00% (91, 0, 91, 0, 0, 0)
$0001f780: move.l (a3)+,(a2)+ 8.05% (655200, 10483200, 0, 0, 0, 0)
$0001f782: dbra d5,$1f780 8.05% (655200, 0, 1310400, 0, 0, 0)
$0001f786: rts 0.00% (91, 819, 182, 0, 0, 0)
------------------------------------------------------------------
As can be seen from the disassembly stats for the loop,
i-cache data is there only for the branching instruction
(as I deduced from the Hatari code).
"dbra" gets 2x i-cache hits for each executed instruction, and
no cycles, whereas the other loop instruction gets all cycles.
Branching at "rts" also gets 2x i-cache hits, and cycles.
Are the hits for the instructions leading up to the loop due to
prefetch being done on them, which naturally results in a hit
as the code flow does not diverge?
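
The per-execution ratios mentioned above can be checked with a small
script. This is only a sketch: parse_stats() and FIELDS are hypothetical
helper names (not part of Hatari), and the field order is assumed to be
the one listed earlier in this mail.

```python
import re

# Assumed field order, taken from the list earlier in this mail:
# (count, cycles, i-cache hits, i-cache misses, d-cache hits, d-cache misses)
FIELDS = ("count", "cycles", "i_hits", "i_misses", "d_hits", "d_misses")

def parse_stats(line):
    """Return (address, stats dict) for one profile disassembly line."""
    # The address is the first token, e.g. "$0001f780:"
    addr = int(line.split()[0].lstrip("$").rstrip(":"), 16)
    # The stats are the last parenthesized group; operands such as
    # "(a3)+" do not match because they contain non-digit characters.
    values = re.search(r"\(([\d,\s]+)\)\s*$", line).group(1)
    return addr, dict(zip(FIELDS, (int(v) for v in values.split(","))))

# The two loop instructions, copied from the disassembly above
loop = [
    "$0001f780: move.l (a3)+,(a2)+ 8.05% (655200, 10483200, 0, 0, 0, 0)",
    "$0001f782: dbra d5,$1f780 8.05% (655200, 0, 1310400, 0, 0, 0)",
]

for line in loop:
    addr, s = parse_stats(line)
    print(f"${addr:06x}: {s['i_hits'] / s['count']:.1f} i-cache hits, "
          f"{s['cycles'] / s['count']:.1f} cycles per execution")
```

For these two lines it gives 2.0 i-cache hits and 0.0 cycles per
execution for "dbra", and 0.0 hits and 16.0 cycles for "move.l".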
Then the other loop with 2+1 instructions:
------------------------------------------------------------------
$1f302 tst.b $21cac 27.93% (2271792, 17040162, 71, 71, 0, 0)
$1f308 beq $1f3e0 27.93% (2271791, 14767904, 4543233, 407, 0, 0)
$1f30c cmpi.w #1,$21c50 0.00% (189, 2268, 0, 189, 0, 0)
...
$1f3da clr.b $21cac 0.00% (189, 756, 0, 0, 0, 0)
$1f3e0 bra $1f302 27.93% (2271790, 18175267, 6815453, 95, 0, 0)
$1f3e4 move.b #1,$21cac 0.00% (189, 2271, 0, 189, 0, 0)
$1f3ec rte 0.00% (189, 5300, 466, 2, 0, 0)
------------------------------------------------------------------
Same thing here, except that the lone "bra" instruction
actually gets 3x hits for each executed instruction,
and none of the loop instructions are missing cycles.
(The few i-cache misses are likely due to some interrupt
handler(s) running in the background.)
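
To put numbers on how small the miss counts are for this loop, here is
a rough sketch. The counts are copied from the disassembly above;
icache_summary() is a hypothetical helper, not anything in Hatari.

```python
# (executions, i-cache hits, i-cache misses), copied from the profile above
loop2 = {
    "tst.b": (2271792, 71, 71),
    "beq":   (2271791, 4543233, 407),
    "bra":   (2271790, 6815453, 95),
}

def icache_summary(count, hits, misses):
    """Return (hits per execution, misses as a share of all i-cache events)."""
    events = hits + misses
    miss_share = misses / events if events else 0.0
    return hits / count, miss_share

for name, stats in loop2.items():
    per_exec, miss_share = icache_summary(*stats)
    print(f"{name:6s} {per_exec:.2f} hits/exec, {miss_share:.4%} misses")
```

For both branch instructions the misses are well below 0.1% of their
i-cache events, and the hits come out at about 2 per execution for "beq"
and about 3 for "bra".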
So, the above corresponds somewhat to what I saw in the code,
where the (Hatari specific) CpuInstruction struct gets updated.
How often is instruction prefetch supposed to happen on the 030
when non-branching code is being executed?