Re: [hatari-devel] Hatari profiling question



On 10/02/2021 at 20:52, Eero Tamminen wrote:
Hi,

Profiler just sums the data provided by WinUAE
core, so Toni would need to answer this.

Hi

if you look at the code in newcpu.c, you can see that it's surrounded by

#ifdef WINUAE_FOR_HATARI

so it's code specific to Hatari, not Toni's code :)


On 10.2.2021 10.52, Christian Zietz wrote:
On 7.2.2021 10.54, Christian Zietz wrote:
That's what I used for my initial analysis in this case, too. Although,
tbh, I'm not fully sure how to interpret the results. I suppose some
instructions are counted with zero cycles because they're fully absorbed
by surrounding instructions? But how can an instruction that is executed
1 million times be responsible for 2 million instruction cache misses?

Can you provide example profiler disassembly?

(Maybe on the Hatari mailing list?)

From my EmuTOS VDI profiling:

Toni, the profiler values in parentheses are totals of the following:
- executed instructions
- used cycles
- i-cache misses
- d-cache hits

(values are for the corresponding memory address,
i.e. data is understandable only as long as code
at the address hasn't changed, which is the case
here as it's in ROM :-))


$00e21794 : adda.w    d2,a0       2.79% (1178064, 2361790, 1419, 0)
$00e21796 : adda.w    d3,a1       2.79% (1178064, 2353471, 38, 0)
$00e21798 : move.l    d1,d0       2.79% (1178064, 7068967, 1178050, 0)
$00e2179a : move.w    (a0),d0     2.79% (1178064, 10689954, 190, 577010)
$00e2179c : swap      d0          2.79% (1178064, 7069362, 1178150, 0)
$00e2179e : move.l    d0,d1       2.79% (1178064, 76, 0, 0)
$00e217a0 : rol.l     d4,d0       2.79% (1178064, 76, 0, 0)
$00e217a2 : jmp       (a2)        2.79% (1178064, 14139013, 2356248, 0)

Like I said, I'm not sure how to interpret the I-cache misses,
particularly in the last line. Is it because it's a JMP and both the
cache miss while fetching the instruction as well as the cache miss
while fetching the jump target count towards the number? Or is it
because the cache misses for the instruction *preceding* the JMP (ROL.L,
shown with 0 I-cache misses) are counted towards the JMP instruction?
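
The per-execution ratios behind this question can be read off the listing with a small script. This is only a sketch, not part of Hatari: the regular expression and the function name per_execution are assumptions based solely on the line format shown in this thread, and the four values in parentheses are taken to be (executed instructions, cycles, i-cache misses, d-cache hits) as described above.

```python
import re

# Assumed line format, copied from the listing in this thread, e.g.
# $00e217a2 : jmp       (a2)        2.79% (1178064, 14139013, 2356248, 0)
LINE_RE = re.compile(
    r"^\$(?P<addr>[0-9a-fA-F]+)\s*:\s*(?P<insn>.+?)\s+"
    r"(?P<pct>[\d.]+)%\s*"
    r"\((?P<count>\d+),\s*(?P<cycles>\d+),\s*(?P<imiss>\d+),\s*(?P<dhit>\d+)\)\s*$"
)


def per_execution(line):
    """Return per-execution averages for one profiler line, or None.

    None is returned when the line does not match the assumed format
    or when the instruction was never executed.
    """
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    count = int(m["count"])
    if count == 0:
        return None
    return {
        "addr": int(m["addr"], 16),
        "insn": " ".join(m["insn"].split()),  # collapse column padding
        "cycles": int(m["cycles"]) / count,
        "i_miss": int(m["imiss"]) / count,
        "d_hit": int(m["dhit"]) / count,
    }


if __name__ == "__main__":
    sample = "$00e217a2 : jmp       (a2)        2.79% (1178064, 14139013, 2356248, 0)"
    print(per_execution(sample))
```

For the JMP line this gives roughly 12 cycles and 2 i-cache misses per execution, and for the ROL.L line 0 misses, which is exactly the pattern being asked about.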

Regarding cycles, note that, as is often pointed out, 68020/30 cycle
accounting is not accurate at the moment, especially when the cache is
involved (or the MMU (or MMU+cache :) )). So you might get some global %,
but on a small piece of code consisting of just a few instructions, slight
cycle differences in emulation can make a big difference compared with
results on real HW.

To see how data/instruction cache hits and misses are counted, see newcpu.c and search for
CpuInstruction.D_Cache_miss
CpuInstruction.D_Cache_hit

CpuInstruction.I_Cache_miss
CpuInstruction.I_Cache_hit

The data cache counters are updated when the data is accessed; for instructions, the hit/miss refers to the prefetched word(s) for the next instruction(s).
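
To illustrate what "the hit/miss refers to the prefetched word(s)" can mean for the per-address totals, here is a toy model. It is not Hatari code and does not claim to reproduce the numbers in the listing: the function name attribute_misses, the 16-byte line size, and the addresses in the example are all assumptions made for illustration. The only idea it shows is that a miss is charged to the instruction that is executing when the prefetch happens, not to the instruction whose words are being fetched.

```python
def attribute_misses(trace, cached_lines=(), line_size=16):
    """Toy model: charge prefetch misses to the executing instruction.

    trace        -- list of (name, [addresses prefetched while it executes])
    cached_lines -- line numbers (address // line_size) already in the cache
    line_size    -- assumed cache line size in bytes (illustrative only)
    """
    cached = set(cached_lines)
    misses = {}
    for name, prefetched in trace:
        charged = 0
        for addr in prefetched:
            line = addr // line_size
            if line not in cached:
                charged += 1
                cached.add(line)  # the line is filled after the miss
        misses[name] = misses.get(name, 0) + charged
    return misses


if __name__ == "__main__":
    # Hypothetical trace: the ROL.L prefetches nothing new, while the JMP
    # refills the prefetch from two lines that are not cached yet.
    trace = [
        ("rol.l d4,d0", []),
        ("jmp (a2)", [0x1000, 0x1010]),
    ]
    print(attribute_misses(trace))
```

In this model an instruction can be charged more misses than its execution count, and an instruction whose own words were prefetched earlier can show zero.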

So, regarding Christian's optimisation of a very small blitter loop, I'm afraid the profiler can only give a broad idea, not something 100% precise. When hunting for the last cycles to remove, real HW could give different results from Hatari.

Nicolas


