[hatari-devel] Re: Suspicious profiler output

[ Thread Index | Date Index | More lists.tuxfamily.org/hatari-devel Archives ]


I'm CCing this to hatari-devel as info may be useful also for others
using profiler.

On 09/18/2017 04:54 AM, Miro Kropáček wrote:
An update: I have managed to make the graphs a bit more sensible -- I
didn't realise that -Wl,--tradional-format cuts symbol names after some
length! But strangely, both problems persist.

But I have noticed that in "executed instructions" and "used cycles" graphs
the numbers are correct -- all three functions are marked as 300x executed.
But "visit/calls" have the numbers different (display screen: 600x even
though its successor is 300x and atari800 frame: 600000x even though its
successors are 300x again...)

Maybe I'm just not understanding the notation?

It may be clearer if you add "-t" option to the profile generation
and start from the ASCII output:

  10.60%              183040             set16_2
  10.56%              182362             set256
  10.01%  14.51%      172899    250553   _CPU_GO
   8.75%              151198             copy256
   5.84%              100797             copy256_d
   5.41%  10.81%       93362    186708   _POKEY_Scanline
   5.41%   5.41%       93351     93351   _INPUT_Scanline
   0.02%   0.02%         300       300   _PLATFORM_Keyboard
   0.02%   0.02%         300       300   _rplanes
   0.02%   0.03%         300       600   _PLATFORM_DisplayScree
   0.02%                 300             _null_consol_sound
   0.02%   0.02%         299       299   _GTIA_Frame
   0.02%   0.02%         299       299   _PBI_BB_Frame
   0.02%   0.02%         299       299   _Devices_Frame
   0.02%   0.02%         299       299   _VOTRAXSND_Frame
   0.02%  36.16%         299    624542   _Atari800_Frame

I.e. calls to _Atari800_Frame() symbol address are 0.02% of all symbols
called within the profiling run, and profiler heuristics were able to
track 36% of all the calls to be originating under _Atari800_Frame().

On 17 September 2017 at 22:02, Miro Kropáček <miro.kropacek@xxxxxxxxx>
I'd write this to the Hatari mailing list but maybe it's nothing -- I have
a problem to interpret Hatari's profiler output.

Basically what I did:

    - started the app (atari800.gtp) in Hatari
    - when I was in the screen I was interested in, invoked the debugger
    - a _Atari800_Frame :300
    - profile on
    - ...
    - profile save atari800-profile.txt
    - gst2ascii -l -o ./atari800.gtp > atari800.sym
    - hatari_profile.py -p -g -r atari800.sym atari800-profile.txt
    - dot -Tsvg atari800-profile-0.dot > atari800-profile-0.svg

Now what I can't understand is:

    - if _Atari800_Frame takes 36% of all calls, where's the rest?

While the per-symbol direct call counts are correct, and total 100%,
inclusive costs use heuristics.

That means there are different ways in which the program, compiler
generated code, and the symbols provided by the compiler/user can
mislead Hatari profiler call tracking.

Hatari tracks call hierarchy in following way:

* When a a "subroutine call" (JSR/BSR) or "exception" instruction
  switches PC to an address listed in symbol table, that is tracked
  at run-time as a call in the callgraph
* When "subroutine return" or "exception return" instruction returns
  PC to address from which previous subroutine was called from, accrued
  run-time costs are transferred to the context of the calling symbol
* only address sites which have a symbol present *when profiling is
  active* are tracked like this

Above rules are to avoid tracking e.g. labeled loops as functions
(profiler would quickly run out of call depth when each backwards
jump from loop end to labeled loop start would be a "call").

If symbol gets called in any other way, e.g. by jumping, as can
happen with code generated by GCC for static functions, or by stack
manipulation (e.g. EmuTOS does some function calls using RTS!), that
doesn't get recorded in as a call (except EmuTOS thing which I've
special-cased in profiler).

You need to investigate the symbol calls sites in the profiler's
annotated assembly output to find out what's happening...

Especially when my code is like this:

         for (;;) {
INPUT_key_code = PLATFORM_Keyboard();
if (Atari800_display_screen)

I.e. PLATFORM_Keyboard(), Atari800_Frame() and PLATFORM_DisplayScreen()
should have equal number of calls.

They do have.  Well, Atari800_Frame() has one less (299) than the others
(300), but I think this just due to where you started and stopped
the profile.

    - why are some parts like _RTI or _CPU_rts_handler completely
    isolated? They are surely called from Atari800_Frame() for sure

While _RTI symbol address is visible in the profiler assembly, I didn't
find from the assembly anything that calls that address (although it's
executed).  I.e. it's called by some other means, not through something
Hatari tracks as call from the executed assembly instruction.

And as you can see from the assembly, _CPU_rts_handler() address
is *never* executed, only addresses slightly after that, i.e. there
really is nothing that directly calls it.

However, when the profiler assembly is post-processed, it accumulates
the instruction costs to whatever symbol happens to be closest before
them.  That's why you see _CPU_rts_handler() as node in graph.

It can be correct function, if something had jumped in middle of that
function, or an incorrect function, if you're lacking a symbol for
the correct one.

-> It's important that your symbol table contains all symbols.

My suspect would be the symbols (there's a lot of ignored/skipped warnings
printed) but I have no clue what to do about it?

* "replacing" warnings are harmless, they're normally just libc aliases
  for the same functionality

* "renaming" warnings you could look into.  Why you have so many
  things with duplicate names, are there e.g. duplicate functions
  in multiple objects?

* "ignoring" warnings are for PC address changes that should be
   due to interrupts, not normal subroutine calls.  There aren't
   many of them, so I'm not so worried:
Of all 1714188 switches, ignored 125 for type(s) ['r', 'u', 'x'].

   Only thing suspicious is that by normal statistics I would expect
   half of them to be switches *to* interrupt handler, not returns from
   one.  Based on symbol names, none of them look like interrupt
   handlers, and looking at the assembly, your code doesn't have
   any RTEs, only TOS code uses those.

   -> Using also TOS symbols may help.

If you could take a look and at least point me to the right direction how
to properly profile this binary I'd be grateful. :)

1. Profiling results will be as good or bad as your symbols data.

Make sure to verify from assembly that anything suspicious isn't
lacking symbols (e.g. some MiNTlib builds were lacking symbols
for time functions that took most of the CPU).

2. First look into direct instruction and cycles counts, to
see which functions are bottlenecks and whether they can be

3. Then look into callgraphs to see whether you could reduce
calling of them.

As to  visit/call counts, I would worry only about largest direct
ones, if at all. Inclusive call counts you were looking at are not
really relevant for optimizations.

I would recommend doing profiling with latest EmuTOS 512k Git snapshot
(or release) and giving also EmuTOS symbols for the profiler post-processor. It's nice to see also TOS info in the callgraph.

	- Eero

Mail converted by MHonArc 2.6.19+ http://listengine.tuxfamily.org/