[hatari-devel] Re: Opinions on Hatari profiler "user interface"?

[ Thread Index | Date Index | More lists.tuxfamily.org/hatari-devel Archives ]


Hi,

(Sorry, huge mail, but this is important for anybody interpreting
the profiler post-processor results.)

Next profiler "user interface" issue is showing the caller information
in callgraphs.

Hatari has now code for getting real cumulative costs for functions called
as subroutine calls (with m68k JSR/BSR instructions or their DSP variants),
in addition to the post-processor summing costs listed in the disassembly
and assigning them to a symbol preceeding the instructions having these
costs.


Both methods are needed because they have their own pros & cons:

1. subroutine call/return method and resulting cost information:

  + This information is accurate, *as long as* program doesn't do any
    wierd stack manipulation.  (I've added special handling for EmuTOS
    task switcher which uses *RTS* to call the subroutines)

    + Callgrind format output for Kcachegrind GUI needs accurate data.

  - This information is available in profile *only* if user loaded symbol
    address information to debugger before starting profiling.

  - Subroutine calls are used only for a subset of functions in optimized
    code, even global functions can get called with BRA & JMP, so showing
    data just for subroutine calls could miss *a lot* of the details in
    the code.


2. in-between-symbols method and resulting cost information:

+/- Symbols can be anything; from functions to loop labels.  I.e. single
    function cost could be split to function entry, and different loops  
    it has, with function apparently calling loop label etc.

  + It's nice to have this detailed information on how program is structured
    and where exactly the costs go. (different "call" instruction types are
    indicated with different kind of arrows in callgraphs)

  - With this method one cannot get total costs for higher level symbols
    that call the low level functionality.

  - This information can be easily misleading, not just when there are
    additional symbols (for loop etc labels), but especially when symbols
    are missing.   Then completely unrelated costs will	be assigned to
    a symbol preceeding instructions that have these costs.
	
	NOTE: always give all relevant symbols to the post-processor, regarless
    of whether profile data already contains symbol information.


Current callgraphs produced by the post-processing script by default show
information from latter method, because subroutine information is optional.


If caller information was included in the profile data,
with the post-processor "-p" option:
- "top" lists show subroutine call costs
- callgraph nodes for functions that are called as subroutines
  (i.e. with BSR/JSR), show subroutine information instead
  (-> such nodes have different shape)

See the attached example (Bad Mood CPU side).

The problem with the "-p" option is that:
- different (looking) nodes contain different information
- subroutine costs may not match costs acquired with the in-between-symbols
  method for above mentioned reasons

While having more information is good, it can be confusing to people trying
to interpret the graphs.


When it looks like there are conflicts in this data, it's an indication
that provided symbol data was incomplete, and the issue can be checked
from the profile disassembly information.  Such issues are easiest to
see from "top" lists.

For example profile information of EmuTOS boot (first column is
in-between-symbols cost, second column is total from subroutine calls):
------------
Calls:
 10.80%   10.83%      1800  ___mulsi3
  8.03%    8.09%      1339  _memcpy
  5.20%                866  _int_timerc
  4.61%                768  _call_user_wheel
  3.86%    3.89%       643  _min
  3.76%    3.80%       627  _sti
  3.76%    3.76%       627  _cli
  3.34%    3.35%       556  _bcostat4
  1.88%    0.78%       314  _gsx2
  1.88%   23.13%       314  _screen
  1.88%   25.05%       314  _GSX_ENTRY
  1.88%    1.88%       313  _get_vwk_by_handle
  1.87%                312  _win_sinfo
  1.87%   45.38%       312  _gem
  1.87%                311  _back
  1.87%   37.21%       311  _super
  1.86%    1.86%       310  _rs_obfix
  1.70%    1.69%       283  _gettext
  1.67%   23.56%       278  _gsx_ncode
  1.58%                263  _int_vbl
  1.47%   25.74%       245  _rsrc_obfix
  1.30%    1.30%       216  _kb_timerc_int
  1.30%    1.30%       216  _sndirq

Executed instructions:
 61.63%   61.85%   1897542  _timeout_gpip
  8.60%    9.32%    264918  _draw_rect
  4.96%    4.97%    152579  _run_calibration
  4.06%    4.08%    124944  _blank_out
  3.22%              99134  meminit
  1.86%    1.88%     57254  _abline
  1.43%    1.46%     43936  _text_blt
  1.30%    0.02%     39985  _get_start_addr
  1.05%              32332  _act_bdown

Used cycles:
 61.62%   62.05%  22773364  _timeout_gpip
  5.67%    6.82%   2093624  _draw_rect
  4.29%            1586108  meminit
  4.15%    4.18%   1534476  _run_calibration
  3.04%    3.07%   1121752  _blank_out
  2.98%    3.08%   1100712  _stop_until_interrupt
  1.45%    1.49%    536520  _abline
  1.20%    1.25%    443000  _text_blt
  1.05%    1.06%    389052  ___mulsi3
  1.01%    0.03%    371820  _get_start_addr
------------

....shows total subroutine calls count for "_gsx2" being smaller
than the normal call count:
  1.88%    0.78%       314  _gsx2

This is because most of the calls to it were were jumps/branches
and it itself didn't do any subroutine calls that would be added
to the total.

This can be seen from the profile disassembly output:
------------
_gsx2:
$e0347c : move.l    #$1fe94,$48a8              0.01% (314, 8792, 0)
$e03486 : move.l    #$48a8,d1                  0.01% (314, 3768, 0)
$e0348c : moveq     #$73,d0                    0.01% (314, 1256, 0)
$e0348e : trap      #2                         0.01% (314, 10048, 0)
$e03490 : rts                                  0.01% (314, 5024, 0)
------------
$ grep e0347c etos-boot.txt
$e387d4 : jsr       $e0347c                    0.00% (22, 440, 0)
$e38822 : jsr       $e0347c                    0.00% (4, 80, 0)
$e38f40 : jmp       $e0347c                    0.00% (10, 120, 0)
$e39b0a : jmp       $e0347c                    0.01% (278, 3336, 0)
0xe0347c: 0xe39b0a = 278 b, 0xe387d4 = 22 s 110/2276/32864 22/462/6600, 
0xe38f40 = 10 b, 0xe38822 = 4 s 20/424/6208 4/84/1200, _gsx2
------------


In the case of "_get_start_addr" instruction & cycle count:
  1.30%    0.02%     39985  _get_start_addr

The in-between method accounts unrelated costs to "_get_start_addr":
------------
_get_start_addr:
$e33782 : move.l    d2,-(sp)                   0.00% (24, 288, 0)
$e33784 : move.w    8(sp),d2                   0.00% (24, 288, 0)
$e33788 : andi.w    #$fff0,d2                  0.00% (24, 192, 0)
$e3378c : moveq     #0,d0                      0.00% (24, 96, 0)
$e3378e : move.w    $15fa,d0                   0.00% (24, 384, 0)
$e33794 : lea       $e0aba0,a0                 0.00% (24, 288, 0)
$e3379a : move.b    (a0,d0.l),d0               0.00% (24, 384, 0)
$e3379e : ext.w     d0                         0.00% (24, 96, 0)
$e337a0 : lsr.w     d0,d2                      0.00% (24, 192, 0)
$e337a2 : andi.l    #$ffff,d2                  0.00% (24, 384, 0)
$e337a8 : add.l     $44e,d2                    0.00% (24, 576, 0)
$e337ae : moveq     #0,d0                      0.00% (24, 96, 0)
$e337b0 : move.w    $15fc,d0                   0.00% (24, 384, 0)
$e337b6 : move.l    d0,-(sp)                   0.00% (24, 288, 0)
$e337b8 : movea.w   $e(sp),a0                  0.00% (24, 288, 0)
$e337bc : move.l    a0,-(sp)                   0.00% (24, 288, 0)
$e337be : jsr       $e60790                    0.00% (24, 480, 0)
$e337c4 : addq.l    #8,sp                      0.00% (24, 192, 0)
$e337c6 : add.l     d2,d0                      0.00% (24, 192, 0)
$e337c8 : move.l    (sp)+,d2                   0.00% (24, 288, 0)
$e337ca : rts                                  0.00% (24, 384, 0)
[...]
$e337ce : lea       $ffec(sp),sp               0.00% (11, 88, 0)
....
$e33c60 : move      d0,sr                      0.00% (72, 864, 0)
$e33c62 : rts                                  0.00% (72, 1152, 0)
[...]
__v_show_c:
$e33dd6 : movea.l   $1602,a0                   0.00% (12, 240, 0)
------------
$ grep -A2 _get_start_addr etos512k.sym 
0x00e33782 T _get_start_addr
0x00e33c64 T _v_locator
0x00e33dd6 T __v_show_c
-------------

I.e. the actual "_get_start_addr" function is in $e33782-$e337ca, but
because EmuTOS symbols are missing for code between $e337ce-$e33c62,
(large) costs for that area get assinged for "_get_start_addr"
with the in-between-symbols method.


My question is whether you think there's something I can do to make all
this clearer to the users of the profiler information?


	- Eero

PS. Note that while information needed for in-between-symbols is in
the profile data, post-processor cannot produce graphs that would contain
just subroutine call information.  Adding support for that that would
require changes also to profile data produced by Hatari, and some option
in profiler to select what kind data is collected...

Attachment: bmsym-2-crop.pdf
Description: Adobe PDF document



Mail converted by MHonArc 2.6.19+ http://listengine.tuxfamily.org/