Re: [hatari-devel] dsp profiler mods

[ Thread Index | Date Index | More lists.tuxfamily.org/hatari-devel Archives ]


Hi,

On keskiviikko 20 helmikuu 2013, Douglas Little wrote:
> p:099d  200056         (2:0)  and y0,a
>     u:00.66647% c:00.31896% (U:  240240 avC:2.00 avE:0.00 V:0)
> p:099e  5e5cc8         (4:2)  mpy +x0,y1,b a,y:(r4)+
>     u:00.66647% c:00.63792% (U:  240240 avC:4.00 avE:2.00 V:0)
> p:06be  6fef00         (6:2)  move y:(r7+n7),r7
>    u:00.00586% c:00.00842% (U:    2114 avC:6.00 avE:2.00 V:0)
> p:08ae  b34a18         (4:2)  add a,b x0,x:(r2)+n2 b,y:(r6)+
>     u:00.01025% c:00.01052% (U:    3696 avC:4.29 avE:2.29 V:2)
>                         ^ ^
>    ^           ^            ^          ^        ^        ^
>                         a b
>    c           d            e          f        g        h
> 
> 
> a: most recent cycle count for this instruction including any EXT: memory
> penalties
> b: most recent EXT: memory penalty for this instruction

I'll keep in profiler disassembly output only things that are about
the whole profiling run.

With DEBUG defined, it could output suspicous things during profiling,
but EXT accesses are "probably" too normal/frequent to warrant that. :-)


> c: percentage of relative use for whole session (addr.count/all_count)
> d: percentage of relative cost for whole session (addr.cycles/all_cycles)
> e: total use for whole session (addr.count)
> f: average cycle count actually encountered at this address, including
> EXT: memory penalties, for whole session (addr.cycles/count)
> g: average EXT: memory penalty encountered at this address, for whole
> session (addr.extmem_cycles/count)
> h: cycle variance (addr.cycle_diff)

Is the last one max difference like the Hatari code in repository shows,
or do you calculate real statistical variance:
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm
?

And which one would be more useful?


> The 'a', 'c', 'e', 'h' fields were already present in some form - the
> other fields I have added. The main reason for the changes are:
> 
> 1) easy identification of relative contribution to total time spent (i.e.
> cost of that address, rather than use). This is the primary 'scan by eye'
> metric for optimization.

I'm not going to report two percentage values, but after both CPU & DSP
cycles values have been proven to be fairly correct, I can change percentage
for both to be about cycles, not about instruction counts.

(I don't want to change just DSP, and currently WinUAE CPU core reports too
unstable cycles values for that to be useful as percentage.)


> 2) EXT: penalties for a given address - how often & to what degree. This
> helps guide placement of code, variables, constants, buffers...

CPU doesn't have this kind of information, but I might consider having
it as one of the additional columns after instruction and cycle counts.

However, I'm not going to add/use averages as that will ruin post-
processing.  It would be e.g. sum of all EXT: access costs
(all_cycles - ext_cycles).


(Maybe there could be another, much simpler tool that post-processes
the profile data and replaces the summed values with calculated
averages.)


> 3) unstable penalties - areas which have variable cost depending on
> context. This effect is annoying as cost can balloon unexpectedly when
> inputs change. It's nice to find these bits of code and kill them - or at
> least keep track of them.
> 
> The ':' prefixes are just there to make it easy to import the results
> into Excel/OpenOffice and do some colouring and sorting/graphs etc. -
> something that isn't essential but I do like colour-highlighting the
> costly/flaky areas :-) These prefixes are probably redundant if you're
> processing the results in other ways, except perhaps to help remind
> which each column is for! The rest of the formatting is mainly about
> keeping the columns lined up. And that's really all there is to it.

In the profiler part in disassembly (after "% ("), post-processor expects
to find just comma separate values.

Main reason why I don't use explanations for the values is that the output
is already really wide/verbose, and I expect profiler columns to be fixed
by the time we do next Hatari release, and not change anymore.

(in post-processor handling the prefixes shouldn't be that much extra code)


> It required some changes around the dsp core and profiler to record the
> extmem_cycles info, but it's not very intrusive.

Could you post your patch for that?


> While I'm on the topic, I also wanted to point out a few of other useful
> things I noticed while working with Hatari as a DSP optimization tool...
> 
> The DSP disassembler has been invaluable in helping me find 'accidental
> long immediates' either used as constants or addresses, which occupy 2
> words instead of the intended 1. This isn't really a profiler thing, but
> it does appear in the same profiler output and it's been very useful in
> the optimization process.
> 
> It became quickly obvious that host port polling "jclr
> #0,x:$ffe9,p:$09a4" operations show a high activity level when the DSP
> is outracing the CPU (i.e. a CPU bottleneck), whereas they show little
> or no activity when the DSP is lagging behind the CPU (DSP bottleneck).
> I haven't really spent much time thinking of ways to automate the
> detection of these cases, but it is very useful information to have when
> targeting code to optimize on both sides of the host port. I can
> essentially use the DSP profiling info to find both kinds of bottleneck
> - with some fiddly opcode search patterns - but otherwise pretty
> straightforward. It would be really nice to have some automatic
> detection of these two cases as a profiler duty, even if it can only
> pick up the trivial cases?

Laurent, is there some code which is run on host port access, which
could have a counter for such accesses?

Profiler could then show e.g. host-port-accesses/instructions percentage
when user enters the debugger when DSP profiling is enabled.


> Note: I haven't really looked at callgraph profiler support for the DSP -
> and I expect these changes will likely have broken it completely

Yes, prefixes in counts would break it (percentage isn't parsed, so
there it's fine) and values should be sum, not averages.

One cannot calculate how much functions take if the per-instruction
values are per-instruction averages.


> However I don't expect the DSP callgraph analysis to be as useful as CPU
> callgraph analysis simply because DSP code is rarely very deep -
> complex/unexpected calling relationships don't really exist. The
> per-instruction information is more valuable with loop sizes and
> penalties incurring the bulk of the unknown cost.

Have you looked at the collected function level values at all?

I think they're quite useful in pointing out code addresses that one
should look closer in the profiler dissasembly output (and what functions
one should check in the callgraph, if that's a complex one).

One can miss things when looking just at the disassembly output as there
can be quite a bit of it. :-)


> On the CPU there is
> greater program complexity, a deeper graph and code size / cache
> relationships which make the callgraph view more valuable in some ways
> than the individual instructions. In any case I will likely keep the
> 'patched' Hatari as an optimization tool and use the standard version
> for everything else, and for access to the callgraph stuff if I need it.


	- Eero



Mail converted by MHonArc 2.6.19+ http://listengine.tuxfamily.org/