[hatari-devel] dsp profiler mods



I have made some minor local modifications to the DSP profiler for my own purposes - to help me do some performance diagnostics and guide optimizations.

Note: I am not asking to include such a patch in the main source, but I thought I would explain it anyway in case it's interesting since I am a 'real user' of the profiler. The description/diagram below requires a monospace font to line things up properly ;-)


p:099d  200056         (2:0)  and y0,a                                         u:00.66647% c:00.31896% (U:  240240 avC:2.00 avE:0.00 V:0)
p:099e  5e5cc8         (4:2)  mpy +x0,y1,b a,y:(r4)+                           u:00.66647% c:00.63792% (U:  240240 avC:4.00 avE:2.00 V:0)
p:06be  6fef00         (6:2)  move y:(r7+n7),r7                                u:00.00586% c:00.00842% (U:    2114 avC:6.00 avE:2.00 V:0)
p:08ae  b34a18         (4:2)  add a,b x0,x:(r2)+n2 b,y:(r6)+                   u:00.01025% c:00.01052% (U:    3696 avC:4.29 avE:2.29 V:2)
                        ^ ^                                                    ^           ^            ^          ^        ^        ^
                        a b                                                    c           d            e          f        g        h


a: most recent cycle count for this instruction including any EXT: memory penalties
b: most recent EXT: memory penalty for this instruction

c: percentage of relative use for whole session (addr.count/all_count)
d: percentage of relative cost for whole session (addr.cycles/all_cycles) 
e: total use for whole session (addr.count)
f: average cycle count actually encountered at this address, including EXT: memory penalties, for whole session (addr.cycles/count)
g: average EXT: memory penalty encountered at this address, for whole session (addr.extmem_cycles/count)
h: cycle variance (addr.cycle_diff)
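
As a rough illustration of how fields c, d, f and g relate to the raw counters named above, here is a minimal Python sketch. It is not the Hatari code: the function name and the dictionary layout are made up for the example, and the session totals are back-calculated from the first sample line rather than read from the profiler.

```python
def derived_fields(count, cycles, extmem_cycles, all_count, all_cycles):
    """Compute fields c, d, f, g from raw per-address counters.

    count, cycles and extmem_cycles stand for addr.count, addr.cycles and
    addr.extmem_cycles; all_count and all_cycles are the session totals.
    The function itself is a hypothetical helper for illustration only.
    """
    if count == 0 or all_count == 0 or all_cycles == 0:
        # Nothing was executed here (or nothing at all): avoid dividing by zero.
        return {"use_pct": 0.0, "cost_pct": 0.0, "avg_cycles": 0.0, "avg_ext": 0.0}
    return {
        "use_pct": 100.0 * count / all_count,      # field c
        "cost_pct": 100.0 * cycles / all_cycles,   # field d
        "avg_cycles": cycles / count,              # field f
        "avg_ext": extmem_cycles / count,          # field g
    }


# Session totals are NOT given in the post. They are estimated here from the
# first sample line (U=240240, avC=2.00, u=0.66647%, c=0.31896%).
ALL_COUNT = round(240240 / (0.66647 / 100))
ALL_CYCLES = round(240240 * 2.00 / (0.31896 / 100))

if __name__ == "__main__":
    # Third sample line: U=2114, avC=6.00, avE=2.00
    print(derived_fields(2114, 2114 * 6, 2114 * 2, ALL_COUNT, ALL_CYCLES))
```

With those estimated totals, the third sample line (p:06be) comes out at roughly the u: and c: percentages shown in the listing, which suggests the percentages really are simple ratios of the per-address counters to the session totals.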



The 'a', 'c', 'e', 'h' fields were already present in some form - the other fields I have added. The main reasons for the changes are:

1) easy identification of relative contribution to total time spent (i.e. cost of that address, rather than use). This is the primary 'scan by eye' metric for optimization.
2) EXT: penalties for a given address - how often & to what degree. This helps guide placement of code, variables, constants, buffers...
3) unstable penalties - areas which have variable cost depending on context. This effect is annoying as cost can balloon unexpectedly when inputs change. It's nice to find these bits of code and kill them - or at least keep track of them.
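
For anyone who prefers a script to a spreadsheet, the three checks above can be sketched in Python roughly as follows. This is only an illustration: the regular expression is fitted to the four sample lines in this mail and may need adjusting for other output (two-word instructions, for example), and all function and field names are made up for the example.

```python
import re
from typing import List, NamedTuple, Optional


class ProfLine(NamedTuple):
    addr: int          # program address
    asm: str           # disassembly text
    cycles: int        # field a
    ext: int           # field b
    use_pct: float     # field c
    cost_pct: float    # field d
    count: int         # field e
    avg_cycles: float  # field f
    avg_ext: float     # field g
    variance: int      # field h


# Assumption: pattern derived only from the sample lines in this mail.
LINE_RE = re.compile(
    r"p:(?P<addr>[0-9a-fA-F]+)\s+[0-9a-fA-F]+\s+"
    r"\((?P<cycles>\d+):(?P<ext>\d+)\)\s+"
    r"(?P<asm>.*?)\s+"
    r"u:(?P<use>[\d.]+)%\s+c:(?P<cost>[\d.]+)%\s+"
    r"\(U:\s*(?P<count>\d+)\s+avC:(?P<avc>[\d.]+)\s+"
    r"avE:(?P<ave>[\d.]+)\s+V:(?P<var>\d+)\)"
)


def parse_line(line: str) -> Optional[ProfLine]:
    """Return a ProfLine, or None if the line does not look like profile output."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return ProfLine(
        addr=int(m["addr"], 16), asm=m["asm"],
        cycles=int(m["cycles"]), ext=int(m["ext"]),
        use_pct=float(m["use"]), cost_pct=float(m["cost"]),
        count=int(m["count"]), avg_cycles=float(m["avc"]),
        avg_ext=float(m["ave"]), variance=int(m["var"]),
    )


def by_cost(rows: List[ProfLine]) -> List[ProfLine]:
    """1) Most expensive addresses first (share of total cycles)."""
    return sorted(rows, key=lambda r: r.cost_pct, reverse=True)


def ext_penalised(rows: List[ProfLine]) -> List[ProfLine]:
    """2) Addresses that paid an EXT: memory penalty on average."""
    return [r for r in rows if r.avg_ext > 0]


def unstable(rows: List[ProfLine]) -> List[ProfLine]:
    """3) Addresses whose cycle count varied during the session."""
    return [r for r in rows if r.variance > 0]


SAMPLE = """\
p:099d  200056         (2:0)  and y0,a                        u:00.66647% c:00.31896% (U:  240240 avC:2.00 avE:0.00 V:0)
p:099e  5e5cc8         (4:2)  mpy +x0,y1,b a,y:(r4)+          u:00.66647% c:00.63792% (U:  240240 avC:4.00 avE:2.00 V:0)
p:06be  6fef00         (6:2)  move y:(r7+n7),r7               u:00.00586% c:00.00842% (U:    2114 avC:6.00 avE:2.00 V:0)
p:08ae  b34a18         (4:2)  add a,b x0,x:(r2)+n2 b,y:(r6)+  u:00.01025% c:00.01052% (U:    3696 avC:4.29 avE:2.29 V:2)
"""

if __name__ == "__main__":
    rows = [r for r in map(parse_line, SAMPLE.splitlines()) if r is not None]
    for r in by_cost(rows):
        print(f"p:{r.addr:04x} cost={r.cost_pct:.5f}% avE={r.avg_ext:.2f} V={r.variance}  {r.asm}")
```

On the four sample lines this puts p:099e at the top by cost and flags p:08ae as the only address with a non-zero variance.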

The ':' prefixes are just there to make it easy to import the results into Excel/OpenOffice and do some colouring and sorting/graphs etc. - something that isn't essential, but I do like colour-highlighting the costly/flaky areas :-) These prefixes are probably redundant if you're processing the results in other ways, except perhaps as a reminder of what each column is for! The rest of the formatting is mainly about keeping the columns lined up. And that's really all there is to it.

It required some changes around the dsp core and profiler to record the extmem_cycles info, but it's not very intrusive.


While I'm on the topic, I also wanted to point out a few other useful things I noticed while working with Hatari as a DSP optimization tool...

The DSP disassembler has been invaluable in helping me find 'accidental long immediates' either used as constants or addresses, which occupy 2 words instead of the intended 1. This isn't really a profiler thing, but it does appear in the same profiler output and it's been very useful in the optimization process.

It quickly became obvious that host port polling "jclr #0,x:$ffe9,p:$09a4" operations show a high activity level when the DSP is outracing the CPU (i.e. a CPU bottleneck), whereas they show little or no activity when the DSP is lagging behind the CPU (DSP bottleneck). I haven't really spent much time thinking of ways to automate the detection of these cases, but it is very useful information to have when targeting code to optimize on both sides of the host port. I can essentially use the DSP profiling info to find both kinds of bottleneck - it takes some fiddly opcode search patterns, but is otherwise pretty straightforward. It would be really nice to have some automatic detection of these two cases as a profiler duty, even if it can only pick up the trivial cases.
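
A very rough sketch of what detecting the trivial case might look like as a post-processing step, in Python. It assumes the line layout shown earlier in this mail, takes the host port address x:$ffe9 from the example instruction above, and only counts a jclr/jset that jumps back to its own address (a one-instruction polling loop). The function names are hypothetical, and no threshold for "high activity" is suggested since that depends on the program.

```python
import re
from typing import Iterable, List, Tuple

# Assumptions: the address prefix and the c: column look like the sample
# lines in this mail; the polled register is x:$ffe9 as in the example.
ADDR_RE = re.compile(r"^\s*p:([0-9a-fA-F]+)")
COST_RE = re.compile(r"\bc:([\d.]+)%")
POLL_RE = re.compile(
    r"\bj(?:clr|set)\s+#\d+,x:\$ffe9,p:\$([0-9a-fA-F]+)", re.IGNORECASE
)


def host_poll_hits(lines: Iterable[str]) -> List[Tuple[int, float]]:
    """Return (address, cost %) for each self-targeting host port poll."""
    hits = []
    for line in lines:
        addr = ADDR_RE.search(line)
        cost = COST_RE.search(line)
        poll = POLL_RE.search(line)
        if not (addr and cost and poll):
            continue
        here = int(addr.group(1), 16)
        # Only the trivial case: the branch target is the instruction itself.
        if int(poll.group(1), 16) == here:
            hits.append((here, float(cost.group(1))))
    return hits


def host_poll_cost(lines: Iterable[str]) -> float:
    """Total share of session cycles spent in trivial host port polls."""
    return sum(cost for _, cost in host_poll_hits(lines))
```

A large total would point towards the DSP waiting on the CPU, and a total near zero towards the DSP being the bottleneck. Multi-instruction polling loops would be missed by this check.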

So these are the primary bits of information I look at when figuring out what to do with the DSP code from profiling runs, and they normally require little or no further processing/conversion.



Note: I haven't really looked at callgraph profiler support for the DSP - and I expect these changes will likely have broken it completely. However, I don't expect the DSP callgraph analysis to be as useful as CPU callgraph analysis, simply because DSP code is rarely very deep - complex/unexpected calling relationships don't really exist. The per-instruction information is more valuable, since loop sizes and penalties account for the bulk of the unknown cost. On the CPU there is greater program complexity, a deeper graph and code size / cache relationships, which make the callgraph view more valuable in some ways than the individual instructions. In any case I will likely keep the 'patched' Hatari as an optimization tool and use the standard version for everything else, and for access to the callgraph stuff if I need it.


D.


