|Re: [hatari-devel] Re: Suspicious profiler output|
[ Thread Index |
| More lists.tuxfamily.org/hatari-devel Archives
On 09/20/2017 04:55 AM, Miro Kropáček wrote:
I.e. calls to _Atari800_Frame() symbol address are 0.02% of all symbols
called within the profiling run
This number includes all visits to those symbols, not just subroutine
and profiler heuristics were able to track 36% of all the calls
to be originating under _Atari800_Frame().
Inclusive costs are counted only for subroutine calls.
This difference matters (mostly) just for visit counts.
And this is a wrong conclusion. In fact, 99% of all calls should be tracked
Because you've labeled assembly loops, number of aggregated subroutine
calls is much smaller than the number of all the program counter visits
to all the symbols.
* When a a "subroutine call" (JSR/BSR) or "exception" instruction
switches PC to an address listed in symbol table, that is tracked
at run-time as a call in the callgraph
* When "subroutine return" or "exception return" instruction returns
PC to address from which previous subroutine was called from, accrued
run-time costs are transferred to the context of the calling symbol
* only address sites which have a symbol present *when profiling is
active* are tracked like this
This makes sense.
If symbol gets called in any other way, e.g. by jumping, as can
happen with code generated by GCC for static functions, or by stack
manipulation (e.g. EmuTOS does some function calls using RTS!), that
doesn't get recorded in as a call (except EmuTOS thing which I've
special-cased in profiler).
Hmm... so if I have a function "A" which is in assembly and within that I
call function "B" which is in C trough an absolute jump (jmp instruction),
A's costs wont include B's cost and B will be shown as standalone in the
Costs like instruction counts & cycles will be tracked, but depending
on where your symbols are set, you can get confused with visit counts.
And messing with the stack contents to do returns to different place
than where calls happened, like EmuTOS does, would mess up your
callgraph badly. You will see that when the profile ends, typically
by profiler outputting a long list of calls that never returned.
You need to investigate the symbol calls sites in the profiler's
annotated assembly output to find out what's happening...
While _RTI symbol address is visible in the profiler assembly, I didn't
find from the assembly anything that calls that address (although it's
executed). I.e. it's called by some other means, not through something
Hatari tracks as call from the executed assembly instruction.
See above. It's called from another assembly function trough a jump table.
But this still doesn't make sense to me -- if the profiler can't
differentiate it as a separate function, that's fine. But why the total
cost is messed up? If the caller contains any number of jumps, those jumps
should be included in the total cost of the caller then, no?
As I explained, that matters for total visit counts, and aggregated
*subroutine* call counts, not really for other costs.
And as you can see from the assembly, _CPU_rts_handler() address
is *never* executed, only addresses slightly after that, i.e. there
really is nothing that directly calls it.
Well, it's a function pointer. So it's called from a number of places via
It can be correct function, if something had jumped in middle of that
function, or an incorrect function, if you're lacking a symbol for
the correct one.
Note that callgraph shows when function was entered in the middle
of function instead of at the symbol address:
- Symbol name includes the offset
- node has different shape
- line ends are different
-> It's important that your symbol table contains all symbols.
And important to avoid labeling loops. :-)
Only thing suspicious is that by normal statistics I would expect
half of them to be switches *to* interrupt handler, not returns from
one. Based on symbol names, none of them look like interrupt
handlers, and looking at the assembly, your code doesn't have
any RTEs, only TOS code uses those.
Yep, it's 100% user space application.
Fortunately, I didn't get stuck with this problem because I can change
between C and asm backends. And guess what - C backend (doing basically the
same, using same symbol names etc) has 100% correct graph, i.e.
Atari800_Frame() takes really 99% of all CPU cycles, all call counts are
shown correctly etc. So that makes me wonder how a few symbols coming from
an assembly file can mess things up so badly.
C-code rarely has functions which first instruction is loop start,
like in the profile you mailed:
$010d4d9a : movea.l d2,a6 0.01% (22767, 91068, 0, 0)
$010d4d9c : movem.l d0/d2-d7/a2-a3/a5-a6,-(a0) 0.07% (182367, 2188912,
$010d4da0 : movem.l d0/d2-d7/a2-a3/a5-a6,-(a0) 0.07% (182367, 2188860,
$010d4da4 : movem.l d0/d2-d7/a2-a3/a5-a6,-(a0) 0.07% (182367, 2188908,
$010d4da8 : movem.l d0/d2-d7/a2-a3/a5-a6,-(a0) 0.07% (182367, 2188832,
$010d4dac : movem.l d0/d2-d7/a2-a3/a5-a6,-(a0) 0.07% (182367, 2188864,
$010d4db0 : movem.l d0/d2-d7/a2-a3,-(a0) 0.07% (182367, 2189096,
$010d4db4 : subq.l #1,d1 0.07% (182367, 729756,
$010d4db6 : bne.s $10d4d9c 0.07% (182367, 2006388,
$010d4db8 : movem.l (sp)+,d0/d3-d7/a2-a3/a5-a6 0.01% (22767, 364272,
(unless you have e.g. labeled loop as "goto" target.)
I've been wondering whether
vasm couldn't be a culprit here, i.e. producing some a.out symbol names
Anyway -- if you like, I can send you some intermediate object files and
builds to compare the outputs (C vs asm backend),
Unless you think I wasn't able to sufficiently explain what you
saw, I think that's not necessary.
I was able to track down the performance bug, saved me hours of
investigation. Using all those scripts is a bit awkward but the
result is astonishing.
I had lots of free time ~5 years ago when I wrote them, and I was really
missing on Atari all the great tools I've used for profiling Linux SW,
Somebody on Atari-forum commented that it's better than what he's seen
10k $ costing commercial tools for consoles to provide. Doing profiling
in emulator is *much* easier than on real HW though.
On real HW one would need to be concerned about profiling overhead,
memory usage, interrupt disabling etc. And to catch everything like
emulator can, one would need external HW and CPU branch tracing support
(e.g. ARM Embedded Trace Macrocell or Intel Processor Trace).