I changed it from min,max to max-min, i.e. the diff. That way it's
much easier to notice when it happens, and the post-processor can
handle the differences as "cache misses".
That seems like a good way to do it.
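A minimal sketch of the bookkeeping described above, assuming the profiler records the cycle count of each execution of an instruction. The class and field names are hypothetical and are not taken from the emulator's source.

```python
class CycleStats:
    """Per-instruction cycle bookkeeping (hypothetical names)."""

    def __init__(self):
        self.count = 0          # how many times the instruction ran
        self.cycles = 0         # total cycles spent in it
        self.min_cycles = None  # cheapest execution seen so far
        self.max_cycles = None  # most expensive execution seen so far

    def record(self, cycles):
        """Account for one execution that took `cycles` cycles."""
        self.count += 1
        self.cycles += cycles
        if self.min_cycles is None or cycles < self.min_cycles:
            self.min_cycles = cycles
        if self.max_cycles is None or cycles > self.max_cycles:
            self.max_cycles = cycles

    @property
    def diff(self):
        """max - min; non-zero means the cost varied between executions."""
        if self.count == 0:
            return 0
        return self.max_cycles - self.min_cycles


stats = CycleStats()
for c in (2, 2, 5, 2):  # made-up cycle counts for illustration
    stats.record(c)
print(stats.count, stats.cycles, stats.diff)  # prints: 4 11 3
```

Reporting only the diff keeps the output to a single number per instruction, and it is zero for every instruction whose cost never varies.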
In the doomino demo, I got such a thing in only one place out
of 1258 instructions:
In well-optimised code it should be rare but it would occur more often in 'support' code which doesn't get the same attention.
....
p:0447 0608a0 (04 cyc) rep #$08 0.38% (960218, 3840872, 0)
p:0448 200032 (02 cyc) asl a 3.04% (7681744, 15363488, 0)
p:0449 0bcc67 (04 cyc) btst #7,a1 0.38% (960218, 3840872, 0)
p:044a 0af0a0 00044f (07 cyc) jcc p:$044f 0.38% (960218, 6721526, 0)
p:044c 45f400 ffff00 (05 cyc) move #$ffff00,x1 0.19% (484216, 2421080, 0)
p:044e 200060 (02 cyc) add x1,a 0.19% (484217, 968439, 3)
p:044f 44ee00 (05 cyc) move x:(r6+n6),x0 0.38% (960219, 4801095, 0)
I think this is suspicious because 'add x1,a' is a trivial instruction which references no memory except its own instruction fetch. Penalties are not possible on that instruction.
They will only be seen on instructions which have 2 or more memory accesses and where 2 or more of them come from external memory....
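To find such lines without reading the whole listing, a post-processor could flag every instruction whose last field (the diff) is non-zero. The following is only a sketch: the regular expression is an assumption based on the listing format quoted above, and the function names are hypothetical.

```python
import re

# Assumed line layout: p:ADDR OPCODE(S) (NN cyc) INSN PCT% (count, cycles, diff)
LINE_RE = re.compile(
    r"^p:(?P<addr>[0-9a-f]{4})\s+.*\((?P<cyc>\d+) cyc\)\s+"
    r"(?P<insn>.*?)\s+(?P<pct>[\d.]+)%\s+"
    r"\((?P<count>\d+),\s*(?P<cycles>\d+),\s*(?P<diff>\d+)\)\s*$"
)


def parse_line(line):
    """Return a dict for one profile line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "addr": int(m.group("addr"), 16),
        "insn": m.group("insn"),
        "count": int(m.group("count")),
        "cycles": int(m.group("cycles")),
        "diff": int(m.group("diff")),
    }


def flag_varying(lines):
    """Return the parsed entries whose cycle diff is non-zero."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and e["diff"] > 0]


# The listing quoted in this thread, used as sample input.
SAMPLE = """\
....
p:0447 0608a0 (04 cyc) rep #$08 0.38% (960218, 3840872, 0)
p:0448 200032 (02 cyc) asl a 3.04% (7681744, 15363488, 0)
p:0449 0bcc67 (04 cyc) btst #7,a1 0.38% (960218, 3840872, 0)
p:044a 0af0a0 00044f (07 cyc) jcc p:$044f 0.38% (960218, 6721526, 0)
p:044c 45f400 ffff00 (05 cyc) move #$ffff00,x1 0.19% (484216, 2421080, 0)
p:044e 200060 (02 cyc) add x1,a 0.19% (484217, 968439, 3)
p:044f 44ee00 (05 cyc) move x:(r6+n6),x0 0.38% (960219, 4801095, 0)
""".splitlines()

flagged = flag_varying(SAMPLE)
for e in flagged:
    print(f"p:{e['addr']:04x} {e['insn']} diff={e['diff']}")
# prints: p:044e add x1,a diff=3
```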
p:044f 44ee00 (05 cyc) move x:(r6+n6),x0 0.38% (960219, 4801095, 0)
For example this one might see a penalty sometimes - since the program address is >$100 and the X: address it is fetching from *might* also be >$200, which would mean competition for the external bus inside a single opcode.
(note: internal P memory is half the size of internal X or Y, hence the $100/$200 boundaries mentioned above - IIRC (?) this is because P: addresses are twice as 'wide' - 2 words per address or 48 bits... 2 fetches per opcode, which is also probably why no operation takes less than 2 osc cycles)
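The rule of thumb above can be written down as a small check. This is a rough sketch only: the $100/$200 boundaries are taken from this post as stated (not verified against the DSP manual), whether an address exactly on the boundary counts as external is an assumption, and the function names are hypothetical.

```python
# Boundaries as stated in this post; treat them as assumptions.
P_INTERNAL_END = 0x100   # P: addresses from here on are assumed external
XY_INTERNAL_END = 0x200  # X:/Y: addresses from here on are assumed external


def is_external(space, addr):
    """True if `addr` in memory space 'P', 'X' or 'Y' is assumed external."""
    limit = P_INTERNAL_END if space == "P" else XY_INTERNAL_END
    return addr >= limit


def may_see_penalty(accesses):
    """accesses: (space, addr) pairs touched by one opcode, including its
    own instruction fetch. A penalty is only considered possible when two
    or more of them go to external memory."""
    return sum(is_external(space, addr) for space, addr in accesses) >= 2


# 'add x1,a' at p:$044e: only its instruction fetch touches memory.
print(may_see_penalty([("P", 0x044E)]))                # prints: False
# 'move x:(r6+n6),x0' at p:$044f; the X: addresses here are made up.
print(may_see_penalty([("P", 0x044F), ("X", 0x0300)]))  # prints: True
print(may_see_penalty([("P", 0x044F), ("X", 0x0040)]))  # prints: False
```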
D.