Cache misses:
28.30% 574172 add_wall_segment (at 0x14b128)
22.99% 466446 build_ssector (at 0x14a6a8)
7.61% 154432 render_wall_1x1 (at 0x14c616)
4.61% 93524 cache_resource (at 0x149c18)
3.46% 70238 flush_visplanes (at 0x14a482)
3.17% 64361 load_real_s_a5_d16_a2 (at 0x14b41a)
2.24% 45445 invisible (at 0x14ac88)
1.94% 39416 nodeincone (at 0x14ad60)
1.89% 38376 get_flat_floor (at 0x14b7de)
1.89% 38347 render_wall (at 0x14c5d4)
1.77% 35818 render_flats_1x1 (at 0x14c128)
1.76% 35712 process_lighting (at 0x14d01a)
1.71% 34652 end_ssector (at 0x14ac9e)
1.49% 30164 get_ssector (at 0x14b704)
1.41% 28546 new_light_level (at 0x14d100)
1.31% 26677 dividing_node (at 0x14ace6)
1.27% 25778 ignore_upper (at 0x14ab3c)
1.20% 24283 finish_tree (at 0x14a5e4)
1.18% 23954 add_lower (at 0x14aa02)
1.10% 22339 add_upper (at 0x14aac0)
'render_wall_1x1' has an intensive inner loop which does not fit in the instruction cache, so I expected it to incur the majority of all cache misses (or at the very least, to be very high on the list). Interesting that it accounts for only about 7.6% of the total - but that's not impossible.
'render_flats_1x1' is a similar function which is at least as intensive (in fact slightly more so), but it does fit in the instruction cache, so misses should be minimal. It comes in at just 1.77%.
This does tie up with the .py analysis, which is good!
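The comparison between the two render functions can be read straight off the listing. Below is a minimal Python sketch of that check. It is not the .py analysis mentioned above: the regular expression and the `parse_listing` helper are my own assumptions about the line format, based only on the lines as printed here.

```python
import re

# A few lines copied verbatim from the "Cache misses" listing.
LISTING = """\
28.30% 574172 add_wall_segment (at 0x14b128)
22.99% 466446 build_ssector (at 0x14a6a8)
7.61% 154432 render_wall_1x1 (at 0x14c616)
1.77% 35818 render_flats_1x1 (at 0x14c128)
"""

# Assumed line format: "<percent>% <count> <symbol> (at 0x<address>)"
LINE_RE = re.compile(
    r"^\s*([\d.]+)%\s+(\d+)\s+(\S+)\s+\(at\s+(0x[0-9a-fA-F]+)\)\s*$"
)


def parse_listing(text):
    """Return {symbol: (percent, misses, address)} for each matching line."""
    rows = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            pct, count, name, addr = m.groups()
            rows[name] = (float(pct), int(count), int(addr, 16))
    return rows


rows = parse_listing(LISTING)
wall_misses = rows["render_wall_1x1"][1]
flats_misses = rows["render_flats_1x1"][1]

# How many times more misses the wall renderer takes than the flat renderer.
ratio = wall_misses / flats_misses
print(f"render_wall_1x1 / render_flats_1x1 = {ratio:.2f}")
```

On these figures the wall renderer takes a little over four times as many misses as the flat renderer, which is the direction expected if only the latter fits in the cache.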
However, I can't say more than this at the moment. It is difficult to tell whether the cache miss information is accurate at a per-instruction level, or even for small groups of instructions, or whether there are erratic values scattered everywhere which just happen to 'even out' in long-running tests. It will take more time with the code and tests to figure this out, and I will try to do that soon.
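One possible way to probe this would be to run the same test twice and compare the miss counts address by address against the summed totals. The sketch below is only an illustration of that idea: the `compare_runs` helper is hypothetical, and the counts and the instruction addresses past the function entry are made up, not measured.

```python
def relative_spread(a, b):
    """Relative difference between two counts; 0.0 when both are zero."""
    top = max(a, b)
    return 0.0 if top == 0 else abs(a - b) / top


def compare_runs(run_a, run_b):
    """Compare per-address miss counts from two runs of the same test.

    Returns (worst per-address spread, spread of the summed totals).
    """
    addresses = set(run_a) | set(run_b)
    worst = max(
        relative_spread(run_a.get(addr, 0), run_b.get(addr, 0))
        for addr in addresses
    )
    total = relative_spread(sum(run_a.values()), sum(run_b.values()))
    return worst, total


# Made-up counts for three instruction addresses, purely for illustration.
run_a = {0x14C616: 120, 0x14C61A: 40, 0x14C61E: 40}
run_b = {0x14C616: 60, 0x14C61A: 70, 0x14C61E: 70}

worst, total = compare_runs(run_a, run_b)
print(f"worst per-address spread: {worst:.2f}, total spread: {total:.2f}")
```

A large per-address spread alongside a small total spread would suggest the 'even out' case; if both are small, the per-instruction figures are more likely to be trustworthy.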
Regardless, it's already turning into a powerful optimisation tool which I did not have access to before on the Falcon, and it *will* be of use. :-)