Re: [hatari-devel] Re: Profiler

Hi,

This is an interesting conversation so I'll chip in!

> Because of memory needed to store all the profiling info
> (>100MB for 14MB for ST-RAM), I'm going to store only misses.

This seems reasonable to me - for the *instruction* cache.

> (Data goes just into memory sized struct array. It's wasteful,
> but fast and with OS overcommit it should work fine as long
> as one is not going to try to do profiling with 32-bit Hatari
> having lots of TT-RAM configured.)

Also reasonable.

> - If it's a 68030/40/60, you should use I_Cache_miss, I_Cache_hit,
> D_Cache_miss, D_Cache_hit.

> Unused values should remain zero, so it shouldn't be a problem to
> output both instruction & data cache values, right?

Yes - however I have something to suggest about the data cache.

The data cache actually misses most of the time. It's not like the instruction cache. A miss is the more common case even in well-optimized code, unless that code was written specifically with the data cache in mind.

However it is extremely interesting to see where the data cache gets successful hits, simply because it is relatively difficult to arrange.

So, if possible, I would suggest that the i-cache records misses (since it's easy to predict where hits will occur a lot of the time) and the d-cache records hits. This is probably the most useful arrangement if trying to limit what gets recorded.
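
As a rough illustration of that arrangement, here is a minimal Python sketch of a per-address counter that keeps only i-cache misses and d-cache hits. The class and method names are made up for this example and are not Hatari's; the real profiler is C and stores its counters differently.

```python
from collections import Counter


class CacheProfile:
    """Hypothetical recorder: i-cache misses and d-cache hits only."""

    def __init__(self):
        self.icache_misses = Counter()  # keyed by instruction address
        self.dcache_hits = Counter()    # keyed by instruction address

    def record_icache(self, pc, hit):
        # Hits are the predictable case, so only misses are stored.
        if not hit:
            self.icache_misses[pc] += 1

    def record_dcache(self, pc, hit):
        # Misses are the common case, so only hits are stored.
        if hit:
            self.dcache_hits[pc] += 1


profile = CacheProfile()
profile.record_icache(0xE00C9A, hit=False)
profile.record_icache(0xE00C9A, hit=True)
profile.record_dcache(0xE03288, hit=True)
profile.record_dcache(0xE03288, hit=False)
print(dict(profile.icache_misses), dict(profile.dcache_hits))
```

Anything not recorded stays at zero, which fits with outputting both sets of values regardless of CPU type.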

> I added data cache miss counter support to Hatari profiler, but
> I'm not getting any data cache misses *OR* hits for Falcon emulation
> with TOS4. Is TOS4 disabling data cache at boot?

TOS4 should enable it at boot.

On the 68030, a value of $0101 in CACR (bit 0 = instruction cache enable, bit 8 = data cache enable) will enable both caches (IIRC - best double-check that).
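
To check what the emulated CACR actually holds, a small decoder helps. This is a minimal Python sketch; the bit names follow the 68030 CACR layout as I remember it from the user's manual, so verify them there, and `decode_cacr` is just a name invented for this example.

```python
# 68030 CACR bit positions (assumed from the user's manual; double-check).
CACR_BITS = {
    0: "EI",    # enable instruction cache
    1: "FI",    # freeze instruction cache
    2: "CEI",   # clear entry in instruction cache
    3: "CI",    # clear instruction cache
    4: "IBE",   # instruction burst enable
    8: "ED",    # enable data cache
    9: "FD",    # freeze data cache
    10: "CED",  # clear entry in data cache
    11: "CD",   # clear data cache
    12: "DBE",  # data burst enable
    13: "WA",   # write allocate
}


def decode_cacr(value):
    """Return the names of the CACR bits set in value, lowest bit first."""
    return [name for bit, name in sorted(CACR_BITS.items())
            if value & (1 << bit)]


print(decode_cacr(0x0101))  # ['EI', 'ED']
```

If the value seen during the TOS4 boot decodes without ED, the data cache really is off and zero hits and misses would be expected.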

> Also, when looking at the code, I see D_Cache variables being
> updated only in dcache030 functions, not in dcache040 ones?

The 040 d-cache is a bit more complicated than the 030's, since it is not necessarily write-through (it can also run in copyback mode). It's also a lot bigger. I don't know whether that affects the emulation and the counting of things.

> Regarding data cache, it's not fully implemented yet for 68040/60, so
> results are not to be trusted I guess. But 68030 cache should give correct
> values.

Sounds reasonable - it is a bit more complicated.

I'm not sure that there are many machines around (other than the Falcon) which had a 68030 with d-cache but no local fast RAM at the same time. For the Amiga they typically came on cards with local RAM. So there might still be some strange stuff going on in the emulator when measuring d-cache activity. Or maybe it's already perfect :)

> One thing to note about 68030 data cache is that if a long word (32
> bits) must be read, it might be stored in 2 cache entries, depending
> if the address was aligned on 2 or 4 bytes, requiring 2 reads in the
> cache.

There is quite complex behaviour around the d-cache.

As mentioned here, a misaligned read (it doesn't have to be a long - a word will do it as well) can fetch two longwords. It just depends on how many 32-bit longwords the misaligned read happens to touch. But the d-cache fetches and allocates aligned longs only.
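
To make the "how many longwords does it touch" point concrete, here is a minimal Python sketch. `touched_longwords` is a hypothetical helper, not something from Hatari, and it only does the address arithmetic, nothing cache-specific.

```python
def touched_longwords(addr, size):
    """Return the base addresses of the aligned longwords an access covers.

    addr is the byte address, size the access size in bytes (1, 2 or 4).
    """
    first = addr & ~3               # longword holding the first byte
    last = (addr + size - 1) & ~3   # longword holding the last byte
    return list(range(first, last + 4, 4))


print([hex(a) for a in touched_longwords(0x1000, 4)])  # ['0x1000']
print([hex(a) for a in touched_longwords(0x1002, 4)])  # ['0x1000', '0x1004']
print([hex(a) for a in touched_longwords(0x1003, 2)])  # ['0x1000', '0x1004']
```

The last case is the word access that straddles a longword boundary.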

There are two different modes (normal, and write-allocate) which affect this as well.

Writing will invalidate cache entries which overlap with the written data, but not cache it. If write-allocate mode is enabled, the same is true *unless* it is an aligned longword being written - then it gets cached as well as written.

There is a bit more to it than that, but it's fair as a rough description.

>
> So, a read for 32 bits could yield :
> - 1 hit
> - or 1 miss
> - or 1 hit and 1 miss
> - or 2 hits
> - or 2 misses

Yes. You can get complex patterns of hits/misses on 'single' reads depending on size/position/state.
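
The five outcomes listed above fall out of a very simple model. This is a minimal Python sketch that treats the cache as nothing more than a set of valid aligned longword addresses, which is a deliberate simplification (no lines, tags or replacement). `count_read` is a name invented for this example.

```python
def count_read(addr, size, cached):
    """Count (hits, misses) for one read against a toy cache model.

    cached is a set of aligned longword addresses assumed to be valid.
    The set is not modified, so allocation on a miss is not modelled.
    """
    hits = misses = 0
    first = addr & ~3
    last = (addr + size - 1) & ~3
    for base in range(first, last + 4, 4):
        if base in cached:
            hits += 1
        else:
            misses += 1
    return hits, misses


print(count_read(0x1000, 4, {0x1000}))          # (1, 0)
print(count_read(0x1002, 4, {0x1000}))          # (1, 1)
print(count_read(0x1002, 4, set()))             # (0, 2)
```

So a counter that is bumped once per instruction and a counter bumped once per touched longword will give different totals for the same code.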

> I'm seeing more instruction cache misses per instruction,
> up to 6 misses per instruction, just from TOS4 desktop boot,
> and going over desktop menus.

We need to be sure that 'misses' here are actually misses in the cache, and not physical words having to be fetched from the bus. There will be 2x as many fetches as misses in general, because of the Falcon's 16-bit bus.

But assuming it really is referring to cache misses (longs), then it seems like a lot for one instruction: 6 longs is 24 bytes!
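
The arithmetic, as a minimal Python sketch. It assumes every miss fills exactly one longword over a 16-bit bus and ignores burst fills or anything else the emulation may do; `miss_cost` is a made-up name.

```python
LONGWORD_BYTES = 4  # one cache miss fills one aligned longword (assumed)
BUS_BYTES = 2       # Falcon's 16-bit data bus


def miss_cost(misses):
    """Return (bytes fetched, 16-bit bus fetches) for a miss count."""
    bytes_fetched = misses * LONGWORD_BYTES
    bus_fetches = bytes_fetched // BUS_BYTES
    return bytes_fetched, bus_fetches


print(miss_cost(6))  # (24, 12)
```

If the profiler's counter were counting bus fetches rather than cache misses, a reported 6 would correspond to only 3 longword misses.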

IIRC the CPU won't fetch half of an instruction - it will try to complete the fetch, so it can fetch beyond the immediate longword needed. But 6 seems like a lot to me...

> WARNING: 6 CPU instruction cache misses > 5 at 0xe00c9a:
> $00e1c3b8 : 4e73 rte

RTE/RTS might be a special case, since they are flow control operations. They may pull in a lot more when they jump.

> WARNING: 6 CPU instruction cache misses > 5 at 0xe03288:
> $00e1c236 : 4eb9 00e0 946a jsr $e0946a

Again, a flow control instruction (a call this time rather than a return).

> 0.30% (294183, 9415132, 1176836, 0)
> $00e03288 : 48e7 f0f0 movem.l d0-d3/a0-a3,-(sp)
> 0.00% (401, 14708, 324, 0)

> Interestingly, above happens only without MMU. With MMU, maximum
> number of i-cache misses per instruction is 4 for same use-case.

Hmm. The MMU shouldn't affect things. The MMU has its own cache (the ATC), but that is not for instructions - it caches entries from the MMU translation tables. The MMU can inhibit caching, but it should not affect timing unless it has to fetch a table entry, and it should not affect hit/miss counts in the CPU caches.

Sounds wrong to me, but someone else might have an opinion here.