Re: [hatari-devel] Linux user-space crashes -> bug in prefetch code when doing bus error for page fault



Hi,

On 20.10.2024 23.46, Nicolas Pomarède wrote:
Following up on this issue: I looked at the problem in more detail in late June, but had no time to fix it the proper way, so I'm adding a temporary fix that should do the job in the meantime (see below).

Thanks, that's great news!

It fixed all the BusyBox issues I had documented, and Linux now even boots 2x faster (in emulated time). :-)

=> updated docs accordingly.


Regarding this issue, by adding the following line to m68000.c:

  changed_prefs.cpu_data_cache = false;

we can force the 68030 data cache to be disabled, even when prefetch mode or CE mode is enabled.

Nowadays there's a "--data-cache off" option for that.

That is still needed to get Linux booting with 040 and 060 emulation.


	- Eero

In that case (data cache disabled) the Linux kernel still crashes, so my conclusion is that the problem is not in the cache but in the prefetch code.

I added some code to compare the prefetched words with the "real" content of RAM, and bingo! It showed that each time the Linux system gave a core dump, there were error messages about a prefetch mismatch (see recent commit 64f88f1548 from 2024/10/18 to enable this at compile time).
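To illustrate the idea of that check, here is a minimal standalone sketch. It is not the code from the commit: the names check_prefetch and ram_read_word, and the flat big-endian RAM array, are assumptions made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: read a big-endian 16-bit word from a flat RAM image. */
static uint16_t ram_read_word(const uint8_t *ram, uint32_t addr)
{
    return (uint16_t)(((unsigned)ram[addr] << 8) | ram[addr + 1]);
}

/*
 * Compare `count` prefetched words against the words really stored in RAM
 * at pc, pc+2, pc+4, ... Each mismatch is logged to stderr.
 * Returns the number of mismatching words (0 means the prefetch is sane).
 */
static int check_prefetch(const uint8_t *ram, uint32_t pc,
                          const uint16_t *prefetch, int count)
{
    int mismatches = 0;

    assert(ram != NULL && prefetch != NULL);

    for (int i = 0; i < count; i++) {
        uint32_t addr = pc + 2u * (uint32_t)i;
        uint16_t expected = ram_read_word(ram, addr);

        if (prefetch[i] != expected) {
            fprintf(stderr,
                    "prefetch mismatch at $%08x: prefetched $%04x, RAM has $%04x\n",
                    (unsigned)addr, (unsigned)prefetch[i], (unsigned)expected);
            mismatches++;
        }
    }
    return mismatches;
}
```

Calling such a check each time the prefetch words are restored is enough to flag words that were taken from the wrong address, since they will generally differ from what RAM holds at PC.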

Looking at the corresponding code in the Linux source and using the symbols shipped with the kernel build posted by Eero, we can see that the core dump follows a bus error, and that this bus error is the result of a page fault when the 68030 MMU "detects" that an address is not present.

In that case a bus error is generated, which calls a specific handler that sees it's a page fault and tries to read the missing data (from disk or swap space). At the end of this bus error handler the RTE has a special behaviour to "replay" the instruction that generated the bus error: now that the page is not faulty anymore, replaying the same instruction will work and the program will go on.

After talking about this with Toni (WinUAE) in June, we came to the conclusion that it was a bug in the way the internal prefetch registers are saved/restored in the case of a bus error.

This is where the emulation has a bug: to replay the instruction it uses the information stored in a "frame b" stack frame generated by the bus error. This stack frame contains the 3 words that should restore the prefetch words, but unfortunately it restores a slightly shifted list of prefetch words (from PC+2 or PC+4 instead of PC in that case). The CPU emulation will then decode wrong instructions from these bad prefetch words, hence the crash.

As the MMU / bus error code is rather complex in that case, I haven't had time to dive deeply into it since June, so I'm adding a temporary fix.

This will force a reload of the 3 prefetch words from RAM in the case of a "frame b" stack frame.
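As a rough illustration of the workaround (not the actual Hatari/WinUAE code), the sketch below models a CPU with 3 prefetch words and ignores the words saved in the stack frame when the frame format is $B, reloading them from RAM at PC instead. The struct cpu_model, read_be16 and rte_restore_prefetch names are hypothetical, and the flat RAM array is an assumption for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_WORDS 3
#define FRAME_FORMAT_B 0xBu  /* 68030 long bus cycle fault stack frame */

/* Hypothetical, heavily simplified CPU state for this example only. */
struct cpu_model {
    uint32_t pc;
    uint16_t prefetch[PREFETCH_WORDS];
};

/* Hypothetical helper: read a big-endian 16-bit word from a flat RAM image. */
static uint16_t read_be16(const uint8_t *ram, uint32_t addr)
{
    return (uint16_t)(((unsigned)ram[addr] << 8) | ram[addr + 1]);
}

/* Refill all prefetch words from RAM at pc, pc+2, pc+4. */
static void reload_prefetch_from_ram(struct cpu_model *cpu, const uint8_t *ram)
{
    for (int i = 0; i < PREFETCH_WORDS; i++)
        cpu->prefetch[i] = read_be16(ram, cpu->pc + 2u * (uint32_t)i);
}

/*
 * Restore the prefetch words when returning from an exception.
 * For a format $B frame the saved words are not trusted (they may be
 * shifted by one or two words), so they are reloaded from RAM instead.
 * For any other format the saved words are copied as they are.
 */
static void rte_restore_prefetch(struct cpu_model *cpu, const uint8_t *ram,
                                 unsigned frame_format, const uint16_t *saved)
{
    assert(cpu != NULL && ram != NULL && saved != NULL);

    if (frame_format == FRAME_FORMAT_B) {
        reload_prefetch_from_ram(cpu, ram);
    } else {
        for (int i = 0; i < PREFETCH_WORDS; i++)
            cpu->prefetch[i] = saved[i];
    }
}
```

The price of this approach is extra memory reads on each such RTE, and it hides the save/restore bug rather than fixing it, which is why it is only meant as a temporary measure.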

With this fix, Linux now boots correctly and there are no more core dumps (at least not where they were before, I didn't spend hours trying everything in this Linux image :) ), and if one leaves the WINUAE_FOR_HATARI_DEBUG_PREFETCH_030 #define enabled, no "printf" error about prefetch mismatch is shown.


For future reference, for when the bus error / stack frame code can be improved and retested, this is the command I run to get the crash:

hatari  --machine tt --tos tos306fr.img --dsp off --fpu 68882 --mmu on -s 14 --ttram 64 --addr24 off -c lilo.cfg --lilo "debug=nfcon root=/dev/sda ro init=/init" --ide-master bb-rootfs.img


Nicolas



