Re: [hatari-devel] Linux user-space crashes -> bug in prefetch code when doing bus error for page fault
Hi,
On 20.10.2024 23.46, Nicolas Pomarède wrote:
Following up on this issue, I looked at the problem in more detail last
June, but I had no time to fix it the proper way, so I'm adding a
temporary fix that should do the job in the meantime (see below).
Thanks, that's great news!
It fixed all BusyBox issues I had documented, and Linux even boots now
2x faster (in emulated time). :-)
=> updated docs accordingly.
Regarding this issue, by adding the following line to m68000.c:
changed_prefs.cpu_data_cache = false;
we can force data_cache on 68030 to be disabled, even when prefetch mode
or CE mode is enabled.
Nowadays there's a "--data-cache off" option for that.
That is still needed to get Linux booting with 040 and 060 emulation.
- Eero
In that case the Linux kernel still crashes,
so my conclusion is that the problem is not in the cache but in the
prefetch code.
I added some code to compare the prefetched words with the "real" content of RAM
and bingo! This showed that each time the Linux system gave a core dump,
we got error messages about a prefetch mismatch (see recent commit
64f88f1548 from 2024/10/18 to enable this at compile time).
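
To illustrate the kind of check described above, here is a minimal sketch in C. It is not the code from the commit: the names read_word_be() and check_prefetch() and the flat "ram" buffer are assumptions made up for this example. It only shows the idea of comparing each prefetched word with what RAM holds at PC, PC+2 and PC+4.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for emulated RAM access: big-endian words, as on the 68030. */
static uint16_t read_word_be(const uint8_t *ram, uint32_t addr)
{
    return (uint16_t)((ram[addr] << 8) | ram[addr + 1]);
}

/*
 * Compare `count` prefetched words against RAM starting at `pc`.
 * Logs every mismatch and returns how many words differ.
 */
static int check_prefetch(const uint8_t *ram, uint32_t pc,
                          const uint16_t *prefetch, int count)
{
    int mismatches = 0;

    for (int i = 0; i < count; i++) {
        uint32_t addr = pc + 2u * (uint32_t)i;
        uint16_t expected = read_word_be(ram, addr);

        if (prefetch[i] != expected) {
            fprintf(stderr,
                    "prefetch mismatch at %08x: got %04x, RAM has %04x\n",
                    (unsigned)addr, (unsigned)prefetch[i], (unsigned)expected);
            mismatches++;
        }
    }
    return mismatches;
}
```

A return value of 0 means the prefetch words agree with RAM; anything else would point at the kind of stale or shifted prefetch discussed in this thread.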
Looking at the corresponding code in the Linux source and using the built
symbols shipped with the kernel posted by Eero, we can see that the core
dump follows a bus error and that this bus error is the result of a
page fault when the 68030 MMU "detects" that an address is not present.
In that case a bus error is generated, which calls a specific handler
that sees it's a page fault and tries to read the missing data (from disk
or swap space). At the end of this bus error handler the RTE has a
special behaviour to "replay" the instruction that generated the bus
error: now that the page is not faulty anymore, replaying the same
instruction will work and the program will go on.
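
The fault / handle / replay sequence can be pictured with a toy model in C. This is only a sketch under loud assumptions: the page table, the handler and all names (toy_mmu_read_byte(), toy_handle_page_fault(), toy_load_with_replay()) are invented for illustration and have nothing to do with the real Linux or Hatari code.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_NUM_PAGES 4u

/* Toy MMU state: one "present" flag and one backing buffer per page. */
static bool toy_page_present[TOY_NUM_PAGES];
static uint8_t toy_page_data[TOY_NUM_PAGES][TOY_PAGE_SIZE];
static unsigned toy_fault_count;

/* Returns false (our "bus error") when the page is not present. */
static bool toy_mmu_read_byte(uint32_t addr, uint8_t *out)
{
    uint32_t page = addr / TOY_PAGE_SIZE;

    if (page >= TOY_NUM_PAGES || !toy_page_present[page])
        return false;
    *out = toy_page_data[page][addr % TOY_PAGE_SIZE];
    return true;
}

/* Toy fault handler: "loads" the page (fills it with a pattern) and maps it. */
static bool toy_handle_page_fault(uint32_t addr)
{
    uint32_t page = addr / TOY_PAGE_SIZE;

    if (page >= TOY_NUM_PAGES)
        return false;               /* genuinely bad address, no recovery */
    for (uint32_t i = 0; i < TOY_PAGE_SIZE; i++)
        toy_page_data[page][i] = (uint8_t)(page + 1);
    toy_page_present[page] = true;
    toy_fault_count++;
    return true;
}

/* Run one "instruction" (a byte load) and replay it once after a fault. */
static bool toy_load_with_replay(uint32_t addr, uint8_t *out)
{
    if (toy_mmu_read_byte(addr, out))
        return true;                /* no fault */
    if (!toy_handle_page_fault(addr))
        return false;               /* would end in a core dump */
    return toy_mmu_read_byte(addr, out);  /* the replay after "RTE" */
}
```

The point of the model is that the replay is only correct if the CPU resumes with exactly the state it had before the fault.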
After talking about this with Toni (WinUAE) in June we came to the
conclusion that it was a bug in the way the internal prefetch registers are
saved/restored in the case of a bus error.
This is where the emulation has a bug: to replay the instruction it uses
the information stored in a "frame b" stack frame generated by the bus
error. This stack frame contains the 3 words that should restore the
prefetch words, but unfortunately it restores a slightly shifted list of
prefetch words (from PC+2 or PC+4 instead of PC in that case). The CPU
emulation will then decode some wrong instructions from these bad
prefetch words, hence the crash.
As the MMU / bus error code is rather complex in that case, I haven't
had time to dive deeply into it since June, so I'm adding a temporary fix.
This will force a reload of the 3 prefetch words from RAM in the case of
a "frame b" stack frame.
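
As a rough sketch of what such a workaround amounts to (not the actual Hatari/WinUAE change), the C example below contrasts restoring the prefetch words saved in the stack frame with refetching them from RAM at the current PC. The struct toy_cpu and all function names are hypothetical and exist only for this example.

```c
#include <stdint.h>

#define TOY_PREFETCH_WORDS 3

/* Hypothetical CPU state; the field names are illustrative only. */
struct toy_cpu {
    uint32_t pc;
    uint16_t prefetch[TOY_PREFETCH_WORDS];
};

/* Read one big-endian word from a flat RAM buffer. */
static uint16_t frame_ram_word_be(const uint8_t *ram, uint32_t addr)
{
    return (uint16_t)((ram[addr] << 8) | ram[addr + 1]);
}

/* Restore path: trust the words saved in the stack frame, which may be shifted. */
static void restore_prefetch_from_frame(struct toy_cpu *cpu,
                                        const uint16_t *saved)
{
    for (int i = 0; i < TOY_PREFETCH_WORDS; i++)
        cpu->prefetch[i] = saved[i];
}

/* Workaround: ignore the saved words and refetch PC, PC+2, PC+4 from RAM. */
static void reload_prefetch_from_ram(struct toy_cpu *cpu, const uint8_t *ram)
{
    for (int i = 0; i < TOY_PREFETCH_WORDS; i++)
        cpu->prefetch[i] = frame_ram_word_be(ram, cpu->pc + 2u * (uint32_t)i);
}
```

Refetching costs a few extra memory reads on each such return from exception, but it makes the prefetch words consistent with RAM at PC regardless of what the frame contained.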
With this fix, Linux now boots correctly and there are no more core dumps
(at least not where they were before; I didn't spend hours trying
everything in this Linux image :) ), and if one leaves the
WINUAE_FOR_HATARI_DEBUG_PREFETCH_030 #define enabled, no
"printf" error about a prefetch mismatch shows up.
For future reference, when the bus error / stack frame handling can be improved and
retested, this is the command I ran to get the crash:
hatari --machine tt --tos tos306fr.img --dsp off --fpu 68882 --mmu on
-s 14 --ttram 64 --addr24 off -c lilo.cfg --lilo "debug=nfcon
root=/dev/sda ro init=/init" --ide-master bb-rootfs.img
Nicolas