Re: [hatari-devel] Character conversion for filenames in GEMDOS HD emulation

Hi Thomas, hi Eero,

I've addressed your feedback. Here is the updated patch (this time
based on the current hg version of Hatari):

https://gist.github.com/604f079206d7e5986d26

It can be applied with: patch -p1 <../hatari-hg.gemdos.patch

> Max, I now had a closer look at your patch, and I think it's basically
> a good approach, but there are some things that I'd like to discuss:
>
> 1) I really dislike this part in gemdos.c:
> #ifdef WIN32
>  Str_AtariToWindows(pszFileName, pszFileNameHost, INVALID_CHAR);
> #else
>  Str_AtariToUtf8(pszFileName, pszFileNameHost);
> #endif
> In the end, there is no need to export both functions to other files,
> so I think it would be better to have a "Str_AtariToHost" and a
> "Str_HostToAtari" where the implementation in str.c is taking care of
> the differences instead.

This is now Str_AtariToHost() and Str_HostToAtari().
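
In essence the dispatch now lives in str.c and looks about like this
(a simplified sketch, not the literal patch code; the helper names
and the exact signature are illustrative):

  void Str_AtariToHost(const char *source, char *dest, int destLen,
                       char replacementChar)
  {
  #ifdef WIN32
      /* let the C library convert via the current locale */
      Str_AtariToLocale(source, dest, destLen, replacementChar);
  #else
      /* use the built-in Atari <-> UTF-8 conversion tables */
      Str_AtariToUtf8(source, dest, destLen);
  #endif
  }

Str_HostToAtari() is the mirror image of this.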

> 2) The extra step with mapWindowsToUnicode looks cumbersome ... why
> don't you add a proper mapAtariToWindows table directly instead?

The Windows-specific mapping tables are no longer needed; I have
removed them. The conversion on the Windows platform now uses the
standard C library functions mbtowc() and wctomb(), which convert
between wide characters (Unicode code points) and the charset of the
current locale, so the OS does the work. This works, but it does a
little more than you would expect: on Windows, some Greek letters are
also converted to similar-looking Latin letters. I think this does
not hurt; it seems to be a "feature" of these functions.
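
For illustration, the locale-based path boils down to this (a minimal
sketch; the real code converts whole strings and checks buffer
lengths):

  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      wchar_t wc = 0;
      char mb[8];
      int len;

      /* pick up the charset from the environment (LC_ALL etc.) */
      setlocale(LC_ALL, "");

      /* Unicode -> host: encode U+00E4 (a-umlaut) in the locale's
       * charset; wctomb() returns -1 if it cannot be represented,
       * which is where the replacement character comes in */
      len = wctomb(mb, 0x00E4);

      /* host -> Unicode: decode it back into a code point */
      if (len > 0 && mbtowc(&wc, mb, len) > 0)
          printf("round-tripped to U+%04X\n", (unsigned)wc);

      return 0;
  }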

> 3) Str_AtariToUtf8 can create a destination string that is "longer" than
> the source, since UTF8 characters can take multiple bytes, right? There
> seems to be at least one hunk in your patch where you don't take this
> into account so the destination buffer could overflow.

This has been fixed. There is now always a length parameter.
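
A typical call site in gemdos.c then looks like this (sketch;
pszFileName, pszFileNameHost and INVALID_CHAR as in the snippet
quoted above, FILENAME_MAX from <stdio.h> just for illustration):

  char pszFileNameHost[FILENAME_MAX];

  /* each Atari character can grow to up to three UTF-8 bytes, so
   * the destination size is passed explicitly */
  Str_AtariToHost(pszFileName, pszFileNameHost,
                  sizeof(pszFileNameHost), INVALID_CHAR);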

> 4) What if the (Linux) host system does not use a UTF-8 locale? I think
> there might still be some people around who use some latin-x locale
> instead.

If you define the macro USE_LOCALE_CHARSET, the locale-based
conversion is forced (by default it is only used under Windows). It
should work on Linux for e.g. latin-x locales, provided they are
installed. So I haven't used iconv, since mbtowc() and wctomb()
already provide the required functionality.

For Linux with UTF-8 there are two options: the special
Atari<->UTF-8 conversion functions (the default) or the locale-based
functions. In general I would prefer the special UTF-8 functions, as
they work without any dependency on installed locales. In my view
UTF-8 is well suited as the internal Unicode data format anyway.
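
As a sketch, this direction needs nothing more than a 256-entry
mapping table plus the usual UTF-8 bit-shuffling (table name
illustrative; all Atari ST characters map into the BMP, so at most
three output bytes per character):

  extern const unsigned int atariToUnicode[256];

  static int AtariCharToUtf8(unsigned char c, char *dest)
  {
      unsigned int cp = atariToUnicode[c];

      if (cp < 0x80)        /* plain ASCII: one byte */
      {
          dest[0] = (char)cp;
          return 1;
      }
      if (cp < 0x800)       /* two-byte sequence */
      {
          dest[0] = (char)(0xC0 | (cp >> 6));
          dest[1] = (char)(0x80 | (cp & 0x3F));
          return 2;
      }
      /* three-byte sequence */
      dest[0] = (char)(0xE0 | (cp >> 12));
      dest[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
      dest[2] = (char)(0x80 | (cp & 0x3F));
      return 3;
  }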

I have read that the OSX filesystem returns filenames in a
normalized UTF-8 form where a base character and its diacritical
mark are stored as separate code points (decomposed normalization
form, NFD), while Linux and most other systems use precomposed
characters (NFC). The Atari ST charset and Latin-1 also contain many
precomposed characters. I have therefore added support for
converting the decomposed representation of OSX into its precomposed
equivalent, but this only works if USE_LOCALE_CHARSET is not
defined. (I haven't tested it on OSX yet, though.)
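
The composing step itself is a simple pair lookup, roughly like this
(a sketch with two sample entries; the real table covers all pairs
that have a precomposed equivalent in the Atari charset):

  #include <stddef.h>

  struct composition { unsigned int base, mark, composed; };

  static const struct composition compositions[] =
  {
      { 0x0061, 0x0308, 0x00E4 },  /* a + combining diaeresis -> U+00E4 */
      { 0x0065, 0x0301, 0x00E9 },  /* e + combining acute -> U+00E9 */
  };

  /* return the precomposed code point for base+mark, or 0 if the
   * pair has no precomposed form and is passed through unchanged */
  static unsigned int Compose(unsigned int base, unsigned int mark)
  {
      size_t i;

      for (i = 0; i < sizeof(compositions) / sizeof(compositions[0]); i++)
      {
          if (compositions[i].base == base && compositions[i].mark == mark)
              return compositions[i].composed;
      }
      return 0;
  }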

I haven't implemented any configuration yet. For modern Linux, OSX
and Windows no configuration should be necessary. Only if you also
want to support older Linux versions that use a locale other than
UTF-8 would a configuration option on the Hatari command line and/or
in the GUI be needed. In that case the USE_LOCALE_CHARSET
preprocessor macro can simply be replaced by a boolean configuration
variable. The locale to use is picked up from the environment
(LC_ALL etc.) by a setlocale(LC_ALL, "") call.

Max


