Re: [hatari-devel] Character conversion for filenames in GEMDOS HD emulation

Hi Thomas, hi Eero,

I've addressed your feedback. Here is the updated patch (this time
based on the current hg version of Hatari):

https://gist.github.com/604f079206d7e5986d26

It can be applied with: patch -p1 <../hatari-hg.gemdos.patch

> Max, I now had a closer look at your patch, and I think it's basically
> a good approach, but there are some things that I'd like to discuss:
>
> 1) I really dislike this part in gemdos.c:
> #ifdef WIN32
>  Str_AtariToWindows(pszFileName, pszFileNameHost, INVALID_CHAR);
> #else
>  Str_AtariToUtf8(pszFileName, pszFileNameHost);
> #endif
> In the end, there is no need to export both functions to other files,
> so I think it would be better to have a "Str_AtariToHost" and a
> "Str_HostToAtari" where the implementation in str.c is taking care of
> the differences instead.

This is now Str_AtariToHost() and Str_HostToAtari().
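
In essence the dispatch now lives in str.c and looks about like this
(a simplified sketch, not the literal patch code; the helper names
and the exact signature are illustrative):

  void Str_AtariToHost(const char *source, char *dest, int destLen,
                       char replacementChar)
  {
  #ifdef WIN32
      /* let the C library convert via the current locale */
      Str_AtariToLocale(source, dest, destLen, replacementChar);
  #else
      /* use the built-in Atari <-> UTF-8 conversion tables */
      Str_AtariToUtf8(source, dest, destLen);
  #endif
  }

Str_HostToAtari() is the mirror image of this.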

> 2) The extra step with mapWindowsToUnicode looks cumbersome ... why
> don't you add a proper mapAtariToWindows table directly instead?

The Windows-specific mapping tables are no longer needed; I have
removed them. The conversion on the Windows platform now uses the
standard C library functions mbtowc() and wctomb(), which convert
between wide characters (Unicode code points) and the charset of the
current locale, so the OS does the work. This works, but it does a
little more than you would expect: on Windows, some Greek letters are
also converted to similar-looking Latin letters. I think this does
not hurt; it seems to be a "feature" of these functions.
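
For illustration, the locale-based path boils down to this (a minimal
sketch; the real code converts whole strings and checks buffer
lengths):

  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      wchar_t wc = 0;
      char mb[8];
      int len;

      /* pick up the charset from the environment (LC_ALL etc.) */
      setlocale(LC_ALL, "");

      /* Unicode -> host: encode U+00E4 (a-umlaut) in the locale's
       * charset; wctomb() returns -1 if it cannot be represented,
       * which is where the replacement character comes in */
      len = wctomb(mb, 0x00E4);

      /* host -> Unicode: decode it back into a code point */
      if (len > 0 && mbtowc(&wc, mb, len) > 0)
          printf("round-tripped to U+%04X\n", (unsigned)wc);

      return 0;
  }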

> 3) Str_AtariToUtf8 can create a destination string that is "longer" than
> the source, since UTF8 characters can take multiple bytes, right? There
> seems to be at least one hunk in your patch where you don't take this
> into account so the destination buffer could overflow.

This has been fixed. There is now always a length parameter.
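
A typical call site in gemdos.c then looks like this (sketch;
pszFileName, pszFileNameHost and INVALID_CHAR as in the snippet
quoted above, FILENAME_MAX from <stdio.h> just for illustration):

  char pszFileNameHost[FILENAME_MAX];

  /* each Atari character can grow to up to three UTF-8 bytes, so
   * the destination size is passed explicitly */
  Str_AtariToHost(pszFileName, pszFileNameHost,
                  sizeof(pszFileNameHost), INVALID_CHAR);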

> 4) What if the (Linux) host system does not use a UTF-8 locale? I think
> there might still be some people around who use some latin-x locale
> instead.

If you define the macro USE_LOCALE_CHARSET, the locale-based
conversion is forced (by default it is only used under Windows). It
should work on Linux for e.g. latin-x locales, provided they are
installed. So I haven't used iconv, since mbtowc() and wctomb()
already provide the required functionality.

For Linux with UTF-8 there are two options: the special
Atari<->UTF-8 conversion functions (the default) or the locale-based
functions. In general I would prefer the special UTF-8 functions, as
they work without any dependency on installed locales. In my view
UTF-8 is well suited as the internal Unicode data format anyway.
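
As a sketch, this direction needs nothing more than a 256-entry
mapping table plus the usual UTF-8 bit-shuffling (table name
illustrative; all Atari ST characters map into the BMP, so at most
three output bytes per character):

  extern const unsigned int atariToUnicode[256];

  static int AtariCharToUtf8(unsigned char c, char *dest)
  {
      unsigned int cp = atariToUnicode[c];

      if (cp < 0x80)        /* plain ASCII: one byte */
      {
          dest[0] = (char)cp;
          return 1;
      }
      if (cp < 0x800)       /* two-byte sequence */
      {
          dest[0] = (char)(0xC0 | (cp >> 6));
          dest[1] = (char)(0x80 | (cp & 0x3F));
          return 2;
      }
      /* three-byte sequence */
      dest[0] = (char)(0xE0 | (cp >> 12));
      dest[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
      dest[2] = (char)(0x80 | (cp & 0x3F));
      return 3;
  }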

I have read that the OSX filesystem returns filenames in a
normalized UTF-8 form where a base character and its diacritical
mark are stored as separate code points (decomposed normalization
form, NFD), while Linux and most other systems use precomposed
characters (NFC). The Atari ST charset and Latin-1 also contain many
precomposed characters. I have therefore added support for
converting the decomposed representation of OSX into its precomposed
equivalent, but this only works if USE_LOCALE_CHARSET is not
defined. (I haven't tested it on OSX yet, though.)
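
The composing step itself is a simple pair lookup, roughly like this
(a sketch with two sample entries; the real table covers all pairs
that have a precomposed equivalent in the Atari charset):

  #include <stddef.h>

  struct composition { unsigned int base, mark, composed; };

  static const struct composition compositions[] =
  {
      { 0x0061, 0x0308, 0x00E4 },  /* a + combining diaeresis -> U+00E4 */
      { 0x0065, 0x0301, 0x00E9 },  /* e + combining acute -> U+00E9 */
  };

  /* return the precomposed code point for base+mark, or 0 if the
   * pair has no precomposed form and is passed through unchanged */
  static unsigned int Compose(unsigned int base, unsigned int mark)
  {
      size_t i;

      for (i = 0; i < sizeof(compositions) / sizeof(compositions[0]); i++)
      {
          if (compositions[i].base == base && compositions[i].mark == mark)
              return compositions[i].composed;
      }
      return 0;
  }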

I haven't implemented any configuration yet. For modern Linux, OSX
and Windows no configuration should be necessary. Only if you also
want to support older Linux versions that use a locale other than
UTF-8 would a configuration option on the Hatari command line and/or
in the GUI be needed. In that case the USE_LOCALE_CHARSET
preprocessor macro can simply be replaced by a boolean configuration
variable. The locale to use is picked up from the environment
(LC_ALL etc.) by a setlocale(LC_ALL, "") call.

Max


