Re: [hatari-devel] Character conversion for filenames in GEMDOS HD emulation



Hi,

On Thursday 17 July 2014, Max Böhm wrote:
> > Max, I now had a closer look at your patch, and I think it's basically
> > a good approach, but there are some things that I'd like to discuss:
> > 
> > 1) I really dislike this part in gemdos.c:
> > #ifdef WIN32
> > Str_AtariToWindows(pszFileName, pszFileNameHost, INVALID_CHAR);
> > #else
> > Str_AtariToUtf8(pszFileName, pszFileNameHost);
> > #endif
> > In the end, there is no need to export both functions to other files,
> > so I think it would be better to have a "Str_AtariToHost" and a
> > "Str_HostToAtari" where the implementation in str.c is taking care of
> > the differences instead.
> > 
> > 2) The extra step with mapWindowsToUnicode looks cumbersome ... why
> > don't you add a proper mapAtariToWindows table directly instead?
> > 
> > 3) Str_AtariToUtf8 can create a destination string that is "longer"
> > than the source, since UTF8 characters can take multiple bytes, right?
> > There seems to be at least one hunk in your patch where you don't take
> > this into account so the destination buffer could overflow.
> > 
> > 4) What if the (Linux) host system does not use a UTF-8 locale? I think
> > there might still be some people around who use some latin-x locale
> > instead.
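
To illustrate what I meant in point 1, the wrapper could look
something like this (just a sketch, not tested; the
Str_AtariToWindows()/Str_AtariToUtf8() signatures are the ones from
your patch, and the destLen parameter is my addition, to cover the
overflow concern from point 3):

	void Str_AtariToHost(const char *source, char *dest,
	                     int destLen, char replacementChar)
	{
	#ifdef WIN32
		/* replacementChar substitutes unmappable characters */
		Str_AtariToWindows(source, dest, replacementChar);
	#else
		/* UTF-8 output can be longer than its input, so this
		 * variant especially would need to honor destLen */
		Str_AtariToUtf8(source, dest);
	#endif
	}

That way gemdos.c and the other callers need to know only about
Str_AtariToHost() / Str_HostToAtari().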

File name encoding doesn't actually come from the system, but from
the file system.  You may mount disks (memory cards, CDs etc) whose
file names use a different encoding than your system does.  Whether
file names show up correctly depends on whether they match your
system locale's charmap, or on whether you gave the correct file
name encoding in the mount options when mounting manually (so that
the kernel does the conversion correctly).

Disk automounting on Linux distros nowadays defaults to UTF-8, but
that doesn't mean that the strings you get are valid UTF-8.

Anything related to locale is a rat's nest.


> Thanks for your comments, which make sense to me.  I'm currently
> travelling.  I'll provide an updated patch when I'm back on Wednesday
> next week.
> 
> The reason for using Unicode tables as the source and deriving the
> Windows <-> Atari mapping from them was that mapping tables to
> Unicode exist for basically all character sets.  This makes it
> easier to add support for a new character set.  Internally, a
> Windows <-> Atari table is created with lazy initialization when
> the mapping is first used.
> 
> Currently only the utf8 and cp1252 character encodings are implemented.
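
For others following along: I'd guess the lazy derivation looks
roughly like this (mapWindowsToUnicode is from the patch, the other
names are my guesses, just to show the shape of it):

	/* derive a direct Atari -> Windows table from the two
	 * 8-bit -> Unicode tables, the first time it's needed */
	extern const unsigned short mapAtariToUnicode[256];
	extern const unsigned short mapWindowsToUnicode[256];

	static unsigned char mapAtariToWindows[256];
	static int mapInitialized;

	static void initAtariToWindows(unsigned char invalidChar)
	{
		int a, w;
		for (a = 0; a < 256; a++) {
			/* default when no code point matches */
			mapAtariToWindows[a] = invalidChar;
			for (w = 0; w < 256; w++) {
				if (mapWindowsToUnicode[w] == mapAtariToUnicode[a]) {
					mapAtariToWindows[a] = (unsigned char)w;
					break;
				}
			}
		}
		mapInitialized = 1;
	}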

Here's how the locale's character set is detected on Linux:
http://stackoverflow.com/questions/1492918/how-do-you-get-what-kind-of-encoding-your-system-uses-in-c-c
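
In short, that boils down to setlocale() + nl_langinfo():

	/* prints e.g. "UTF-8" or "ISO-8859-15"; setlocale() has to
	 * be called first, otherwise you get the "C" locale's value */
	#include <stdio.h>
	#include <locale.h>
	#include <langinfo.h>

	int main(void)
	{
		setlocale(LC_ALL, "");
		printf("%s\n", nl_langinfo(CODESET));
		return 0;
	}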

Something similar would be needed also for Windows.

Which Windows version and locale has your change been tested with?

I rather doubt that modern Windows versions are stuck with fixed-size
8-bit (cp1252 or other) encodings for file names.  They might use
UCS-2/UTF-16 or UTF-8...?


> I
> can add support for reading a mapping table from a file and/or add
> tables for additional character sets.

Either you need a fallback for unrecognized locales, or
you need to get all potentially relevant mappings automatically,
at build time or at run time.

At build time one can use e.g. the "recode" utility.  It knows the
AtariST charset and just about any other charset.  Conversion can
be done like this:
	recode -s ISO-8859-15..AtariST < charmap.txt > iso8859-15-to-atarist.txt

(-s "strict" option removes code points which don't have a mapping.)


At run time, iconv() is the nicer alternative: it's already part
of glibc, so it doesn't add a library dependency, and the API is
much cleaner.  You give the from/to encodings to iconv_open() and
the input & output buffers to the iconv() calls.
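
Roughly like this ("ATARIST" below is just my guess at the encoding
name, see the caveat that follows):

	#include <string.h>
	#include <iconv.h>

	/* convert an Atari file name to UTF-8, returns 0 on success */
	static int atari_to_utf8(const char *in, char *out, size_t outsize)
	{
		char *inp = (char *)in, *outp = out;
		size_t inleft = strlen(in), outleft = outsize - 1;
		iconv_t cd = iconv_open("UTF-8", "ATARIST"); /* to, from */

		if (cd == (iconv_t)-1)
			return -1;  /* encoding pair not supported */
		if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
			iconv_close(cd);
			return -1;  /* bad input or output buffer too small */
		}
		*outp = '\0';
		iconv_close(cd);
		return 0;
	}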

The issue with iconv() is that at least a quick listing of the
encodings it supports didn't seem to include an Atari encoding;
hopefully I'm wrong.  And I'm not sure what one would use on
Windows.


	- Eero


