Re: [hatari-devel] Character conversion for filenames in GEMDOS HD emulation



Hi,

What if there were two modes for filenames?

1) The way it is done currently.
and
2) Let TOS do what it wants (damn the host).

Windows should be OK because TOS is DOS-like.
Linux should be OK because it is more permissive than TOS.


----- Eero Tamminen wrote:
> Hi,
> 
> On Thursday 17 July 2014, Max Böhm wrote:
> > > Max, I now had a closer look at your patch, and I think it's basically
> > > a good approach, but there are some things that I'd like to discuss:
> > > 
> > > 1) I really dislike this part in gemdos.c:
> > > #ifdef WIN32
> > > Str_AtariToWindows(pszFileName, pszFileNameHost, INVALID_CHAR);
> > > #else
> > > Str_AtariToUtf8(pszFileName, pszFileNameHost);
> > > #endif
> > > In the end, there is no need to export both functions to other files,
> > > so I think it would be better to have a "Str_AtariToHost" and a
> > > "Str_HostToAtari" where the implementation in str.c is taking care of
> > > the differences instead.
> > > 
> > > 2) The extra step with mapWindowsToUnicode looks cumbersome ... why
> > > don't you add a proper mapAtariToWindows table directly instead?
> > > 
> > > 3) Str_AtariToUtf8 can create a destination string that is "longer"
> > > than the source, since UTF8 characters can take multiple bytes, right?
> > > There seems to be at least one hunk in your patch where you don't take
> > > this into account so the destination buffer could overflow.
> > > 
> > > 4) What if the (Linux) host system does not use a UTF-8 locale? I think
> > > there might still be some people around who use some latin-x locale
> > > instead.
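
To illustrate point 3 above, here is a minimal sketch of a bounded Atari-to-UTF-8 conversion. The names utf8_put() and atari_to_utf8() and the caller-supplied high_map table are hypothetical and not taken from the patch. It assumes every Atari character maps to a code point in the Basic Multilingual Plane, so one source byte can grow to at most three UTF-8 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one BMP code point as UTF-8.
 * Returns the number of bytes written, or 0 if it does not fit in 'room'. */
static size_t utf8_put(uint16_t cp, char *dst, size_t room)
{
    if (cp < 0x80) {
        if (room < 1) return 0;
        dst[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        if (room < 2) return 0;
        dst[0] = (char)(0xC0 | (cp >> 6));
        dst[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (room < 3) return 0;
    dst[0] = (char)(0xE0 | (cp >> 12));
    dst[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    dst[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Hypothetical helper: convert a NUL-terminated Atari string to UTF-8.
 * high_map[] holds the Unicode code points for Atari bytes 0x80..0xFF
 * and must be supplied by the caller (no table is assumed here).
 * Returns false, leaving a truncated but terminated string, when the
 * destination buffer is too small. */
static bool atari_to_utf8(const char *src, char *dst, size_t dst_size,
                          const uint16_t high_map[128])
{
    size_t used = 0;

    assert(dst != NULL && dst_size > 0);
    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        uint16_t cp = (c < 0x80) ? c : high_map[c - 0x80];
        /* Always keep one byte in reserve for the terminating NUL. */
        size_t n = utf8_put(cp, dst + used, dst_size - 1 - used);
        if (n == 0) {
            dst[used] = '\0';
            return false;
        }
        used += n;
    }
    dst[used] = '\0';
    return true;
}
```

Under that assumption, a destination buffer of 3 * strlen(src) + 1 bytes is always large enough.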
> 
> File name encoding doesn't actually come from the system, but from
> the file system.  You may mount disks (memory cards, CDs etc) that
> have different file name encodings than your system.  Whether file
> names show up correctly depends on whether they match your system
> locale charmap, or whether you gave correct file name encoding on
> the mount options when you do the mounting manually (so that kernel
> does the conversion correctly).
> 
> Linux distros' disk automounts nowadays default to UTF-8, but that
> doesn't mean that the strings you get are valid UTF-8.
> 
> Anything related to locale is a rat's nest.
> 
> 
> > Thanks for your comments, which make sense to me. I'm currently travelling.
> > I'll provide an updated patch when I'm back on Wednesday next week.
> > 
> > The reason for using Unicode tables as the source and deriving the
> > Windows <-> Atari mapping from them was that mapping tables for Unicode
> > exist for basically all character sets. This makes it easier to add
> > support for a new character set. Internally a Windows <-> Atari table is
> > created using lazy initialization when the mapping is first used.
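
The lazy derivation described above could look roughly like the following sketch. It is not the code from the patch: CharMap, charmap_build(), charmap_atari_to_host() and FALLBACK_CHAR are hypothetical names, and both Unicode tables are assumed to be supplied by the caller, with 0 marking an unmapped entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FALLBACK_CHAR '_'   /* hypothetical replacement for unmappable characters */

typedef struct {
    const uint16_t *atari_high;        /* code points for Atari bytes 0x80..0xFF */
    const uint16_t *host_high;         /* code points for host bytes 0x80..0xFF */
    unsigned char atari_to_host[256];  /* derived table, filled on first use */
    bool ready;
} CharMap;

/* Derive the direct Atari -> host byte table from the two Unicode tables. */
static void charmap_build(CharMap *m)
{
    assert(m != NULL && m->atari_high != NULL && m->host_high != NULL);

    /* ASCII maps to itself; everything else starts out as the fallback. */
    for (int c = 0; c < 256; c++)
        m->atari_to_host[c] = (c < 0x80) ? (unsigned char)c : FALLBACK_CHAR;

    for (int a = 0; a < 128; a++) {
        uint16_t cp = m->atari_high[a];
        if (cp == 0)
            continue;   /* unmapped Atari character keeps the fallback */
        for (int h = 0; h < 128; h++) {
            if (m->host_high[h] == cp) {
                m->atari_to_host[0x80 + a] = (unsigned char)(0x80 + h);
                break;
            }
        }
    }
    m->ready = true;
}

/* Look up one character, building the derived table on first use. */
static unsigned char charmap_atari_to_host(CharMap *m, unsigned char c)
{
    if (!m->ready)
        charmap_build(m);
    return m->atari_to_host[c];
}
```

Adding another 8-bit host character set then only requires one more Unicode table; the direct byte-to-byte table is derived from it on first use.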
> > 
> > Currently only the utf8 and cp1252 character encodings are implemented.
> 
> Here's how locale's character set is detected on Linux:
> http://stackoverflow.com/questions/1492918/how-do-you-get-what-kind-of-encoding-your-system-uses-in-c-c
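
For reference, a minimal sketch of locale charset detection on a POSIX system, assuming the standard setlocale() and nl_langinfo(CODESET) calls. The helper names host_codeset() and host_is_utf8() are hypothetical. As noted earlier in the thread, this only reports the locale's charmap, not the encoding actually used by a given file system.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Return the character set name of the user's locale, for example
 * "UTF-8" or "ISO-8859-15". The exact spelling is platform-dependent. */
static const char *host_codeset(void)
{
    /* Adopt the locale from the environment (LANG, LC_ALL, LC_CTYPE).
     * If this fails, the program stays in the default "C" locale. */
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}

/* Returns 1 if the locale charset looks like UTF-8, 0 otherwise. */
static int host_is_utf8(void)
{
    const char *cs = host_codeset();

    assert(cs != NULL);
    return strcmp(cs, "UTF-8") == 0 || strcmp(cs, "utf8") == 0;
}
```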
> 
> Something similar would be needed also for Windows.
> 
> With which Windows version and locale was your change tested?
> 
> I kind of doubt that modern Windows versions are stuck with fixed-size
> 8-bit (cp1252 or other) encodings for file names.  They might use
> UCS-2/UTF-16 or UTF-8...?
> 
> 
> > I
> > can add support to read a mapping table from a file and/or add tables
> > for additional character sets.
> 
> Either you need fallback for unrecognized locales, or
> you need to get all possibly relevant mappings automatically,
> at build- or run-time.
> 
> Build-time can use e.g. the "recode" utility.  It knows the AtariST
> charset and just about any other charset.  Conversion can be done
> like this:
> 	recode -s ISO-8859-15..AtariST < charmap.txt > iso8859-15-to-atarist.txt
> 
> (-s "strict" option removes code points which don't have a mapping.)
> 
> 
> At run-time iconv() is a nicer alternative: it's already part
> of glibc, so it doesn't add a library dependency, and its API looks
> much nicer.  You give the from/to encodings to iconv_open() and
> the input & output buffers to iconv() calls.
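
A minimal sketch of that call sequence, assuming the glibc iconv() prototype. The helper name convert_name() is hypothetical. It is shown with whatever encoding names the caller passes in and does not assume that an Atari encoding is available.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert src_len bytes of 'src' from encoding 'from'
 * to encoding 'to', writing a NUL-terminated result into 'dst'.
 * Returns the number of bytes written (without the NUL), or (size_t)-1 if
 * the encodings are unsupported, the input is invalid, or 'dst' is too small. */
static size_t convert_name(const char *from, const char *to,
                           const char *src, size_t src_len,
                           char *dst, size_t dst_size)
{
    assert(dst != NULL && dst_size > 0);

    /* Note the argument order: destination encoding first. */
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *in = (char *)src;          /* iconv() takes a non-const pointer */
    size_t in_left = src_len;
    char *out = dst;
    size_t out_left = dst_size - 1;  /* reserve room for the NUL */

    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    *out = '\0';
    return (size_t)(out - dst);
}
```

For example, convert_name("ISO-8859-1", "UTF-8", name, strlen(name), buf, sizeof buf) converts a Latin-1 file name to UTF-8.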
> 
> The issue with iconv() is that at least a quick listing of the
> encodings it supports didn't seem to include an Atari encoding;
> hopefully I'm wrong.  And I'm not sure what one would use
> on Windows.
> 
> 
> 	- Eero
> 
> 



