RE: [AD] Proposal to kill non-UTF-8 support


> But Klingon was rejected recently, although Egyptian
> hieroglyphics are in.

But Tengwar still holds a chance :)

> On a serious note, isn't UTF-8 the most troublesome of all because
> of the variable length and potential to be >32 bits? I would guess
> (without knowledge) that UCS-4 at least is easy or even necessary
> as a stage in conversions.

The nice thing about UTF-8 is that ASCII strings, and there are
many of them, need no conversion: every ASCII string is already
valid UTF-8. It can be very difficult to decide whether something
should be converted or not, and UTF-8 sidesteps that problem.
For example: if you get a filename, should it be in ASCII or
not? And when do you convert it? When you display it? When you
pass it back and forth to the OS? What about code that needs to
find data in it? This is a real pain to deal with, especially in
plain C, where you do not have a dedicated class for i18n strings
that you could swap in for std::string. Of course, going the easy
way means your code isn't really correct, but it is more robust.
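
To illustrate the "no conversion needed" point, here is a minimal
sketch in C. The function name is_plain_ascii is made up for this
example and is not part of Allegro. It only checks that every byte
is below 0x80; a string that passes is already valid UTF-8, byte
for byte, and can be handed to UTF-8 code as is.

```c
#include <stddef.h>

/* Hypothetical helper, not an Allegro function.
 * Returns 1 if every byte of s is 7-bit ASCII (0x00-0x7F), else 0.
 * A string that passes is already valid UTF-8 without any conversion. */
int is_plain_ascii(const char *s)
{
    for (; *s != '\0'; s++) {
        /* Cast first: plain char may be signed, so bytes >= 0x80
         * would otherwise compare as negative values. */
        if ((unsigned char)*s > 0x7F)
            return 0;
    }
    return 1;
}
```

A filename such as "readme.txt" passes; one containing the UTF-8
bytes for an accented letter does not, and that is the only case
where the "convert or not" question comes up at all.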
Also, please note that the "value" (or codepoint) of a Unicode
character is not the same thing as its representation in a
string. UTF-8 is one way (among quite a few, see unicode.org) to
encode Unicode codepoints, UCS-4 and UCS-2LE being two others,
but the characters represented (and thus their "values") are the
same. Thus, a codepoint can fit in a 32-bit variable (I believe
all Unicode codepoints do) and still take several bytes when
encoded as UTF-8: up to 6 in the original definition, or 4 since
RFC 3629 capped UTF-8 at U+10FFFF.
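
The difference between a codepoint's value and its encoded form
can be sketched in C as below. The name utf8_encode is an
assumption for this example, not an Allegro function. The sketch
follows the current UTF-8 definition (RFC 3629), which stops at
U+10FFFF and 4 bytes; the older definition allowed longer
sequences for larger values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an Allegro function.
 * Encodes codepoint cp as UTF-8 into out and returns the number of
 * bytes written (1 to 4), or 0 if cp is a surrogate or above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                       /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx + three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* outside the Unicode range */
}
```

Here U+0041 stays a single byte, while U+20AC (the euro sign)
fits easily in a 32-bit variable but becomes the three bytes
E2 82 AC once encoded.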

On a side note, I believe UTF-16 is roughly the same as UCS-2
(not taking endianness into account) and UTF-32 is the same as
UCS-4. Can anybody confirm or deny?

--
Vincent Penquerc'h


