[illumos-Developer] Request for Advice: Unicode/language expert opinions

Garrett D'Amore garrett at nexenta.com
Sun May 8 21:18:39 PDT 2011


So we have a "bug" today where some character mappings within UTF-8
based locales are not the same across languages.

For example:

In ru_RU.UTF-8:

д -> Д

(That's the Russian letter "D" in lower and upper case.)

Both are identified there as lower and upper case characters,
respectively.  However, in en_US, those characters are not identified as
alphabetic characters at all.
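
A minimal sketch of how the discrepancy could be observed from C follows.
It is an illustration only: probe_locale() and struct probe are
hypothetical names, and the sketch assumes that wchar_t values are Unicode
code points in UTF-8 locales and that the locales being probed are
installed.

```c
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

/* Result of classifying one wide character under one locale. */
struct probe {
    int    alpha;  /* 1 if iswalpha() reported true, else 0 */
    wint_t upper;  /* value returned by towupper() */
    wint_t lower;  /* value returned by towlower() */
};

/*
 * Hypothetical helper: switch LC_CTYPE to `name` and classify `wc`.
 * Returns 0 on success, or -1 if setlocale() rejects the locale name.
 * Assumption: wc is a Unicode code point (e.g. 0x0434 for the Cyrillic
 * small letter DE), which holds only where wchar_t is UCS-based.
 */
int probe_locale(const char *name, wint_t wc, struct probe *out)
{
    if (setlocale(LC_CTYPE, name) == NULL)
        return -1;

    out->alpha = iswalpha(wc) != 0;
    out->upper = towupper(wc);
    out->lower = towlower(wc);
    return 0;
}
```

Calling probe_locale("ru_RU.UTF-8", 0x0434, &p) and then
probe_locale("en_US.UTF-8", 0x0434, &p), and comparing p.alpha and p.upper
after each call, would show whether the two locales disagree about the
character.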

Apparently, there are no conflicting mappings in any of the locales we
have data for... the unique code points in Unicode seem to deal with this.
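
One way such a check for conflicts could be sketched is shown here.  This
is not the method actually used to reach the conclusion above;
count_upper_conflicts() is a hypothetical name, and a "conflict" is assumed
to mean a code point that both locales map to upper case, but to different
results.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: count the code points in [first, last] for which
 * both locales define an upper-case mapping and the two mappings differ.
 * Returns the count, or -1 if the range is empty, memory cannot be
 * allocated, or setlocale() rejects either locale name.
 */
long count_upper_conflicts(const char *loc_a, const char *loc_b,
                           wint_t first, wint_t last)
{
    if (last < first)
        return -1;

    size_t n = (size_t)(last - first) + 1;
    wint_t *saved = malloc(n * sizeof *saved);
    if (saved == NULL)
        return -1;

    long conflicts = -1;
    if (setlocale(LC_CTYPE, loc_a) != NULL) {
        /* Record the mappings of the first locale. */
        for (size_t i = 0; i < n; i++)
            saved[i] = towupper(first + (wint_t)i);

        if (setlocale(LC_CTYPE, loc_b) != NULL) {
            conflicts = 0;
            for (size_t i = 0; i < n; i++) {
                wint_t wc = first + (wint_t)i;
                wint_t b = towupper(wc);
                /* Both locales map wc, but to different targets. */
                if (saved[i] != wc && b != wc && saved[i] != b)
                    conflicts++;
            }
        }
    }

    free(saved);
    return conflicts;
}
```

A code point that only one of the two locales maps is deliberately not
counted, since that is the gap the universal tables would fill rather than
a conflict.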

So, we can make it so that iswalpha() reports true for *all*
alphabetical characters in *all* languages for UTF-8 locales, and we can
provide towupper and towlower mappings for all characters in Unicode
that have such a mapping independent of language...

Is there any reason we should not do this?

Note that the Unicode Consortium does not provide CLDR data this way
-- they seem to only include the characters that make sense for the
language represented by a given localedef input file...

But other OS implementations seem to handle towlower and towupper (and I
presume the character classification functions as well) universally for
UTF-8.

I would really, really like to hear if anyone has a concern here and
believes we should *not* provide the full mappings in all UTF-8 locales.

	- Garrett



