[illumos-Developer] Request for Advice: Unicode/language expert opinions

Tue May 10 14:41:07 PDT 2011

Case folding in Unicode is a bit different: it's about creating a case-insensitive match, which is not the same as converting back and forth between cases.
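
A minimal sketch of that distinction in C. The table is a hand-picked four-entry excerpt of fold mappings, and fold_tab, fold_cp, and caseless_equal are hypothetical names for illustration, not existing libc or kernel interfaces. The point is that folding sends both capital sigma and the already-lowercase final sigma to the same code point, which is a matching operation rather than a case conversion:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical excerpt of status-C entries; a real table is far larger. */
static const struct {
	uint32_t from;
	uint32_t to;
} fold_tab[] = {
	{ 0x00B5, 0x03BC },	/* MICRO SIGN -> GREEK SMALL LETTER MU */
	{ 0x017F, 0x0073 },	/* LATIN SMALL LETTER LONG S -> s */
	{ 0x03A3, 0x03C3 },	/* GREEK CAPITAL SIGMA -> small sigma */
	{ 0x03C2, 0x03C3 },	/* GREEK SMALL FINAL SIGMA -> small sigma */
};

/* Fold one code point; anything not covered is returned unchanged. */
static uint32_t
fold_cp(uint32_t cp)
{
	size_t i;

	if (cp >= 'A' && cp <= 'Z')
		return (cp + 0x20);	/* ASCII shortcut */
	for (i = 0; i < sizeof (fold_tab) / sizeof (fold_tab[0]); i++) {
		if (fold_tab[i].from == cp)
			return (fold_tab[i].to);
	}
	return (cp);
}

/* Compare two code point arrays of length n after folding each element. */
static int
caseless_equal(const uint32_t *a, const uint32_t *b, size_t n)
{
	size_t i;

	assert(a != NULL && b != NULL);
	for (i = 0; i < n; i++) {
		if (fold_cp(a[i]) != fold_cp(b[i]))
			return (0);
	}
	return (1);
}
```

Here U+03A3 and U+03C2 both fold to U+03C3, so two strings that differ only in those code points compare equal even though neither is simply the other's upper or lower case form.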

I think Yuri's approach on this is sane.

Gordon Ross <gordon.w.ross at gmail.com> wrote:

>On Mon, May 9, 2011 at 12:18 AM, Garrett D'Amore <garrett at nexenta.com> wrote:
>> So we have a "bug" today where some character mappings within UTF-8
>> based locales are not the same across languages.
>
>Yes, I've observed that problem.  Our "tolower" for en_US.UTF-8
>has upper/lower mappings only for the Latin alphabet.
>
>> So, we can make it so that iswalpha() reports true for *all*
>> alphabetical characters in *all* languages for UTF-8 locales, and we can
>> provide towupper and towlower mappings for all characters in Unicode
>> that have such a mapping independent of language...
>>
>> Is there any reason we should not do this?
>
>Yes, when the locale is *.UTF-8, one expects to be able to use
>any UTF-8 characters, not just the local subset.
>
>> Note that the Unicode organization does not provide CLDR data this way
>> -- they seem to only include the characters that make sense for the
>> language represented by a given localedef input file...
>
>Actually, they do provide the full case-folding data here:
>  http://unicode.org/Public/UNIDATA/CaseFolding.txt
>
>Simple upper/lower conversions like ctype.h implements
>should use only the "common" and "simple" mappings in
>that table, as described in the comments at the top.
>
>Note that there's also a general need for Unicode
>toupper / tolower functions outside of locale library
>support.  That functionality is currently provided by
><sys/u8_textprep.h>, which is compiled into both
>libc and the kernel.  We should check whether that
>implements what we need and let the ctype support
>use that, or otherwise somehow combine them.
>
>> But other OS implementations seem to handle towlower and towupper (and I
>> presume the character classification functions as well) universally for
>> UTF-8.
>
>Yes, and we should too.
>
>Thanks,
>Gordon
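
For reference, the "common" and "simple" filtering Gordon describes could look roughly like the sketch below. It assumes the CaseFolding.txt field layout "<code>; <status>; <mapping>; # <name>" and keeps only status C and S lines, skipping F (full) and T (Turkic) entries. The function name parse_simple_fold is hypothetical, and this is not what libc or u8_textprep does today:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse one line of CaseFolding.txt.
 * Returns 1 and fills *code and *mapping only when the status field is
 * C (common) or S (simple).  Returns 0 for comment lines, blank lines,
 * and F or T entries, leaving the outputs untouched.
 */
static int
parse_simple_fold(const char *line, unsigned long *code,
    unsigned long *mapping)
{
	unsigned long c, m;
	char status;

	assert(line != NULL);
	if (sscanf(line, "%lx; %c; %lx;", &c, &status, &m) != 3)
		return (0);
	if (status != 'C' && status != 'S')
		return (0);
	*code = c;
	*mapping = m;
	return (1);
}
```

A generator for a language-independent towlower-style table could call this on every line of the file and collect the accepted pairs. Note that these are fold mappings, so whether they are sufficient for towupper/towlower as well is something that would still need checking.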