[illumos-Developer] Request for Advice: Unicode/language expert opinions
Garrett D'Amore
garrett at nexenta.com
Tue May 10 14:41:07 PDT 2011
Case folding in Unicode is a bit different... it's about creating a case-insensitive match, which is not the same as going back and forth between cases.
I think Yuri's approach on this is sane.
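
To make the folding-versus-conversion distinction concrete, here is a minimal sketch. It is not from the thread: it uses Python's built-in Unicode string methods purely for illustration, and `caseless_equal` is a hypothetical helper name. It says nothing about how libc does or should behave.

```python
# Case folding exists to compare strings without regard to case.
# Case conversion (upper/lower) exists to change how text is displayed.
# The two give different answers for characters such as U+00DF (sharp s).

def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using Unicode full case folding."""
    return a.casefold() == b.casefold()

# Folding treats these as the same word.
print(caseless_equal("Straße", "STRASSE"))        # True

# Lowercasing both sides does not: "straße" vs "strasse".
print("Straße".lower() == "STRASSE".lower())      # False

# Going back and forth between cases does not round-trip either:
# upper("ß") is "SS", and lower("SS") is "ss", not "ß".
print("ß".upper().lower())                        # ss
```

The last line is the point of the remark above: a round trip through upper and lower case can change the text, so a case-insensitive match has to use a dedicated folding step rather than either conversion.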
Gordon Ross <gordon.w.ross at gmail.com> wrote:
>On Mon, May 9, 2011 at 12:18 AM, Garrett D'Amore <garrett at nexenta.com> wrote:
>> So we have a "bug" today where some character mappings within UTF-8
>> based locales are not the same across languages.
>
>Yes, I've observed that problem. Our "tolower" for en_US.UTF-8
>has upper/lower mappings only for the Latin alphabet.
>
>> So, we can make it so that iswalpha() reports true for *all*
>> alphabetical characters in *all* languages for UTF-8 locales, and we can
>> provide towupper and towlower mappings for all characters in Unicode
>> that have such a mapping independent of language...
>>
>> Is there any reason we should not do this?
>
>Yes, when the locale is *.UTF-8, one expects to be able to use
>any UTF-8 characters, not just the local subset.
>
>> Note that the Unicode organization does not provide CLDR data this way
>> -- they seem to only include the characters that make sense for the
>> language represented by a given localedef input file...
>
>Actually, they do provide the full case-folding data here:
> http://unicode.org/Public/UNIDATA/CaseFolding.txt
>
>Simple upper/lower conversions like ctype.h implements
>should use only the "common" and "simple" mappings in
>that table, as described in the comments at the top.
>
>Note that there's also a general need for Unicode
>toupper / tolower functions outside of locale library
>support. That functionality is currently provided by
><sys/u8_textprep.h>, which is compiled into both
>libc and the kernel. We should check whether that
>implements what we need and let the ctype support
>use that, or otherwise somehow combine them.
>
>> But other OS implementations seem to handle towlower and towupper (and I
>> presume the character classification functions as well) universally for
>> UTF-8.
>
>Yes, and we should too.
>
>Thanks,
>Gordon
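
The selection Gordon describes, keeping only the "common" and "simple" entries of CaseFolding.txt, can be sketched as follows. This is an illustration under stated assumptions, not code from illumos: `load_simple_folds` and `simple_fold` are hypothetical names, and `SAMPLE` is a handful of lines written in the layout of the published file (`code; status; mapping; # name`) rather than the full table. Statuses C and S are one-to-one mappings; F (full) entries can expand to several characters and T entries are the Turkic special cases, so both are skipped.

```python
# A few lines in the CaseFolding.txt layout; the real file is much longer.
SAMPLE = """\
# <code>; <status>; <mapping>; # <name>
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""

def load_simple_folds(text):
    """Return {code point: folded code point} for status C and S entries."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        code, status, mapping = fields[0], fields[1], fields[2]
        if status in ("C", "S"):               # one-to-one mappings only
            table[int(code, 16)] = int(mapping, 16)
    return table

def simple_fold(ch, table):
    """Fold a single character; characters without an entry are unchanged."""
    return chr(table.get(ord(ch), ord(ch)))

table = load_simple_folds(SAMPLE)
print(sorted(hex(cp) for cp in table))   # ['0x1e9e', '0x41']
print(simple_fold("A", table))           # a
```

Because every kept entry maps one code point to one code point, a table built this way fits a per-character interface of the kind ctype.h exposes. As the top of this message notes, what it yields is a folding for matching, which is not necessarily identical to a tolower mapping.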