[illumos-Developer] Request for Advice: Unicode/language expert opinions
Gordon Ross
gordon.w.ross at gmail.com
Tue May 10 14:32:28 PDT 2011
On Mon, May 9, 2011 at 12:18 AM, Garrett D'Amore <garrett at nexenta.com> wrote:
> So we have a "bug" today where some character mappings within UTF-8
> based locales are not the same across languages.
Yes, I've observed that problem. Our "tolower" for en_US.UTF-8
has upper/lower mappings only for the Latin alphabet.
> So, we can make it so that iswalpha() reports true for *all*
> alphabetical characters in *all* languages for UTF-8 locales, and we can
> provide towupper and towlower mappings for all characters in Unicode
> that have such a mapping independent of language...
>
> Is there any reason we should not do this?
No, I see no reason not to. When the locale is *.UTF-8, one
expects to be able to use any UTF-8 characters, not just the
subset for the local language.
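To illustrate what language-independent behavior looks like, here
is a small sketch in Python (used only for illustration; this is not
illumos code). Python's str methods take their data from the Unicode
Character Database rather than from the process locale, which is
roughly the behavior being proposed for iswalpha(), towupper() and
towlower() in *.UTF-8 locales. The helper name classify_and_lower
is hypothetical.

```python
def classify_and_lower(ch):
    """Return (is_alphabetic, lowercase form) for a single character.

    Both answers come from Unicode data, not from the current locale.
    """
    return ch.isalpha(), ch.lower()


if __name__ == "__main__":
    # Latin A, Greek Omega, Cyrillic De, Hiragana A (which has no case).
    for ch in ["A", "\u03a9", "\u0414", "\u3042"]:
        print("U+%04X" % ord(ch), classify_and_lower(ch))
```

The Greek and Cyrillic capitals are reported as alphabetic and get
lowercase forms regardless of which language the locale names; the
Hiragana character is alphabetic but has no case mapping, so it is
returned unchanged.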
> Note that the Unicode organization does not provide CLDR data this way
> -- they seem to only include the characters that make sense for the
> language represented by a given localedef input file...
Actually, they do provide the full case-folding data here:
http://unicode.org/Public/UNIDATA/CaseFolding.txt
Simple upper/lower conversions like those ctype.h implements
should use only the "common" (status C) and "simple" (status S)
mappings in that table, as described in the comments at the top.
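As a rough sketch of that filtering (Python, for illustration only,
not a proposed implementation): each data line in CaseFolding.txt
has the form "code; status; mapping; # name", and only lines with
status C or S map one character to exactly one character. The
excerpt embedded below stands in for the real file, and the names
parse_simple_folds and simple_fold are hypothetical. Note that case
folding mostly maps toward lowercase but is not identical to a
towlower table, so this shows the filtering step only.

```python
# A few lines in the CaseFolding.txt format, embedded so the sketch
# runs without downloading the file.
SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def parse_simple_folds(text):
    """Build {code point: folded code point} from C and S lines only."""
    table = {}
    for line in text.splitlines():
        # Drop the trailing comment and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 3:
            continue
        code, status, mapping = fields[0], fields[1], fields[2]
        # Skip F (full, multi-character) and T (Turkic special case).
        if status not in ("C", "S"):
            continue
        table[int(code, 16)] = int(mapping, 16)
    return table


FOLDS = parse_simple_folds(SAMPLE)


def simple_fold(ch):
    """Fold one character; characters with no entry map to themselves."""
    return chr(FOLDS.get(ord(ch), ord(ch)))
```

With this excerpt, the F and T lines are ignored, so U+00DF maps to
itself while U+1E9E folds to U+00DF through its S entry.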
Note that there's also a general need for Unicode
toupper / tolower functions outside of locale library
support. That functionality is currently provided by
<sys/u8_textprep.h>, which is compiled into both
libc and the kernel. We should check whether that
implements what we need and let the ctype support
use that, or otherwise somehow combine them.
> But other OS implementations seem to handle towlower and towupper (and I
> presume the character classification functions as well) universally for
> UTF-8.
Yes, and we should too.
Thanks,
Gordon