[illumos-Developer] Request for Advice: Unicode/language expert opinions

Tue May 10 14:32:28 PDT 2011

On Mon, May 9, 2011 at 12:18 AM, Garrett D'Amore <garrett at nexenta.com> wrote:
> So we have a "bug" today where some character mappings within UTF-8
> based locales are not the same across languages.

Yes, I've observed that problem.  Our "tolower" for en_US.UTF-8
has upper/lower mappings only for the Latin alphabet.
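
One rough way to observe this is to probe towlower() under a given
locale. The sketch below is only illustrative: probe_towlower is a
hypothetical helper, not an existing libc function, and it assumes that
wide characters hold Unicode code points in UTF-8 locales.

```c
#include <locale.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not part of libc): switch LC_CTYPE to `locale`
 * and report whether towlower() maps `upper` to `lower` there.
 * Returns 1 if mapped, 0 if not, -1 if the locale cannot be selected.
 * Note: the process is left in the probed locale on return.
 */
int
probe_towlower(const char *locale, wint_t upper, wint_t lower)
{
	if (setlocale(LC_CTYPE, locale) == NULL)
		return (-1);
	return (towlower(upper) == lower ? 1 : 0);
}
```

For example, probe_towlower("en_US.UTF-8", 0x0410, 0x0430) asks whether
CYRILLIC CAPITAL LETTER A lowercases to CYRILLIC SMALL LETTER A in that
locale. The result depends on the system's locale data.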

> So, we can make it so that iswalpha() reports true for *all*
> alphabetical characters in *all* languages for UTF-8 locales, and we can
> provide towupper and towlower mappings for all characters in Unicode
> that have such a mapping independent of language...
>
> Is there any reason we should not do this?

No, I don't see one.  When the locale is *.UTF-8, one expects to be
able to use any Unicode characters, not just the local subset.

> Note that the Unicode organization does not provide CLDR data this way
> -- they seem to only include the characters that make sense for the
> language represented by a given localedef input file...

Actually, they do provide the full case-folding data here:
  http://unicode.org/Public/UNIDATA/CaseFolding.txt

Simple upper/lower conversions like those ctype.h implements
should use only the "common" (C) and "simple" (S) mappings in
that table, as described in the comments at the top.
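
As a minimal sketch of that filtering, assuming data lines of the form
"<code>; <status>; <mapping>; # <name>" as documented in the file's
header comments, something like the following would keep only the
one-to-one entries. parse_simple_fold is a hypothetical name, not an
existing illumos function.

```c
#include <stdio.h>

/*
 * Hypothetical helper: parse one line of CaseFolding.txt.
 * Returns 1 and fills *from and *to when the status is C (common)
 * or S (simple).  Returns 0 for comments, blank lines, and entries
 * with status F (full, multi-character) or T (Turkic special case).
 */
int
parse_simple_fold(const char *line, unsigned long *from, unsigned long *to)
{
	unsigned long src, dst;
	char status;

	/* Comment and blank lines fail to match the first hex field. */
	if (sscanf(line, "%lx; %c; %lx", &src, &status, &dst) != 3)
		return (0);
	if (status != 'C' && status != 'S')
		return (0);
	*from = src;
	*to = dst;
	return (1);
}
```

Feeding each line of the file through a helper like this would yield
the pairs needed for a locale-independent simple folding table. Building
the actual towupper/towlower tables would still need more than this.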

Note that there's also a general need for Unicode
toupper / tolower functions outside of locale library
support.  That functionality is currently provided by
<sys/u8_textprep.h>, which is compiled into both
libc and the kernel.  We should check whether that
implements what we need and let the ctype support
use that, or otherwise somehow combine them.

> But other OS implementations seem to handle towlower and towupper (and I
> presume the character classification functions as well) universally for
> UTF-8.

Yes, and we should too.

Thanks,
Gordon