[illumos-Developer] webrev: POSIX style localedef & multibyte encoding support

Sat Oct 2 00:35:33 PDT 2010

I've made some major changes to the way locale data are generated, and
we now have a new localedef, implemented from scratch by me, that is
about 95% compliant with POSIX.

I need a review:

http://mexico.purplecow.org/gdamore/webrev/localedef/

This is not the final version, as there are packaging and locale
*datafile* changes still to be made, but I expect the program itself to
be largely unchanged from this point until I integrate.

So I'm giving access to the webrev so people can look at it.  Note that
although the deltas are huge, that is largely because of a few big
files.  That said, localedef itself is about 5000 lines of code,
including about 650 lines of yacc grammar.  I'd appreciate thoughtful
review.

Also, please pay attention to the collation code in libc, which I've
overhauled heavily too, so that it supports multibyte locales as well
as the full feature set required for POSIX collating rules.
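
For anyone who wants to exercise the collation paths while reviewing,
here is a minimal sketch that sorts strings in the current locale's
collating order using only the standard strcoll() and qsort()
interfaces.  The function names are made up for illustration and are
not taken from the webrev.

```c
#include <stdlib.h>
#include <string.h>

/*
 * qsort() comparator.  Each array element is a `char *`.  strcoll()
 * compares according to the LC_COLLATE category of the current locale.
 */
static int
collate_cmp(const void *a, const void *b)
{
	const char *sa = *(const char *const *)a;
	const char *sb = *(const char *const *)b;

	return (strcoll(sa, sb));
}

/* Sort an array of strings in the current locale's collating order. */
void
sort_collated(char **lines, size_t n)
{
	qsort(lines, n, sizeof (lines[0]), collate_cmp);
}
```

A caller would normally run setlocale(LC_COLLATE, "") first (from
<locale.h>) so that the environment selects the locale.  Without that
call the program stays in the "C" locale, where strcoll() orders
strings the same way strcmp() does.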

What's missing from this localedef implementation:

	-> support for toggling failure mode on failures
	   (currently warnings are always non-fatal and errors are
	   always fatal)

	-> support for ellipsis without a terminating character symbol
	   in the collating order.  (Use UNDEFINED to achieve the same
	   effect.)

	-> localedef's own messages are, ironically, not i18n'ified.
	   (Someone can add textdomain() and gettext() calls at the
	   right point later.)

	-> user defined "character classes" (for use with wctype()).
	   Our C library lacks the support.

	-> only certain named multibyte encodings are supported.
	   UTF-8 is one of them.  Do any of the others really matter
	   anymore?

	-> the EUC encodings for CJK need to be revamped slightly in
	   libc, as we no longer support the variable data required
	   under the main "EUC" encoding.  (We'll need to add specific
	   C code for each of the main EUC variants, which is pretty
	   darn easy.  See wide.c in localedef for sample code.)
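
To give a feel for how small the per-variant EUC code can be, here is
a rough sketch of the byte-length logic for one variant, EUC-JP.  This
is not the code from wide.c.  The function name and return convention
are hypothetical, and only the standard EUC-JP byte layout is assumed
(ASCII, two-byte JIS X 0208, SS2 plus one byte, SS3 plus two bytes).

```c
#include <stddef.h>

#define	SS2	0x8e	/* single shift 2: one half-width katakana byte follows */
#define	SS3	0x8f	/* single shift 3: two JIS X 0212 bytes follow */

/*
 * Hypothetical helper: return the number of bytes in the EUC-JP
 * character starting at s, where n bytes are available.  Returns 0 if
 * the character is truncated and -1 if the sequence is invalid.
 */
int
eucjp_char_len(const unsigned char *s, size_t n)
{
	size_t need, i;

	if (n == 0)
		return (0);
	if (s[0] < 0x80)
		return (1);		/* plain ASCII */
	if (s[0] == SS3)
		need = 3;
	else if (s[0] == SS2 || (s[0] >= 0xa1 && s[0] <= 0xfe))
		need = 2;
	else
		return (-1);		/* not a valid lead byte */
	if (n < need)
		return (0);		/* need more input */
	for (i = 1; i < need; i++) {
		if (s[i] < 0xa1 || s[i] > 0xfe)
			return (-1);
	}
	/* After SS2 only the katakana range 0xa1-0xdf is valid. */
	if (s[0] == SS2 && s[1] > 0xdf)
		return (-1);
	return ((int)need);
}
```

A real conversion routine would also have to map the bytes to a wide
character value and keep state across calls.  The sketch only covers
the framing step that differs between EUC variants.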

That said, this implementation is very nice, I think.  Here are some
things it has:

	-> support for all of the CLDR UTF-8 localedef sources I could
	   find (including some really weird ones).

	-> full POSIX collating semantics for regular and multibyte
	   locales.

	-> support for substitutions up to 24 elements long.  This was
	   required because some Arabic locale had ridiculously long
	   substitution lists.  I don't understand how they collate. :-)

	-> support for up to 9 collating levels.  (The 10th level
	   -- or whatever is the last level -- is used internally for
	   the UNDEFINED semantics required by POSIX.)

	-> the code is fully CDDL'd and should be quite readable.  If
	   you don't like it, blame me.

	-> reasonably fast.  Collating UTF-8 is expensive, but in my
	   tests it performed about as well as the existing locales
	   when sorting moderately sized data files.
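
For readers less familiar with multi-level collation, the idea behind
the collating levels can be sketched as below.  This is a toy
illustration, not the libc code in the webrev.  It assumes one byte per
collating element, ASCII input, forward-only comparison at every level,
and a made-up two-level weight table (letters first, case as the
tie-breaker).  A weight of 0 stands in for IGNORE.

```c
#define	NLEVELS	2	/* the real code allows up to 9; 2 keeps this short */

/* Hypothetical per-byte weights; a weight of 0 means IGNORE. */
typedef struct {
	int w[NLEVELS];
} weights_t;

/* Fill a 256-entry toy table: level 0 = letter, level 1 = case. */
void
toy_table_init(weights_t *tbl)
{
	int c;

	for (c = 0; c < 256; c++)
		tbl[c].w[0] = tbl[c].w[1] = 0;	/* non-letters are ignored */
	for (c = 0; c < 26; c++) {		/* assumes ASCII */
		tbl['a' + c].w[0] = tbl['A' + c].w[0] = c + 1;
		tbl['a' + c].w[1] = 1;		/* lower case sorts first */
		tbl['A' + c].w[1] = 2;
	}
}

/* Compare the weight sequences of a and b at a single level. */
static int
cmp_one_level(const weights_t *tbl, const unsigned char *a,
    const unsigned char *b, int lvl)
{
	for (;;) {
		/* Skip elements that are IGNOREd at this level. */
		while (*a != '\0' && tbl[*a].w[lvl] == 0)
			a++;
		while (*b != '\0' && tbl[*b].w[lvl] == 0)
			b++;
		if (*a == '\0' || *b == '\0')
			return ((*a != '\0') - (*b != '\0'));
		if (tbl[*a].w[lvl] != tbl[*b].w[lvl])
			return (tbl[*a].w[lvl] < tbl[*b].w[lvl] ? -1 : 1);
		a++;
		b++;
	}
}

/* The first level at which the strings differ decides the order. */
int
multilevel_cmp(const weights_t *tbl, const char *a, const char *b)
{
	int lvl, r;

	for (lvl = 0; lvl < NLEVELS; lvl++) {
		r = cmp_one_level(tbl, (const unsigned char *)a,
		    (const unsigned char *)b, lvl);
		if (r != 0)
			return (r);
	}
	return (0);
}
```

With this table "a" sorts before "A" only because the second level
breaks the tie, while "Ab" sorts after "aa" because the first level
already differs.  Real POSIX rules also allow backward and position
directives and multi-character collating elements, which the sketch
leaves out.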

Anyway, I am hopeful that you will find the code useful and worth
reviewing.  I'd like to integrate this code in the next week or two.
The sooner I integrate, the sooner you can have support for your
favorite language. :-)

I *would* like to know about anyone who has specific needs for non-UTF-8
locale support.  The UTF-8 locales come very easily.  I think most of
the others will too, but I don't want to waste time doing the work if
nobody wants them.  (For example, I've been told that KOI8 encodings are
totally obsolete now.)

Thanks!

	- Garrett