[illumos-Developer] webrev: POSIX style localedef & multibyte encoding support

Sun Oct 3 02:08:49 PDT 2010

Updated review relative to yesterday's post:
http://mexico.purplecow.org/gdamore/webrev/localedef2/

Review of all the latest changes against illumos-gate:
http://mexico.purplecow.org/gdamore/webrev/localedef-latest/

Detail on the updates:

1) I have added locale data for the English locales that were missing,
and eliminated mklocale and friends.  (I removed US-ASCII without
replacing it.  If you want 7-bit support, use the POSIX or C locale.
Otherwise, you really want ISO-8859-1 or -15.)

2) The data is now packaged.  Yippie.

3) I fixed some complaints about edge cases.  CHARMAPs that had
permitted trailing text ("pseudo comments") are cleaned up.

4) I enabled i18n in localedef itself, and added a _msgs target.

5) The makefile is now committed to hg.  Oops!

6) I've added a new flag, -U, and support for the POSIX-required -c.  -U
helps with these 8859 locales that don't have all the characters: that
way I can use the much richer UTF-8 locale data, but just strip the data
for characters that the character set does not define.  There was some
clever work needed for that.  (Some grammar tweaks.)
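
A rough sketch of the idea behind -U, not the actual localedef code:
entries in the locale source whose character symbols are missing from
the charmap are dropped quietly instead of being treated as errors.  The
struct, the two-entry charmap, and the function names are all
hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical charmap entry: symbolic name -> encoded byte. */
struct charmap_entry {
	const char *symbol;
	unsigned char byte;
};

/* Tiny illustrative subset of an ISO-8859-1 style charmap. */
const struct charmap_entry charmap[] = {
	{ "<U0041>", 0x41 },	/* LATIN CAPITAL LETTER A */
	{ "<U00E9>", 0xE9 },	/* LATIN SMALL LETTER E WITH ACUTE */
};

/* Return 1 if the charmap defines the symbol, 0 otherwise. */
int
charmap_defines(const char *symbol)
{
	size_t i;

	assert(symbol != NULL);
	for (i = 0; i < sizeof (charmap) / sizeof (charmap[0]); i++) {
		if (strcmp(charmap[i].symbol, symbol) == 0)
			return (1);
	}
	return (0);
}

/*
 * Compact syms[] in place, keeping only symbols the charmap defines.
 * Returns the number of symbols kept.  Symbols the charmap does not
 * know are skipped without raising an error.
 */
size_t
strip_undefined(const char **syms, size_t n)
{
	size_t i, kept = 0;

	for (i = 0; i < n; i++) {
		if (charmap_defines(syms[i]))
			syms[kept++] = syms[i];
	}
	return (kept);
}
```

Given the symbols <U0041>, <U0416> (a Cyrillic letter) and <U00E9>,
strip_undefined() keeps the first and last and drops the Cyrillic one,
which an 8859-1 charmap cannot encode.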

It's feeling pretty polished at this point.  The only complaint I know of
is that the "..." ellipsis handling doesn't quite fully conform.  But we
don't use it anyway, even in the numerous ways that *would* work fine.

Once I integrate this change, it will be a fairly trivial matter to add
support for pretty much any locale you like. :-)  Any of about 372 UTF-8
locales are easy.  Any 8859 or KOI8 locale is easy.  Other encodings are
easy *if* I can get a character map for them.  (GB18030 is probably the
most painful of those.)

	- Garrett

PS: I think I could still make some hacks to make collation, and hence
sort, run faster.  I will probably do that after integration.
(Specifically, I think I can use shorter sort keys by recording the max
priority seen, and separate the priorities for each level.  I can also
get more clever doing strxfrm -- the code does some extra allocations to
convert back and forth with wide character sets so it can use a common
implementation, but I think I could add a native 8-bit implementation of
collate_xfrm.)
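
For readers who have not used the interface in question: strxfrm() turns
a string into a sort key, so that strcmp() on two keys orders them the
same way strcoll() orders the original strings.  Shorter keys mean less
memory and cheaper comparisons, which is why key length matters.  Below
is a minimal sketch of the standard usage pattern.  make_sort_key() and
compare_via_keys() are illustrative names, not libc functions, and none
of this is the libc implementation under review.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build a malloc'd sort key for s in the current LC_COLLATE locale.
 * Returns NULL if allocation fails.  The caller frees the result.
 */
char *
make_sort_key(const char *s)
{
	size_t need;
	char *key;

	assert(s != NULL);
	/* With n == 0, strxfrm() only reports the key length. */
	need = strxfrm(NULL, s, 0) + 1;
	key = malloc(need);
	if (key != NULL)
		(void) strxfrm(key, s, need);
	return (key);
}

/*
 * Compare two strings through their sort keys.  Returns a value with
 * the same sign strcoll(a, b) would have, or 0 if allocation fails.
 */
int
compare_via_keys(const char *a, const char *b)
{
	char *ka = make_sort_key(a);
	char *kb = make_sort_key(b);
	int r = 0;

	if (ka != NULL && kb != NULL)
		r = strcmp(ka, kb);
	free(ka);
	free(kb);
	return (r);
}
```

In a real sort, the keys would be computed once per string and reused
across comparisons; recomputing them per comparison, as
compare_via_keys() does, is only for brevity.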

On Sat, 2010-10-02 at 00:35 -0700, Garrett D'Amore wrote:
> I've made some major changes to the way locale data are generated, and
> we now have a freshly implemented (from scratch by me) localedef
> implementation that is about 95% compliant with POSIX. 
> 
> I need a review:
> 
> http://mexico.purplecow.org/gdamore/webrev/localedef/
> 
> This is not the final version, as there are packaging and locale
> *datafile* changes to be made, but I expect that the program itself will
> be largely unchanged at this point, when I integrate.
> 
> So I'm giving access to the webrev so people can look at it.  Note that
> although the deltas are huge, largely they are because of a few big
> files. That said, localedef itself is about 5000 lines of code,
> including about 650 lines of yacc grammar.  I'd appreciate thoughtful
> review.
> 
> Also, please pay attention to the collation code in libc, as I've made
> a major overhaul of it as well, so it supports multibyte locales as well
> as the full feature set required for POSIX collating rules.
> 
> What's missing from this localedef implementation:
> 
> 	-> support for toggling failure mode on failures
> 	   (warnings are non-fatal, errors are fatal)
> 
> 	-> support for ellipsis without a terminating character symbol
> 	   in the collating order.  (Use UNDEFINED to achieve the same
> 	   effect.)
> 
> 	-> localedef's own messages are ironically not i18n'ified.
> 	   (someone can add textdomain() and gettext() calls at the
> 	   right point later)
> 
> 	-> user defined "character classes" (for use with wctype()).
> 	   Our C library lacks the support.
> 
> 	-> only certain named multibyte encodings are supported.
> 	   UTF-8 is one of them.  Do any of the others really matter
> 	   anymore?
> 
> 	-> the EUC encodings for CJK need to be revamped slightly in
> 	   libc, as we don't support the variable data required
> 	   anymore under the main "EUC" encoding.   (We'll need to
> 	   add specific C code for each of the main EUC variants, which
> 	   is pretty darn easy.  See wide.c in localedef for sample
> 	   code.)
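
To give a feel for what "specific C code for each of the main EUC
variants" might involve, here is a small sketch for EUC-JP only.  It is
not taken from wide.c; eucjp_char_len() is a hypothetical helper that
reports how many bytes the next character occupies, based on the
well-known EUC-JP lead-byte ranges.  The trailing-byte check is
deliberately loose (it accepts 0xA1-0xFE after SS2, where only
0xA1-0xDF is assigned).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the byte length of the EUC-JP character starting at s, given
 * that n bytes are available.  Returns 0 if n is 0, and -1 if the
 * sequence is invalid or truncated.
 */
int
eucjp_char_len(const unsigned char *s, size_t n)
{
	int len, i;

	assert(s != NULL);
	if (n == 0)
		return (0);

	if (s[0] < 0x80)
		len = 1;		/* ASCII */
	else if (s[0] == 0x8E)
		len = 2;		/* SS2: half-width katakana */
	else if (s[0] == 0x8F)
		len = 3;		/* SS3: JIS X 0212 */
	else if (s[0] >= 0xA1 && s[0] <= 0xFE)
		len = 2;		/* JIS X 0208 */
	else
		return (-1);		/* not a valid lead byte */

	if ((size_t)len > n)
		return (-1);		/* truncated sequence */

	/* Every trailing byte must have the high bit range 0xA1-0xFE. */
	for (i = 1; i < len; i++) {
		if (s[i] < 0xA1 || s[i] > 0xFE)
			return (-1);
	}
	return (len);
}
```

Other EUC variants (EUC-KR, EUC-CN, EUC-TW) differ mainly in which
single-shift bytes they use and how long the shifted sequences are, so
each would need its own table of lead-byte rules.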
> 
> 
> That said, this implementation is very nice, I think.  Here are some
> things it has:
> 
> 	-> support for all of the CLDR UTF-8 localedef sources I could
> 	   find (including some really weird ones).
> 
> 	-> full POSIX collating semantics for regular and multibyte
> 	   locales.
> 
> 	-> support for substitutions up to 24 elements long.  This was
> 	   required because some Arabic locale had ridiculously long
> 	   substitution lists; I don't understand how they collate. :-)
> 
> 	-> support for up to 9 collating levels.  (The 10th level
> 	   -- or whatever is the last level -- is used internally for
> 	   the UNDEFINED semantics required by POSIX.)
> 
> 	-> the code is fully CDDL'd and should be quite readable.  If
> 	   you don't like it, blame me.
> 
> 	-> reasonably fast.  Collating UTF-8 is expensive, but in my
> 	   tests, it performed reasonably similarly to existing
> 	   locales when sorting moderate data files.
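
As background on what multiple collating levels mean in practice, here
is a simplified sketch of a multi-level comparison.  It is not the libc
code: it handles only "forward" ordering, uses two levels instead of
nine, and treats a weight of 0 as IGNORE.  struct coll_elem, NLEVELS and
multilevel_compare() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define	NLEVELS	2

/* One collating element with a weight per level; 0 means IGNORE. */
struct coll_elem {
	int weight[NLEVELS];
};

/*
 * Compare two sequences of collating elements.  The whole sequence is
 * compared at level 0 first; later levels are consulted only when all
 * earlier levels compare equal.  Returns -1, 0 or 1.
 */
int
multilevel_compare(const struct coll_elem *a, size_t na,
    const struct coll_elem *b, size_t nb)
{
	int level;

	assert(a != NULL && b != NULL);
	for (level = 0; level < NLEVELS; level++) {
		size_t ia = 0, ib = 0;

		for (;;) {
			/* Skip elements ignored at this level. */
			while (ia < na && a[ia].weight[level] == 0)
				ia++;
			while (ib < nb && b[ib].weight[level] == 0)
				ib++;
			if (ia == na || ib == nb)
				break;
			if (a[ia].weight[level] != b[ib].weight[level]) {
				return (a[ia].weight[level] <
				    b[ib].weight[level] ? -1 : 1);
			}
			ia++;
			ib++;
		}
		if (ia != na)
			return (1);	/* a has elements left over */
		if (ib != nb)
			return (-1);	/* b has elements left over */
	}
	return (0);
}
```

With level 0 carrying the base letter and level 1 carrying case, "a"
and "A" tie at level 0 and are separated only at level 1, while "ab"
versus "Aa" is decided at level 0 by the second letter, regardless of
case.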
> 
> 
> Anyway, I am hopeful that you will find the code and review useful.  I'd
> like to integrate this code in the next week or two.  The sooner I
> integrate, the sooner you can have support for your favorite
> language. :-)  
> 
> I *would* like to know about anyone who has specific needs for non-UTF-8
> locale support.  The UTF-8 locales come very very easy.  I think most of
> the others will too, but I don't want to waste time doing the work if
> nobody wants them.  (For example, I've been told that KOI8 encodings are
> totally obsolete now.)
> 
> Thanks!
> 
> 	- Garrett