[illumos-Developer] webrev: POSIX style localedef & multibyte encoding support

Sun Oct 3 09:04:28 PDT 2010

On 3 Oct 2010, at 16:36, Garrett D'Amore wrote:

> On Sun, 2010-10-03 at 13:33 +0100, Owen Shepherd wrote:
>> On 3 Oct 2010, at 10:08, Garrett D'Amore wrote:
>> 
>>> 
>>> 1) I have added locale data for the English locales that were missing,
>>> and eliminated mklocale and friends.  (I removed US-ASCII without
>>> replacing it.  If you want 7-bit support, use the POSIX or C locale.
>>> Otherwise, you really want ISO-8859-1 or -15.) 
>>> 
>> 
>> Is the US-ASCII locale likely to be stored in any data files? Should the system perhaps automatically migrate to the POSIX or C locale if it is requested?
>> 
> 
> Very unlikely.  I don't think Solaris ever had an explicit US-ASCII
> locale, and I don't know of any situations where the locale name would
> be encoded.  Note that pretty much all of the locales in use on POSIX
> systems use ASCII for the low 7 bits.
> 
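
The point about the low 7 bits can be checked with a short sketch. This is only an illustration using Python's bundled codecs; the list of encodings and the helper name ascii_compatible are assumptions chosen for the example, not anything taken from the webrev.

```python
# Check which encodings decode bytes 0x00-0x7F exactly as ASCII does.
# The encoding list below is an illustrative assumption, not from the webrev.
ENCODINGS = ["iso8859-1", "iso8859-15", "koi8-r", "utf-8", "gb18030"]


def ascii_compatible(encoding):
    """Return True if `encoding` agrees with ASCII on every 7-bit byte."""
    ascii_bytes = bytes(range(0x80))
    try:
        return ascii_bytes.decode(encoding) == ascii_bytes.decode("ascii")
    except UnicodeDecodeError:
        return False


results = {enc: ascii_compatible(enc) for enc in ENCODINGS}
for enc, ok in results.items():
    print(enc, "ASCII-compatible" if ok else "differs from ASCII")
```

An EBCDIC code page such as cp037 fails the same check, which is why the claim is "pretty much all" rather than "all".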
>> (I admit my POSIX/C locale knowledge is lacking)
> 
> No worries.  POSIX/C is just ASCII.  The only thing is that this locale
> has no specific currency symbols, since it is not tied to any
> nationality.

I was referring to the POSIX/C locale system there. Sorry for the confusion :)

>> 
>>> Once I integrate this change, it will be a fairly trivial matter to add
>>> support for pretty much any locale you like. :-)  Any of about 372 UTF-8
>>> locales are easy.  Any 8859 or KOI8 locale is easy.  Other encodings are
>>> easy *if* I can get a character map for them.  GB18030 is probably the
>>> most painful of those.
>> 
>> Would the ICU project's character maps work? They have one for GB18030, and it's a pretty important locale from a product point of view, since products sold in China are required to support it. (ICU is released by IBM under a BSD-like license, IIRC.)
> 
> Probably.  I just need to get a copy of the character map.  If it is not
> in POSIX form, then a shell or Perl script could probably convert
> it into the proper form.  (Basically, I need a map from GB18030 to
> Unicode.)

It's an XML file, and a rather large one, because GB18030 is a mess. A well-intentioned mess, but still a mess. It can be found here:
	http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml

It's in the standard Unicode character map format.
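
A conversion along the lines Garrett describes might look roughly like the sketch below. It is a minimal illustration, not a tested converter for the real file. It assumes the XML uses `<a b="..." u="..."/>` assignment elements (space-separated hex bytes in `b`, a single hex code point in `u`) with no XML namespace, and it ignores `<range>` and fallback elements entirely. The inline SAMPLE document and the function name to_posix_charmap are made up for the example.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the assumed layout; NOT taken from the real file.
SAMPLE = """<characterMapping id="sample" version="1">
  <assignments sub="1A">
    <a b="41" u="0041"/>
    <a b="81 40" u="4E02"/>
    <a b="81 30 81 30" u="0080"/>
  </assignments>
</characterMapping>"""


def to_posix_charmap(xml_text, code_set_name):
    """Turn <a b= u=> assignments into POSIX charmap source text."""
    root = ET.fromstring(xml_text)
    entries = []
    for a in root.iter("a"):
        code_point = int(a.get("u"), 16)  # assumes one code point per entry
        raw = bytes(int(b, 16) for b in a.get("b").split())
        entries.append((code_point, raw))
    entries.sort()

    # Longest byte sequence seen determines <mb_cur_max>.
    mb_cur_max = max((len(raw) for _, raw in entries), default=1)
    lines = [
        f"<code_set_name> {code_set_name}",
        f"<mb_cur_max> {mb_cur_max}",
        "CHARMAP",
    ]
    for code_point, raw in entries:
        # <UXXXX> for the BMP, <UXXXXXXXX> for code points beyond it.
        if code_point <= 0xFFFF:
            name = f"<U{code_point:04X}>"
        else:
            name = f"<U{code_point:08X}>"
        encoded = "".join(f"\\x{b:02X}" for b in raw)
        lines.append(f"{name} {encoded}")
    lines.append("END CHARMAP")
    return "\n".join(lines) + "\n"


charmap = to_posix_charmap(SAMPLE, "GB18030")
print(charmap)
```

The real file would also need its range mappings expanded, which is where most of the GB18030 pain lives; the sketch deliberately leaves that out.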

	- Owen