[illumos-Developer] webrev: POSIX style localedef & multibyte encoding support
Owen Shepherd
owen.shepherd at e43.eu
Sun Oct 3 09:04:28 PDT 2010
On 3 Oct 2010, at 16:36, Garrett D'Amore wrote:
> On Sun, 2010-10-03 at 13:33 +0100, Owen Shepherd wrote:
>> On 3 Oct 2010, at 10:08, Garrett D'Amore wrote:
>>
>>>
>>> 1) I have added locale data for the English locales that were missing,
>>> and eliminated mklocale and friends. (I removed US-ASCII without
>>> replacing it. If you want 7-bit support, use the POSIX or C locale.
>>> Otherwise, you really want ISO-8859-1 or -15.)
>>>
>>
>> Is the US-ASCII locale likely to be stored in any data files? Should the system perhaps automatically migrate to the POSIX or C locale if it is requested?
>>
>
> Very unlikely. I don't think Solaris ever had an explicit US-ASCII
> locale, and I don't know of any situations where the locale name would
> be encoded. Note that pretty much all of the locales in use on POSIX
> systems use ASCII for the low 7 bits.
>
>> (I admit my POSIX/C locale knowledge is lacking)
>
> No worries. POSIX/C is just ASCII. The only thing is that this locale
> has no specific currency symbols, since it is not tied to any
> nationality.
I was referring to the POSIX/C locale system there. Sorry for the confusion :)
>>
>>> Once I integrate this change, it will be a fairly trivial matter to add
>>> support for pretty much any locale you like. :-) Any of about 372 UTF-8
>>> locales are easy. Any 8859 or KOI8 locale is easy. Other encodings are
>>> easy *if* I can get a character map for them. GB18030 is probably the
>>> most painful of those.
>>
>> Would the ICU project's character maps work? They have one for GB18030, and it's a pretty important locale from a product point of view, since products sold in China are required to support it. (ICU is released by IBM under a BSD-like license, IIRC.)
>
> Probably. I just need to get a copy of the character map. If it is not
> in POSIX form, then a shell script or perl script could probably convert
> it into the proper form. (Basically, I need a map from GB18030 to
> Unicode.)
It's an XML file. A rather large one... because GB18030 is a mess. A well-intentioned mess, but still a mess. It can be found here:
http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
It's in the standard Unicode character map format.
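
For what it's worth, here is a minimal sketch of the kind of conversion discussed above, in Python rather than shell or perl. It rests on assumptions not confirmed in this thread: that the XML file lists individual mappings as <a u="..." b="..."/> elements (hex code point in "u", space-separated hex bytes in "b"), and that the target is a POSIX charmap using <Uxxxx> symbolic names. The function name ucm_xml_to_charmap is made up for illustration. It skips fallback mappings, multi-code-point entries, and any range-style elements, so for GB18030 it would only cover part of the table.

```python
import xml.etree.ElementTree as ET


def ucm_xml_to_charmap(xml_text, code_set_name="GB18030"):
    """Turn <a u="..." b="..."/> assignments into POSIX charmap text.

    Hypothetical helper: the element and attribute names are assumed,
    not taken from the file linked above.
    """
    root = ET.fromstring(xml_text)
    entries = []
    for elem in root.iter():
        # Compare the local tag name so an XML namespace would not matter.
        if elem.tag.rsplit("}", 1)[-1] != "a":
            continue
        code_points = (elem.get("u") or "").split()
        byte_tokens = (elem.get("b") or "").split()
        # Skip entries that map a sequence of code points or have no bytes.
        if len(code_points) != 1 or not byte_tokens:
            continue
        entries.append(
            (int(code_points[0], 16), [int(t, 16) for t in byte_tokens])
        )
    entries.sort()

    lengths = [len(b) for _, b in entries] or [1]
    lines = [
        f"<code_set_name> {code_set_name}",
        f"<mb_cur_min> {min(lengths)}",
        f"<mb_cur_max> {max(lengths)}",
        "CHARMAP",
    ]
    for cp, byte_values in entries:
        # Four hex digits inside the BMP, eight digits above it.
        name = f"<U{cp:04X}>" if cp <= 0xFFFF else f"<U{cp:08X}>"
        encoding = "".join(f"\\x{b:02X}" for b in byte_values)
        lines.append(f"{name} {encoding}")
    lines.append("END CHARMAP")
    return "\n".join(lines) + "\n"
```

Reading the downloaded file as bytes and passing the result to ucm_xml_to_charmap() should yield text that can be saved and handed to localedef as a charmap, assuming the mappings not handled here are added separately.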
- Owen