[illumos-Advocates] RTI 992 towlower/towupper are broken

Garrett D'Amore garrett at damore.org
Tue May 17 10:33:40 PDT 2011


On May 17, 2011, at 10:07 AM, Gordon Ross <gordon.w.ross at gmail.com> wrote:

> My main concern here is that we don't really know where we stand w.r.t.
> comparison with what other systems do, or standards compliance.
> If "it's a little better" is sufficient for you, then I'll abstain.

A *lot* better is more like it.

> 
> Personally, I'd prefer to see some comparison with, say Apple's
> toupper in a UTF-8 locale.  (All locales are UTF-8 on OSX, right?)
> And a comparison with the full map of UTF-8 upper lower pairs
> (which is a subset of the published case folding data) would be
> interesting.  I suspect we're quite close to having all of them with
> Yuri's proposed changes.  And perhaps a standards reference
> to help us understand what POSIX calls "correct" here.

POSIX actually doesn't say anything about the details of character sets other than the "Portable Character Set", which is basically ASCII.  It does, though, define the interfaces used to supply data for other character sets, and the programming interfaces used to access that data.

> 
> It's hard for me to imagine that any of those requests would be
> hard to accomplish.  But if I'm the only one who cares about
> having a complete fix here, then go ahead without me.

You have to choose a vendor, and implement a test against that vendor.  I don't know if Yuri has a Mac; I don't.  I could run a test on Ubuntu for sure, but Linux is hardly the universal standard for compliance. :-)

I'd settle for a promise to run these tests and follow up with bug fix(es) where needed.
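
A rough sketch of what such a test could look like, for whoever picks this up.  It walks a range of code points and reports every one where towupper/towlower is not the identity, so the output from two systems (or two locales) can be diffed.  The names scan_case_map and report_locale are made up for illustration, and it assumes wchar_t values are Unicode code points in the locale under test, which is not guaranteed everywhere.

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not an existing illumos function): count the
 * code points in [first, last] for which map() returns something other
 * than its argument.  If out is non-NULL, print each such mapping too.
 * Assumption: wchar_t values are Unicode code points in this locale.
 */
unsigned long
scan_case_map(wint_t (*map)(wint_t), const char *name,
    unsigned long first, unsigned long last, FILE *out)
{
	unsigned long cp, count = 0;

	for (cp = first; cp <= last; cp++) {
		unsigned long mapped;

		/* Surrogates are not valid characters in UTF-8. */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			continue;
		mapped = (unsigned long)map((wint_t)cp);
		if (mapped != cp) {
			count++;
			if (out != NULL)
				(void) fprintf(out, "%s U+%04lX -> U+%04lX\n",
				    name, cp, mapped);
		}
	}
	return (count);
}

/*
 * Hypothetical driver: switch LC_CTYPE to the named locale and dump
 * every non-identity mapping in the whole Unicode range.
 * Returns -1 if the locale cannot be set, 0 otherwise.
 */
int
report_locale(const char *locale, FILE *out)
{
	unsigned long up, low;

	if (setlocale(LC_CTYPE, locale) == NULL)
		return (-1);
	up = scan_case_map(towupper, "towupper", 0, 0x10FFFF, out);
	low = scan_case_map(towlower, "towlower", 0, 0x10FFFF, out);
	(void) fprintf(out, "%s: %lu upper, %lu lower mappings\n",
	    locale, up, low);
	return (0);
}
```

Calling report_locale("en_US.UTF-8", stdout) and report_locale("ru_RU.UTF-8", stdout) from a small main() and diffing the two outputs would show whether the locales agree; running the same program on another vendor's system would give the cross-vendor comparison.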

> 
> BTW, the localedef standard defines a "copy" operation so that,
> in theory all the *.UTF-8 locales could "copy" the ctype data from
> some other locale, such as en_US.UTF-8.  If we determine that
> these really should be the same for all UTF-8 locales, then that
> would probably be a reasonable way to accomplish it.

Yes, and localedef supports that function.  However, it turns out not to be that beneficial, and the data files - while large - are only intermediates, so the assumption here is that there is little enough reason to do it.  But we *could*, certainly.  (I had already contemplated this approach, actually.)

	- Garrett

> 
> Gordon
> 
> On Tue, May 17, 2011 at 12:45 PM, Garrett D'Amore <garrett at damore.org> wrote:
>> I would prefer to just continue forward with all of your changes.
>> 
>> I believe that there is no key requirement that the u8_* functions match what we have here, and I recognize that the requirements for case folding are different from those for case conversion.  Furthermore, having case conversion functions for character sets for which we have no data seems wrong.
>> 
>> That said, one possible way to test this is to write a test program which iterates over all of the UTF-8 space, identifies the cases where these functions have a non-identity mapping, and then checks those against the u8_* functions.  But I really think Gordon is being unduly cautious.
> 
>> Fundamentally, we need to support the POSIX standards for localedef.  This code does that.  I have no interest in special-casing the character set mappings in order to artificially share some code.  Such sharing would break our ability to support correct POSIX localedef, and specifically would not support non-UTF-8 data.
> 
>> Gordon, what do you think, can we just let Yuri move ahead?  Certainly nobody could say his changes do anything except *improve* the current situation.
> 
>>  -- Garrett D'Amore
>> 
>> On May 17, 2011, at 8:59 AM, Yuri Pankov <yuri.pankov at gmail.com> wrote:
>> 
>>> On Mon, May 16, 2011 at 04:09:47PM -0400, Gordon Ross wrote:
>>>> On Wed, May 11, 2011 at 7:34 PM, Yuri Pankov <yuri.pankov at gmail.com> wrote:
>>>>> Hi,
>>>>> 
>>>>> illumos:yuri:~/ws/992-localedef$ hg outgoing -v ssh://anonhg@hg.illumos.org/illumos-gate
>>>>> running ssh anonhg@hg.illumos.org "hg -R illumos-gate serve --stdio"
>>>>> remote: Not trusting file /export/illumos/hgrepos/illumos-gate/.hg/hgrc from untrusted user hg, group hg
>>>>> comparing with ssh://anonhg@hg.illumos.org/illumos-gate
>>>>> searching for changes
>>>>> 
>>>>> changeset:   13369:b913fe55a4c0
>>>>> tag:         tip
>>>>> user:        Yuri Pankov <yuri.pankov at gmail.com>
>>>>> date:        Thu May 12 03:21:34 2011 +0400
>>>>> 
>>>>> description:
>>>>>        992 towlower/towupper are broken
>>>>>        Reviewed by: Garrett D'Amore <garrett at damore.org>
>>>>> 
>>>>> modified:
>>>>>   usr/src/cmd/localedef/Makefile
>>>>>   usr/src/cmd/localedef/ctype.c
>>>>> added:
>>>>>   usr/src/cmd/localedef/data/ctype.sh
>>>>> 
>>>>> remote: Not trusting file /export/illumos/hgrepos/illumos-gate/.hg/hgrc from untrusted user hg, group hg
>>>>> 
>>>>> 
>>>>> Tested by using towlower/towupper functions for latin, cyrillic and
>>>>> greek characters in en_US.UTF-8 and ru_RU.UTF-8 locales - results are
>>>>> the same in both.
>>>> [...]
>>>> 
>>>> Hi Yuri,
>>>> 
>>>> Are you still working on this?
>>>> 
>>>> I'd like to see an answer to the functionality questions about this
>>>> before we integrate.  (How do we know if the fix is complete?)
>>>> 
>>>> I suggested one way you could verify your fix.  I'm sure you could
>>>> find many other ways as well.  Please choose a test method and
>>>> use it to demonstrate that your fix is complete.
>>> 
>>> Ok, let's make this just a fix for __maplower_ext and exclude the other
>>> changes. I can't comment on the best way to provide common ctype data,
>>> and even less on the u8_* functions, which seem to me to be private (as
>>> well as non-standard). I just thought that getting ctype data from
>>> locales we actually support seemed reasonable, though it is probably
>>> incorrect. I guess we should continue discussing the best way to do
>>> this in the thread Garrett started.
>>> 
>>> 
>>> Yuri
>>> 
>> 


