[illumos-Developer] webrev: strxfrm improvements

Sun Oct 10 00:25:11 PDT 2010

I've made some improvements to the locale code.  Mostly these centered
around strxfrm, making it generate smaller sort keys (which makes using
the keys go faster.)

If you're able, please review.

The webrev is here, but you might want to read the text below if you
decide to review:

http://mexico.purplecow.org/gdamore/webrev/strxfrm/

The performance benefits of this change are somewhat dependent on usage.

For sorting ordinary ASCII text in UTF-8 locales, generating the
collation key for strxfrm will take slightly longer.  (Not much, maybe 1
or 2 percent.)  However the generated sort keys will be *significantly*
smaller, giving better memory utilization and shorter worst-case strcmp
times. (Typically the strings will be less than half what they were
before.)  Because the leading zeros that were used previously are
generally removed, strcmp will also generally find a difference in the
strings sooner -- usually on the first or second byte compared instead
of the 4th or 5th.  These benefits make me believe that incurring some
extra overhead in strxfrm() is worthwhile.  (Moving cost from an
O(n*log(n)) -- or worse -- component to an O(n) component seems like a
generally worthy change.)

For sorting "interesting" text, including text with ISO 8859 accented
characters, or other non-ASCII symbols, this new code will generally
perform better in strxfrm, *as well as* anything else that collates
these strings.  (strcoll, wcsxfrm, wcscoll).  My guess is that on
"interesting" text you'll see about a 10% performance improvement.
That's about what I've seen so far.  Actual numbers will depend on the
locale you sort in, as well as the content that you are sorting.

What I've done is make localedef *compress* the priorities, and use a
unique number space for each sort level, so that for a given sort
priority, we wind up having smaller priority values.  (Generally, the
primary, or first, sort level has many keys, but subsequent sort levels
have far fewer.  But even those with a large number typically have fewer
than a few thousand.)

Then in libc's strxfrm implementation we are able to look at that, and
only generate enough bytes for that sort level as are required to cover
the maximum possible priority.  This means that usually the the
secondary and following sort levels only require a single byte.

I've also optimized the "catch all" level so that it only considers the
amount of bits needed to uniquely collate characters for a given
encoding.  Unicode, for example, only uses 21 bits out of a wide
character.)  Then we use this to apply the same logic to reduce the
length of the sort key that we did for the other sort levels.

To make all this work right, localedef has to do some extra work,
including an extra pass over the collation data to determine which
priorities are actually used.

The other, and possibly more significant, improvement I have done is to
make the substitution key logic smarter.  Its now a direct index into
the table (so O(1) lookup instead of O(log(n)), and multiple instances
of the same sequence (which are common in some files) are reduced to a
single common entry.  This reduces the size of the collation table data
files significantly... by up to 30% in some extreme cases (Chinese).

As both of these changes result in a change to the LC_COLLATE data file
format, I've changed the "version" string.

	- Garrett