Saturday, January 5, 2013

Secrets of Unicode Sorting


Recently a poster in the Apple Support Community pointed out some odd OS X sorting behavior for accented Latin characters, at least when "Standard" sorting is chosen in System Preferences > Language & Text > Language. Indeed, the results seem counter-intuitive, with characters sorting differently depending on what characters follow them. Here is an example:

Single Character:      A  Á  À  Â  Å  Ä  Ã  Æ
Two Character String:  Ãb Äc Åd Âe Æa Àf Ág Ah

However strange it may appear, this is the correct result of the default Unicode sorting algorithm. In that system every character is assigned up to 4 levels of "weights" and a particular formula is used to create sorting "keys" for character strings. Certain groups of characters, such as A and its accented forms, are considered essentially the same at the first level and differ only at the second level, where the accents are weighed. Because the first-level weights of the entire string are compared before any second-level weights, the accents decide the order only when the strings are otherwise identical at the first level. As soon as a different letter follows, that second character determines the order and the accent is never consulted. To make sorting conform to the expectations of users of particular languages, additional "tailoring" rules need to be set up to override the results of the Unicode default.
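The level-by-level comparison can be sketched in a few lines of Python. This is only a toy model, not the real Unicode Collation Algorithm: the function name `toy_sort_key`, the accent ranks in `ACCENT_RANK`, and the treatment of Æ as the two letters A + E in `EXPANSIONS` are all assumptions, chosen to reproduce the order shown in the example above rather than taken from the actual Unicode weight tables.

```python
import unicodedata

# Hypothetical second-level ranks, picked only to match the accent order
# seen in the example above. These are NOT the real Unicode weights.
ACCENT_RANK = {
    "\u0301": 1,  # acute
    "\u0300": 2,  # grave
    "\u0302": 3,  # circumflex
    "\u030A": 4,  # ring above
    "\u0308": 5,  # diaeresis
    "\u0303": 6,  # tilde
}

# Assumption: the ligature behaves like the letters A + E at the first level.
EXPANSIONS = {"Æ": "AE", "æ": "ae"}


def toy_sort_key(text):
    """Return (first-level weights, second-level weights) for a string."""
    primary = []    # base letters only, case ignored
    secondary = []  # one accent rank per base letter, 0 = no accent
    for ch in text:
        ch = EXPANSIONS.get(ch, ch)
        # NFD splits an accented letter into base letter + combining mark.
        for unit in unicodedata.normalize("NFD", ch):
            if unicodedata.combining(unit) and secondary:
                # The accent belongs to the preceding base letter.
                secondary[-1] = ACCENT_RANK.get(unit, 99)
            else:
                primary.append(unit.casefold())
                secondary.append(0)
    # Tuples compare left to right, so every first-level weight in the
    # string is examined before any second-level weight.
    return (primary, secondary)


singles = ["Æ", "Ã", "Ä", "Å", "Â", "À", "Á", "A"]
pairs = ["Ah", "Ág", "Àf", "Æa", "Âe", "Åd", "Äc", "Ãb"]

print(" ".join(sorted(singles, key=toy_sort_key)))  # A Á À Â Å Ä Ã Æ
print(" ".join(sorted(pairs, key=toy_sort_key)))    # Ãb Äc Åd Âe Æa Àf Ág Ah
```

In the single-character list every first-level weight is the same letter "a" (apart from Æ), so the accent ranks decide the order. In the two-character list the second letters already differ at the first level, so the accents never come into play.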

Readers wishing to explore further should see:
