Thai-English Bilingual Sorting

ASCII Sort vs. Dictionary Sequence

Even for English, a simple sequential comparison such as ANSI C strcmp(), which compares ASCII codes, does not produce the traditional dictionary sequence, because punctuation marks and letter case are not handled properly. For example,

ASCII Sort        Dictionary Sequence
August            august
Vice versa        August
Vice-president    container
august            coop
co-op             co-op
container         Vice-president
coop              Vice versa
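
The ASCII ordering in the list above can be reproduced with a short Python sketch. Python is used here only for illustration; its default string comparison works on code points, which for pure ASCII text gives the same order as strcmp().

```python
# The seven words from the comparison above, in arbitrary input order.
words = ["august", "August", "container", "coop", "co-op",
         "Vice-president", "Vice versa"]

# sorted() compares strings code point by code point, so upper case
# sorts before lower case, and space and hyphen sort before any letter.
ascii_order = sorted(words)
print(ascii_order)
# ['August', 'Vice versa', 'Vice-president', 'august', 'co-op', 'container', 'coop']
```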

ISO/IEC 14651 Sorting Model

ISO/IEC 14651, International String Ordering, provides a sorting method for international strings based on ISO/IEC 10646. The method aims to achieve dictionary-style ordering across multilingual texts.

As an example, for Latin scripts, an input string is decomposed into four levels of comparison:

Level 1
The input string is rendered case-insensitive and diacritic-insensitive, and all special characters are removed.
Ex.
Level 2
Diacritical marks are extracted out of the input string before being compared.
Ex. where a is a symbolic value representing acute accent. Other values are g (grave accent), c (circumflex), t (tilde), u (umlaut) and r (ring). [Note that diacritic properties are encoded in reverse order to follow traditional French dictionaries.]
Level 3
Case information is extracted to be compared.
Ex. where u and l are symbolic values representing upper case and lower case properties, respectively.
Level 4
Special characters are then considered.
Ex. where d is a delimiter. The number preceding the special character represents the position of the character in the string.
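
The four levels described above can be sketched as a sort key in Python. This is only an illustration under stated assumptions, not the ISO/IEC 14651 tables: the function name latin_key is hypothetical, letter weights are simply the lower-cased characters, diacritics are taken from Unicode NFD decomposition, lower case is assumed to sort before upper case, and a string without special characters is assumed to sort before one with them.

```python
import unicodedata

def latin_key(text):
    """Four-level key (assumed weights, not the standard's tables)."""
    letters, marks, cases, specials = [], [], [], []
    position = 0  # index among non-combining characters
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Level 2: attach the diacritic to the preceding base letter.
            if marks:
                marks[-1] += ch
            continue
        if ch.isalnum():
            letters.append(ch.lower())              # Level 1: base letter
            marks.append("")                        # Level 2: no diacritic yet
            cases.append(1 if ch.isupper() else 0)  # Level 3: case
        else:
            specials.append((position, ch))         # Level 4: special + position
        position += 1
    # Diacritics are compared in reverse order, as in French dictionaries.
    return (tuple(letters), tuple(reversed(marks)),
            tuple(cases), tuple(specials))

words = ["August", "Vice versa", "Vice-president", "august",
         "co-op", "container", "coop"]
print(sorted(words, key=latin_key))
# ['august', 'August', 'container', 'coop', 'co-op', 'Vice-president', 'Vice versa']
```

Because Python compares tuples element by element, a later level is consulted only when all earlier levels are equal, which is the behaviour the level scheme calls for.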

An Extension for Thai Script

Thai string comparisons can be rendered in three levels.

Level 1
Only consonants and vowels are considered. Every leading vowel is swapped with the consonant that immediately follows it.
Ex.
Level 2
Tone and diacritic marks are then compared.
Ex. where and are symbolic values representing Mai Ek and Mai Tho, respectively. Other values are (Phinthu), (Yamakkan), (Thantakhat), (Mai Tai Khu), (Mai Tri), and (Mai Chattawa).
Level 3
Punctuation marks are then scored.
Ex. where d is a delimiter. The number preceding the special character represents the position of the character in the string. [Note: Here we make a distinction from ISO/IEC 14651 by counting back from the string tail (in other words, by keeping the complement of the position instead of the position itself), to conform to the sequences in the Random House and Webster dictionaries.]
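
The three Thai levels can be sketched in Python as follows. This is a simplified illustration, not the ThColl implementation: the names thai_key, is_consonant and is_level1 are hypothetical, the level 1 weights are assumed to be raw code points (the Unicode Thai block keeps the TIS-620 order), the mark order follows the one proposed under Thai Sorting Issues, and only Thai input is handled.

```python
# Leading vowels: Sara E, Sara Ae, Sara O, Sara Ai Maimuan, Sara Ai Maimalai.
LEADING_VOWELS = set("\u0E40\u0E41\u0E42\u0E43\u0E44")
# Level 2 marks in assumed order: Yamakkan, Pinthu, Thanthakhat,
# Mai Taikhu, Mai Ek, Mai Tho, Mai Tri, Mai Chattawa.
MARKS = "\u0E4E\u0E3A\u0E4C\u0E47\u0E48\u0E49\u0E4A\u0E4B"
MARK_WEIGHT = {ch: i + 1 for i, ch in enumerate(MARKS)}

def is_consonant(ch):
    return "\u0E01" <= ch <= "\u0E2E"

def is_level1(ch):
    # Consonants (including Ru and Lu), vowels, Lakkhang Yao, Nikhahit.
    return (is_consonant(ch) or "\u0E30" <= ch <= "\u0E39"
            or "\u0E40" <= ch <= "\u0E45" or ch == "\u0E4D")

def thai_key(text):
    letters, marks, puncts = [], [], []
    for i, ch in enumerate(text):
        if is_level1(ch):
            letters.append(ch)   # Level 1: consonants and vowels
            marks.append(0)      # Level 2: no mark on this letter yet
        elif ch in MARK_WEIGHT:
            if marks:
                marks[-1] = MARK_WEIGHT[ch]  # mark belongs to previous letter
        else:
            puncts.append((len(text) - i, ch))  # Level 3: counted from the tail
    # Swap each leading vowel with the consonant that follows it.
    i = 0
    while i < len(letters) - 1:
        if letters[i] in LEADING_VOWELS and is_consonant(letters[i + 1]):
            letters[i], letters[i + 1] = letters[i + 1], letters[i]
            marks[i], marks[i + 1] = marks[i + 1], marks[i]
            i += 2
        else:
            i += 1
    return (tuple(letters), tuple(marks), tuple(puncts))

words = ["แมว", "มา", "ไก่", "กา"]
print(sorted(words, key=thai_key))
# ['กา', 'ไก่', 'มา', 'แมว']
```

Sorting the same list by code point alone would place the words that start with a leading vowel last, which is why the swap in level 1 matters.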

These levels correspond to levels 1, 2 and 4 of the ISO/IEC 14651 model, respectively, except that we do not encode Thai tonal marks in reverse order, as is done for Latin diacritical marks. Therefore, Thai script can be added to the ISO/IEC 14651 Common Template, and one of us has created an example of such an extension.

Thai Sorting Issues

Paiyan Noi and Mai Yamok
Paiyan Noi (ฯ) is an abbreviation sign, representing omitted text. It functions like an ellipsis, although the closer Thai equivalent of the ellipsis is Paiyan Yai (ฯลฯ). Therefore, it makes sense to order it after the ellipsis category in the Common Template, after the Devanagari abbreviation sign <U0970>.
Mai Yamok (ๆ) is a duplication sign, representing a repeat of the preceding word. There is no appropriate category for it in the Common Template, so it deserves a new category within the general punctuation portion.
Nikhahit
According to traditional Thai principle, Nikhahit (-ํ) is the name of a vowel symbol. Combined with Pin I (-ิ) it forms Sara Ue (-ึ), and combined with Lakkhang (า) it forms Sara Am (-ำ).
However, in Thai documents this function is not used, because TIS-620 defines codes for the combined vowels and they are entered with a single keystroke. Instead, Nikhahit is normally used as a consonant inherited from Pali and Sanskrit. For example, following Pali, is pronounced "Bhut-Dhang"; and following Sanskrit, is pronounced "Chum-Num". Therefore, Nikhahit arguably deserves to be classified as a consonant, ordered after the last Thai consonant (ฮ).
Yamakkan, Pinthu, Thanthakhat, Mai Taikhu and Tone Marks
Yamakkan (-๎) is an ancient mark used to indicate a consonant cluster, such as . Pinthu (-ฺ) has two functions: one is the same as that of Yamakkan, as in the Sanskrit word ; the other is to mark a final consonant in the Pali writing system, such as .
Thanthakhat (-์) is used to silence letters, such as , in which the consonant is killed (not pronounced); , in which the three consonants are killed; and , in which the two letters - are killed.
These three marks can be considered to change a word less than tonal marks do. For example, the meanings of and do not differ, while those of and do. Therefore, they deserve to be ordered before the tonal marks.
Pinthu may come between Yamakkan and Thanthakhat, because another function of Thanthakhat is to mark a final consonant in the ancient writing of Pali words. In summary, the order should be Yamakkan, Pinthu, Thanthakhat, Mai Taikhu, Mai Ek, Mai Tho, Mai Tri, and Mai Chattawa.
Lakkhang Yao and Sara Aa
Lakkhang Yao (ๅ) is always written after Ru (ฤ) or Lu (ฦ), to produce the new symbols Rue (ฤๅ) or Lue (ฦๅ), respectively. Traditional Thai principle treats each of these two new symbols as an atomic entity. Therefore, in general, Lakkhang Yao is never written after any letter other than Ru (ฤ) and Lu (ฦ). Meanwhile, the two symbols are never followed by a vowel. Hence, the occurrences of Lakkhang Yao (ๅ) and Sara Aa (า, a vowel) are mutually exclusive, and assigning an order between them seems unnecessary. However, because of their similar shapes, they are occasionally (and understandably) confused. So they should be treated as weakly identical; that is, their weights should differ only at the last level of comparison.
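
A minimal Python sketch of this "weakly identical" treatment is shown below. The function name weak_identity_key is hypothetical, and placing Sara Aa before Lakkhang Yao in the tie-break is an assumption made only for the illustration; the text does not say which of the two should come first.

```python
LAKKHANG_YAO = "\u0E45"
SARA_AA = "\u0E32"

def weak_identity_key(text):
    # Primary part: Lakkhang Yao is folded onto Sara Aa, so the two
    # spellings compare as equal at every level but the last.
    primary = text.replace(LAKKHANG_YAO, SARA_AA)
    # Last level: 0 for every ordinary character, 1 for Lakkhang Yao
    # (assumed order: Sara Aa first).
    tie_break = tuple(1 if ch == LAKKHANG_YAO else 0 for ch in text)
    return (primary, tie_break)

with_sara_aa = "\u0E24\u0E32"       # Ru + Sara Aa
with_lakkhang_yao = "\u0E24\u0E45"  # Ru + Lakkhang Yao
print(weak_identity_key(with_sara_aa)[0] == weak_identity_key(with_lakkhang_yao)[0])
# True
```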
Fongman, Angkhankhu, and Khomut
Fongman (๏) is used in ancient books to mark the beginning of a paragraph, a sentence, or a stanza of a poem. Angkhandeaw () is used to end a sentence or a stanza. Angkhankhu (๚) is used to end a chapter or episode. Khomut (๛) ends a story.
These marks can be classified as typographic symbols in the Common Template, where Fongman functions like a bullet. Unfortunately, there is no specific code for Angkhandeaw (), and Paiyan Noi is always used instead.

Sorting for TIS-620

Within the TIS-620 character set, sorting is less complicated than for ISO/IEC 10646. We can assign the four levels of weights to each class of character in this manner:

Level 1
Order: digits (language-insensitive), English alphabet (case-insensitive), Thai consonants, Ru (ฤ), Lu (ฦ) (as TIS-620), Nikhahit, Thai vowels (as TIS-620)
Blanks: -
Ignored: punctuation marks, Yamakkan, Pinthu, Thanthakhat, Mai Taikhu, Thai tone marks
Level 2
Order: Thai digit script language, Yamakkan, Pinthu, Thanthakhat, Mai Taikhu, Mai Ek, Mai Tho, Mai Tri, Mai Chattawa
Blanks: Arabic digits, English alphabet, Thai consonants, Ru (ฤ), Lu (ฦ), Nikhahit, Thai vowels
Ignored: punctuation marks
Level 3
Order:
  space
  non-breaking space
  low line _
  hyphen -
  comma ,
  semicolon ;
  colon :
  exclamation !
  question ?
  solidus /
  full stop .
  Paiyan Noi ฯ
  Mai Yamok ๆ
  grave `
  circumflex ^
  tilde ~
  apostrophe '
  quotation "
  left parenthesis (
  left bracket [
  left brace {
  right brace }
  right bracket ]
  right parenthesis )
  at @
  Baht ฿
  dollar $
  Fongman ๏
  Angkhankhu ๚
  Khomut ๛
  asterisk *
  back solidus \
  ampersand &
  number #
  percent %
  plus +
  less than <
  equal =
  greater than >
  vertical line |
Blanks: digits, English alphabet, Thai consonants, Ru (ฤ), Lu (ฦ), Nikhahit, Thai vowels, Yamakkan, Pinthu, Thanthakhat, Mai Taikhu, Thai tone marks
Ignored: -
Level 4
Order: English upper case, Thai extra (Lakkhang Yao)
Blanks: digits, English lower case letters, Thai consonants, Ru (ฤ), Lu (ฦ), Nikhahit, Thai vowels, Yamakkan, Pinthu, Thanthakhat, Mai Taikhu, Thai tone marks
Ignored: punctuation marks

Following this model, Theppitak Karoonboonyanan has written a C++ library named ThColl, which is available for download.

References

[1] ISO/IEC 14651, International String Ordering (Draft).
[2] Trin Tantsetthi, Thai Locale Documentation.
[3] Theppitak Karoonboonyanan, Thai Sorting Algorithms.
[4] Pruet Boonma, Thai sorting support for free database server.
[5] The Royal Institute, The Principle of Punctuation Marks and Other Symbols, The Principle of Spacing, and The Principle of Abbreviation Writing, 5th printing, Bangkok, 1990. (in Thai)
[6] Thai Developer Network Mail Archive.