Even for English, a simple sequential comparison with ANSI C strcmp(), which is based on ASCII codes, does not follow the traditional dictionary sequence, because punctuation marks and letter case are not treated properly. For example:
ASCII Sort | Dictionary Sequence |
---|---|
August | august |
Vice versa | August |
Vice-president | container |
august | coop |
co-op | co-op |
container | Vice-president |
coop | Vice versa |
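The difference between the two orderings can be reproduced with a short sketch. Python's default string comparison works on code points, which for pure ASCII strings gives the same result as a byte-wise strcmp(). The `dictionary_key` function is a hypothetical, simplified multi-level key (base letters, then case, then punctuation) written only for this illustration; it is not the ISO/IEC 14651 algorithm.

```python
def dictionary_key(word):
    """Hypothetical multi-level key: base letters, then case, then punctuation."""
    letters = [c for c in word if c.isalpha()]
    # Level A: letters only, case folded, punctuation skipped.
    base = "".join(c.lower() for c in letters)
    # Level B: case pattern; False (lower case) sorts before True (upper case).
    case = tuple(c.isupper() for c in letters)
    # Level C: positions and values of non-letters; no punctuation sorts first.
    specials = tuple((i, c) for i, c in enumerate(word) if not c.isalpha())
    return (base, case, specials)


words = ["container", "Vice versa", "coop", "august",
         "Vice-president", "co-op", "August"]

ascii_order = sorted(words)                           # code-point order, like strcmp()
dictionary_order = sorted(words, key=dictionary_key)  # dictionary-like order

print(ascii_order)
print(dictionary_order)
```

Under these assumptions the first list matches the "ASCII Sort" column and the second matches the "Dictionary Sequence" column.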
ISO/IEC 14651, International String Ordering, provides a sorting method for international strings based on ISO/IEC 10646. The method aims to achieve dictionary-style ordering for multilingual text.
As an example, for Latin scripts, an input string is decomposed into four levels of comparison:
Thai string comparisons can be rendered in three levels.
These levels correspond to levels 1, 2 and 4 of the ISO/IEC 14651 model, respectively, except that we do not encode Thai tonal marks in reverse order, as is done for Latin diacritical marks. Therefore, we can add Thai script to the ISO/IEC 14651 Common Template, and one of us has created an example of such an extension.
Within the TIS-620 character set, sorting is less complicated than under ISO/IEC 10646. We can assign the four levels of weights to each class of characters as follows:
Level | Order | Blanks | Ignored |
---|---|---|---|
1 | digits (language-insensitive), English alphabet (case-insensitive), Thai consonants, Ru (ฤ), Lu (ฦ) (as TIS-620), Nikkhahit, Thai vowels (as TIS-620) | - | punctuation marks, Yamakkan, Pinthu, Thanthakhat, Mai Taikhu, Thai tonal marks |
2 | Thai digit script language, Yamakkan, Pinthu, Thanthakhat, Mai Taikhu, Mai Ek, Mai Tho, Mai Tri, Mai Chattawa | Arabic digits, English alphabet, Thai consonants, Ru (ฤ), Lu (ฦ), Nikkhahit, Thai vowels | punctuation marks |
3 | space, non-breaking space, low line _ , hyphen - , comma , , semicolon ; , colon : , exclamation ! , question ? , solidus / , full stop . , Paiyan Noi ฯ , Mai Yamok ๆ , grave ` , circumflex ^ , tilde ~ , apostrophe ' , quotation " , left parenthesis ( , left bracket [ , left brace { , right brace } , right bracket ] , right parenthesis ) , at @ , Baht ฿ , dollar $ , Fongman ๏ , Angkhankhu ๚ , Khomut ๛ , asterisk * , back solidus \ , ampersand & , number # , percent % , plus + , less than < , equal = , greater than > , vertical line \| | digits, English alphabet, Thai consonants, Ru (ฤ), Lu (ฦ), Nikkhahit, Thai vowels, Yamakkan, Pinthu, Thanthakhat, Mai Taikhu, Thai tonal marks | - |
4 | English upper case, Thai extra (Lakkhang Yao) | digits, English lower case letters, Thai consonants, Ru (ฤ), Lu (ฦ), Nikkhahit, Thai vowels, Yamakkan, Pinthu, Thanthakhat, Mai Taikhu, Thai tonal marks | punctuation marks |
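The weight table can be turned into a sort key along the following lines. This is a minimal sketch, not the authors' implementation: it covers only a small subset of characters, it uses Unicode code points in place of TIS-620 codes (the two run in the same order for Thai letters), and it assumes that "Blanks" means a filler weight that keeps the character's position at that level while "Ignored" means no weight is emitted at all. The names `BLANK`, `IGNORED`, `char_weights` and `sort_key` are hypothetical, and any reordering of leading vowels is left out.

```python
BLANK = 0       # assumed filler weight for the "Blanks" column
IGNORED = None  # assumed marker: no weight is emitted at this level

TONES = "\u0e48\u0e49\u0e4a\u0e4b"  # Mai Ek, Mai Tho, Mai Tri, Mai Chattawa
PUNCT = " _-,;:!?/."                # subset of the level-3 row, in table order


def char_weights(ch):
    """Return the (level 1, level 2, level 3, level 4) weights of one character."""
    if ch in PUNCT:
        return (IGNORED, IGNORED, 1 + PUNCT.index(ch), IGNORED)
    if ch in TONES:
        return (IGNORED, 2 + TONES.index(ch), BLANK, BLANK)
    if "0" <= ch <= "9":              # Arabic digits: blank at level 2
        return (1 + int(ch), BLANK, BLANK, BLANK)
    if "\u0e50" <= ch <= "\u0e59":    # Thai digits: same level-1 weight, set at level 2
        return (1 + ord(ch) - 0x0E50, 1, BLANK, BLANK)
    if ch.isascii() and ch.isalpha():
        upper = 1 if ch.isupper() else BLANK
        return (100 + ord(ch.lower()) - ord("a"), BLANK, BLANK, upper)
    if ("\u0e01" <= ch <= "\u0e2e" or "\u0e30" <= ch <= "\u0e39"
            or "\u0e40" <= ch <= "\u0e44"):   # consonants and vowels, code order
        return (200 + ord(ch) - 0x0E01, BLANK, BLANK, BLANK)
    raise ValueError(f"character not covered by this sketch: {ch!r}")


def sort_key(text):
    """Build one weight sequence per level; levels are compared in order."""
    levels = ([], [], [], [])
    for ch in text:
        for level, weight in zip(levels, char_weights(ch)):
            if weight is not IGNORED:
                level.append(weight)
    return tuple(tuple(level) for level in levels)


pa = "\u0e1b\u0e32"            # ปา
pa_ek = "\u0e1b\u0e48\u0e32"   # ป่า
pa_tho = "\u0e1b\u0e49\u0e32"  # ป้า
pak = "\u0e1b\u0e32\u0e01"     # ปาก

english = sorted(["co-op", "August", "coop", "august"], key=sort_key)
thai = sorted([pak, pa_tho, pa, pa_ek], key=sort_key)
```

With these assumptions, `english` comes out as august, August, coop, co-op, and `thai` as ปา, ป่า, ป้า, ปาก: tone marks only break ties among strings whose level-1 weights are equal.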
Following this model, Theppitak Karoonboonyanan has written a C++ library named ThColl, which is available for download.