Thai Sorting Algorithms


The Problems

The C standard library function strcmp(), based on sequential comparisons between the ASCII codes of the corresponding characters of the two argument strings, works well with English entries, but not with Thai:

	int strcmp(const char* s1, const char* s2)
	{
	    while (*s1 == *s2 && *s1 != 0) { s1++; s2++ }
	    return *s1 - *s2;
	}

Because of Thai orthography, there are two major problems to be concerned in sorting Thai words:

  1. The leading vowels, although written first, are less significant than the initial consonant of the syllable.
  2. The tonal marks, Mai Tai Khoo, and Thantakhat should be ignored unless all other parts are all equal.

A Solution

Samphan Raruenrom's solution [1] is:

	/* default char type should be unsigned */
	#define isldvowel(c)  (0xE0<=(c) && (c)<=0xE4)
	#define isstone(c)    (0xE6<=(c) && (c)<=0xEC)

	int tstrcmp(const char *s1, const char *s2)
	{
	  char c1, c2;
	  int  d = 0;

	  for (;;) {
	    while (*s1 == *s2 && *s1 != 0) { s1++; s2++; }
	    c1 = *s1; c2 = *s2;
	    if (isstone(c1) && isstone(c2)) {
	      s1++; s2++; if (d == 0) d = c1 - c2;
	    } else if (isstone(c1)) {
	      s1++; if (d == 0) d = 1;
	    } else if (isstone(c2)) {
	      s2++; if (d == 0) d = -1;
	    } else break;
	  }

	  if (c1 == 0 && c2 == 0) return d;
	  if (c1 == 0 || c2 == 0) return c1 - c2;

	  if (isldvowel(c1) && isldvowel(c2)) {
	    return s1[1] != s2[1] ? s1[1] - s2[1] : c1 - c2;
	  } else if (isldvowel(c1)) {
	    return s1[1] != c2 ? s1[1] - c2 : 1;
	  } else if (isldvowel(c2)) {
	    return c1 != s2[1] ? c1 - s2[1] : -1;
	  } else return c1 - c2;
	}

The first for loop compares consequence characters, skipping tonal marks, until different effective characters (i.e. consonants or vowels) are found, or until either or both strings are exhausted. Meanwhile, first-appearing difference in tonal marks is also kept for later use in case all other parts are equal.

The next if statement returns the difference in tonal marks when all other parts appears equal.

The next if statement detects whether a string is a substring, if tonal marks are ignored, of the other and returns the difference if it is the case.

The last if blocks compares the difference for most general cases. The following initial consonant will be picked to compare first if the character in question is a leading vowel.

A Broader Problem

In a more general case, punctuation marks are allowed to appear. A number of Thai punctuation marks could appear in dictioanary entries or telephone directories:

To sort Thai words with punctuation marks, the marks must be given lower precedence than ordinary letters. Without punctuation marks concern, the basic algorithm divides Thai letters into two classes of precedence. The first priority letters are the consonants and vowels. The second are the tonal marks, Mai Tai Khoo, and Thantakhat. With punctuation marks, there must be more than two.

It is obvious that punctuation marks must be the third class. Nevertheless, we may refine the classification among the puctuation marks, according to their roles. Some marks represent the omitted syllables, such as Pai Yan Noi, Mai Yamok, and period, while the others are mere delimiters. Therefore, we divide Thai characters into four classes of sorting precedence:

  1. Consonants and Vowels
  2. Tonal Marks, Mai Tai Khoo, and Thantakhat
  3. Pai Yan Noi, Mai Yamok, and Period
  4. Space, Hyphen, Comma, Semicolon, Colon, Slash, Exclamation Mark, Question Mark, Quote, and Single Quote

A Generic Solution

Theppitak Karoonboonyanan [2] extends Samphan's algorithm as follows:

	#define MAX_LEVELS  8  /* maximum levels allowed */

	#define isldvowel(c)  (0xE0<=(c) && (c)<=0xE4)

	typedef unsigned char tchar;  /* Thai character type */

	/* To be parameterized ... */
	extern int level(tchar c);
	extern const int TotalLevels;

	int LeveledTStrCmp(const tchar* s1, const tchar* s2)
	{
	    tchar c1, c2;
	    int d[MAX_LEVELS];
	    int i;
	    int level1, level2;

	    /* initialize differences to zero */
	    for (i = 0; i < TotalLevels; i++)  d[i] = 0;

	    /* compare, keeping diff of each class individually */
	    for (;;) {
	        while (*s1 == *s2 && *s1 != 0) { s1++; s2++; }
	        c1 = *s1; c2 = *s2;
	        level1 = level(c1); level2 = level(c2);

	        if (level1 == level2) {
	            if (level1 == 0) break;
	            s1++; s2++; if (d[level1] == 0) d[level1] = c1 - c2;
	        } else if (level1 > level2) {
	            s1++; if (d[level1] == 0) d[level1] = 1;
	        } else {
	            s2++; if (d[level2] == 0) d[level2] = -1;
	        }
	    }

	    /* primary comparison yields zero? */
	    if (c1 == 0 && c2 == 0) {
	        for (i = 1; i < TotalLevels; i++) {
	            if (d[i] != 0) return d[i];
	        }
	        return 0;
	    }

	    /* shorter word? */
	    if (c1 == 0 || c2 == 0) return c1 - c2;

	    /* compare at primary different point, */
	    /* with leading vowel check */
	    if (isldvowel(c1)) {
	        if (isldvowel(c2) {
	            return (s1[1] != s2[1]) ? s1[1] - s2[1] : c1 - c2;
	        } else {
	            return (s1[1] != c2) ? s1[1] - c2 : 1;
	        }
	    } else {
	        if (isldvowel(c2) {
	            return (c1 != s2[1]) ? c1 - s2[1] : -1;
	        } else {
	            return c1 - c2;
	        }
	    }
	}

Sorting Thai with strcmp()

In 1969, D. Londe and Udom Warotamasikkhadit ([3], as quoted in [4]) introduced the first sorting algorithm for Thai. The algorithm converts Thai string into an ASCII-sortable form. The procedure is as follows:

  1. For every leading vowel appearing in the string, swap it with the next character.
  2. Append two zero digits to the string.
  3. Scan the string from left to right. For each tonal marks, Mai Tai Khoo, or Thantakhat found, remove it, append to the string 2 digits representing its original position from the string tail, and then append the mark itself.

For example,

conversion example

This approach is applicable to the uninternationalized DBMS, to which Samphan's tstrcmp() or Theppitak's LeveledTStrCmp() function cannot be easily added. One can create an additional field keeping the ASCII-sortable form, which DBMS can use to sort the entries.

Such conversion could be expressed in a C function like this:

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <assert.h>

	typedef unsigned char tchar;  /* Thai character type */

	#define isstone(c)    (0xE6<=(c) && (c)<=0xEC)
	#define isldvowel(c)  (0xE0<=(c) && (c)<=0xE4)

	/*
	 * Precondition:
	 *    bufSize >= strlen(tstr) + 3*(t+1)
	 *    where t = number of tonal marks + Mai Tai Khoo + Thantakhat
	 * Returns:
	 *    non-zero if successful, outBuf = the sortable form of tstr;
	 *    0 otherwise
	 */
	int Thai2Sortable(const tchar* tstr, tchar* outBuf, int bufSize)
	{
	    const int len = strlen((const char*)tstr);
	    const tchar* p = tstr;
	    tchar* const leftBuf = outBuf;
	    tchar* const rightBuf = leftBuf + len + 2;
	    tchar* const maxRight = outBuf + bufSize;
	    tchar* pLeft = leftBuf;
	    tchar* pRight = rightBuf;

	    if (rightBuf >= maxRight)  return 0;

	    while (*p != '\0') {
	        if (isldvowel(*p) && p[1] != '\0') {
	            assert(!isldvowel(p[1]));
	            assert(!isstone(p[1]));
	            *pLeft++ = p[1];
	            *pLeft++ = *p;
	            p += 2;
	        } else if (isstone(*p)) {
	            if (pRight + 3 > maxRight) return 0;
	            sprintf((char*)pRight, "%02d", (tstr + len) - p);
	            pRight += 2;
	            *pRight++ = *p++;
	        } else {
	            *pLeft++ = *p++;
	        }
	    }
	    *pLeft++ = '0';
	    *pLeft++ = '0';
	    memmove(pLeft, rightBuf, pRight - rightBuf);
	    pLeft[pRight - rightBuf] = '\0';

	    return 1;
	}

References

[1]
Samphan Khamthaidee, Thai Style Algorithm: Sorting Thai, Computer Review, Vol. 91 (March 1992), Man Group, Bangkok, 1992. (in Thai)
[2]
Theppitak Karoonboonyanan, Sorting Thai Words with Punctuation Marks, NECTEC Journal, Vol. 14 (Jan-Feb 1997), National Electronics and Computer Technology Center, Bangkok, 1997. (in Thai)
[3]
Londe, D., Warotamasikkhadit, U., Computerized Alphabetization of Thai, Tech. Memo. TM-BA-1000/000/01, System Development Corporation, CA. 1969.
[4]
Vichit Lohcheerachoonhakool, Thai Sorting Algorithm, Thai Algorithm: Design, Analysis and Application, NIDA and Electrical Generation Authority of Thailand, Bangkok, 1979. (in Thai)
free html hit counter