Thai vs Indic Encoding Scheme

I wrote this page after hearing, in several seminars and on mailing lists, a good deal of criticism of the Thai (and Lao) encoding scheme for differing from that of other Indic scripts. Although I can see some valid points in those arguments, I feel they are overstated, and some clarification seems to be needed.

Most often, the issue is the Thai "visual" order versus the Indic "logical" order. In Thai, characters are encoded from left to right in visual order, so a leading vowel is encoded before the consonant that follows it. Under the Indic encoding scheme, the consonant is always encoded first, followed by the vowel. Both have advantages and drawbacks. They are most frequently compared in discussions of string collation, where Thai compares the consonant before the leading vowel. This has occasionally been overstated as requiring "phonetic analysis", and myths have spread from there. Well, it's not that bad, actually, as you shall see. And I still think the current encoding scheme is appropriate for Thai.

History of Thai Script

The oldest evidence is dated 1826 B.E. (1283 A.D.), from King Ramkhamhaeng the Great of the Sukhothai Kingdom.
Scripts available before that include Mon, Tham and Khmer.

King Ramkhamhaeng's script

A revolutionized writing system. Many complications were eliminated.
The alphabet was based on the Brahmi family.
2 tone marks were introduced.
Individual characters were separated (no conjuncts).
Most characters could be written with a single stroke.
All consonants and vowels were written in the same line, except the tone marks, which were put above the consonant.

Later changes

Some vowels were moved to the upper/lower level, obviously influenced by Khmer script.
Some consonant shapes evolved.
Additional signs were invented.
The changes settled in the Ayutthaya era.

Character Encoding in Computer

Characters are encoded visually from left to right.
For upper/lower marks, encode vowel after consonant and before tone/diacritic.
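
To make the two rules above concrete, here is a minimal sketch that lists the stored code points of two short Thai strings. The helper name `show_storage_order` and the sample strings are my own choices for illustration; only Python's standard `unicodedata` module is used.

```python
import unicodedata

def show_storage_order(text):
    """Return (code point, Unicode name) pairs in the order the characters are stored."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# Leading vowel SARA E is stored before the consonant KO KAI, exactly as written.
leading = show_storage_order("\u0e40\u0e01")

# Consonant KO KAI, then upper vowel SARA I, then tone mark MAI EK, then NGO NGU.
stacked = show_storage_order("\u0e01\u0e34\u0e48\u0e07")

for code, name in leading + stacked:
    print(code, name)
```

The first string shows the left-to-right visual rule for a leading vowel; the second shows the upper vowel stored after its consonant and before the tone mark.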

Backgrounds

The encoding scheme simply reflects typewriter practice.
Typewriter: easy to implement, because the nature of the script differs from that of many other Indic scripts.

Consequence

A national encoding was established (1986), long before Unicode.
Thai support systems were already well settled, with a specification agreed upon by vendors.
A large number of Thai documents are encoded with this scheme.

→ Indic-style encoding not preferred.

Issues

Input Method

Like most other LTR scripts I have heard of, Thai is preferably written from left to right in visual order. And this is the preferred input method for most Thai users.
For the current visual encoding scheme, the input method is obviously straightforward. No question.
If Thai were encoded as per the Indic encoding scheme, the Thai input method would have to convert the visual order into the so-called "logical" order, as is reportedly done for many Indic scripts. The conversion for Thai, however, would be complicated, and in many cases context sensitive. It would require heavy read/write access to the application's input buffer to dynamically update the "logical" form as the context changes. Note that the lazy preedit-and-commit scheme is not sufficient for editing existing text at arbitrary mouse-clicked caret positions.
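
The buffer-rewriting problem can be illustrated with a rough sketch. Everything in it is hypothetical: the "logical" order here is simply "leading vowel moved behind the consonant or cluster it precedes", and `ASSUMED_CLUSTERS` is a tiny made-up list, not a real analysis of Thai orthography. The point is only to show that the logical form is not append-only while the user types.

```python
LEADING_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")  # SARA E .. SARA AI MAIMALAI

# Hypothetical, incomplete list of consonant pairs treated as clusters.
ASSUMED_CLUSTERS = {"\u0e01\u0e23", "\u0e01\u0e25", "\u0e1b\u0e23", "\u0e1b\u0e25"}

def to_logical(visual):
    """Naively move each leading vowel behind the consonant or assumed cluster after it."""
    out = []
    i = 0
    while i < len(visual):
        ch = visual[i]
        if ch in LEADING_VOWELS and i + 1 < len(visual):
            # Take two consonants if they form an assumed cluster, otherwise one.
            span = 2 if visual[i + 1:i + 3] in ASSUMED_CLUSTERS else 1
            out.append(visual[i + 1:i + 1 + span] + ch)
            i += 1 + span
        else:
            out.append(ch)
            i += 1
    return "".join(out)

def type_keys(keys):
    """Return the logical buffer content after each keystroke."""
    typed = ""
    states = []
    for key in keys:
        typed += key
        states.append(to_logical(typed))
    return states

# Typing SARA E, KO KAI, RO RUA in visual order.
states = type_keys("\u0e40\u0e01\u0e23")
for state in states:
    print([f"U+{ord(ch):04X}" for ch in state])
```

After the second keystroke the buffer holds KO KAI + SARA E; the third keystroke forces it to become KO KAI + RO RUA + SARA E, so text already passed to the application has to be read back and rewritten. Even this sketch cannot tell whether the two consonants really form a cluster in a given word, which is where the context sensitivity comes from.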

Collation

The Thai collation standard sorts visually, with two special treatments:
- Tone marks are ignored in the first pass and considered in the second pass
  → easily covered by the CTT
- Leading vowels are considered after the character that immediately follows them
  → solved by either:
  - reordering as preprocessing (Unicode TR #10, as trivial as swapping two adjacent characters)
  - using "contractions" (glibc tailoring)
Indic-style encoding, if applied at the phonetic level, would introduce non-standard orders. An over-complicated reverse phonetic analysis would then be required.
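
The two special treatments can be sketched as a sort key. This is a simplified illustration, not the standard itself: it assumes plain code point order is good enough for the primary weights, handles only the four tone marks (U+0E48 to U+0E4B) in the second pass, and the names `thai_sort_key`, `LEADING_VOWELS` and `TONE_MARKS` are my own.

```python
LEADING_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")  # SARA E .. SARA AI MAIMALAI
TONE_MARKS = set("\u0e48\u0e49\u0e4a\u0e4b")            # MAI EK .. MAI CHATTAWA

def thai_sort_key(word):
    """Build a two-level key: tone marks ignored first, then considered."""
    chars = list(word)
    i = 0
    # Preprocessing: swap each leading vowel with the character right after it.
    while i < len(chars) - 1:
        if chars[i] in LEADING_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    reordered = "".join(chars)
    # First pass: the reordered string with tone marks removed.
    primary = "".join(ch for ch in reordered if ch not in TONE_MARKS)
    # Second pass: the full reordered string breaks ties between equal primaries.
    return (primary, reordered)

# KO KAI + SARA AA, SARA E + KO KAI, KHO KHAI + SARA AA
words = ["\u0e40\u0e01", "\u0e02\u0e32", "\u0e01\u0e32"]
by_code_point = sorted(words)
by_thai_key = sorted(words, key=thai_sort_key)

# KO KAI + MAI EK + SARA AA, KO KAI + SARA AA, KO KAI + SARA AA + KO KAI
toned = ["\u0e01\u0e48\u0e32", "\u0e01\u0e32\u0e01", "\u0e01\u0e32"]
toned_sorted = sorted(toned, key=thai_sort_key)
```

With plain code point sorting, the word starting with SARA E lands after everything beginning with a consonant; with the key, it is filed under KO KAI. In the second list, the word carrying MAI EK sorts right after its toneless twin instead of after the longer word.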

Word break/line wrapping

Yes, this is a problem that is not easily solved with 100% accuracy.
An Indic-style encoding scheme, however, could give some hints when phonetic analysis is performed.

Conclusion

With Indic-style encoding, the complication would just be moved from the application level to the input method, for which adequate protocol support is still rare.
Meanwhile, string collation is another factor. If the Indic scheme were applied at the phonetic level, it would over-complicate both string collation and the input method. If it were applied just enough to suffice for string collation without rearrangement, it wouldn't help much with word analysis.
On the other hand, Thai visual encoding eases string collation, and allows simpler and more intuitive input methods. It does leave some disambiguation problems to applications, but a complete encoding scheme that covers all the complexities of Thai orthography would be complicated and would cost much implementation effort in other aspects.
Therefore, Thai is encoded as it currently is.

Disclaimer

All information in this page is just my personal opinion.
It can't be taken as a reference.

Resources

Thai Input Method Implementations, a story of the nature of Thai input method and some implementations
Thai Sorting Algorithms, about Thai string collation specification and implementations
Thai-English Bilingual Sorting, an enhancement of the Thai string collation specification, as part of fulfilling international standards like ISO/IEC 14651 and Unicode TR#10
Thai Locale, a description of the full implementation of Thai POSIX locale