Thai vs Indic Encoding Scheme
I wrote this page after hearing, in several seminars and on mailing lists,
many criticisms of the Thai (and Lao) encoding scheme, which differs from that
of other Indic scripts. Although I can see some valid points in those
arguments, I feel they are overstated and need some clarification.
Most often, the issue is the Thai "visual" order versus the Indic "logical"
order. In Thai, characters are encoded from left to right in visual order,
so a leading vowel is encoded before the consonant that follows it. Under the
Indic encoding scheme, the consonant is always encoded first, followed by the
vowel. Both have advantages and drawbacks. They are most frequently compared
in discussions of string collation, where Thai compares the consonant before
the leading vowel. The work this requires has occasionally been overstated as
"phonetic analysis", and myths have spread from there. It is actually not that
bad, as you shall see, and I still think the current encoding scheme is
appropriate for Thai.
History of Thai Script
- The oldest evidence is dated 1826 B.E. (1283 A.D.),
by King Ramkhamhaeng the Great of the Sukhothai Kingdom.
- Scripts available before that included Mon, Tham and Khmer.
King Ramkhamhaeng's script
- A revolutionary writing system. Many complications were eliminated.
- The letters were based on the Brahmi family.
- 2 tone marks were introduced.
- Individual characters were separated (no conjuncts).
- Most characters could be written with a single stroke.
- All consonants and vowels were written in the same line,
except the tone marks, which were put above the consonant.
Later changes
- Some vowels were moved above or below the line, evidently under the
influence of Khmer script.
- Some consonant shapes evolved.
- Additional signs were invented.
- The changes settled during the Ayutthaya period.
Character Encoding in Computer
- Characters are encoded visually from left to right.
- For marks above or below the line, the vowel is encoded after the
consonant and before the tone mark or diacritic.
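To make the ordering concrete, here is a small Python sketch that lists the
code points of two sample strings in their stored order. The sample strings
and the helper name describe are my own illustration, not part of any
standard; only the standard unicodedata module is used.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs in stored (visual) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# Leading vowel: SARA E is stored before the consonant KO KAI.
LEADING = "\u0e40\u0e01"
# Upper marks: consonant KO KAI, then upper vowel SARA I, then tone mark
# MAI EK, followed by the next consonant NGO NGU.
STACKED = "\u0e01\u0e34\u0e48\u0e07"

if __name__ == "__main__":
    for sample in (LEADING, STACKED):
        for code, name in describe(sample):
            print(code, name)
        print()
```

In the first sample the vowel comes first because it is written to the left
of the consonant. In the second, the vowel and tone mark follow the consonant
they sit on, with the vowel before the tone mark.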
Background
- The encoding scheme simply reflects typewriter practice.
- A typewriter is easy to implement for Thai because of the nature of the
script, which differs from many other Indic scripts.
Consequences
- The national encoding was established (1986) long before Unicode.
- Thai support in systems had settled on a specification
agreed upon by vendors.
- A large number of Thai documents are encoded with this scheme.
→ Indic-style encoding is not preferred.
Issues
Input Method
- Like most other LTR scripts I have heard of, Thai is preferably
written from left to right in visual order. This is also the
preferred input method for most Thai users.
- With the current visual encoding scheme, the input method is obviously
straightforward. No question.
- If Thai were encoded per the Indic encoding scheme, the input method
would have to convert the visual order into the so-called "logical"
order, as is reportedly done for many Indic scripts. The conversion for
Thai, however, would be complicated, and in many cases context sensitive.
It would require heavy read/write access to the application input buffer to
update the "logical" form dynamically as the context changes. Note that
the lazy preedit-and-commit scheme is not sufficient for editing existing
text at arbitrary mouse-clicked caret positions.
Collation
- The Thai collation standard sorts visually, with two special treatments:
- Tone marks are ignored on the first pass and considered on the second
pass
→ easily covered by the CTT
- A leading vowel is considered after the character that
immediately follows it
→ solved by either:
- reordering as preprocessing (Unicode TR #10, as trivial as
swapping two adjacent characters)
- using `contractions' (glibc tailoring)
- Indic-style encoding, if applied at the phonetic level, would
introduce non-standard orders. An over-complicated reverse
phonetic analysis would be required to recover the standard order.
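The two special treatments above can be sketched in a few lines of Python.
This is only an illustration under loud assumptions: raw code points stand in
for real collation weights, only the four tone marks (U+0E48 to U+0E4B) are
treated as second-pass characters, and the names reorder_leading_vowels and
sort_key are hypothetical. It is not the Unicode Collation Algorithm or the
glibc tailoring.

```python
# Leading vowels SARA E, SARA AE, SARA O, SARA AI MAIMUAN, SARA AI MAIMALAI.
LEADING_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")
# Tone marks MAI EK, MAI THO, MAI TRI, MAI CHATTAWA (simplifying assumption:
# other diacritics are not handled here).
TONE_MARKS = set("\u0e48\u0e49\u0e4a\u0e4b")


def reorder_leading_vowels(text):
    """Swap each leading vowel with the character immediately after it."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in LEADING_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


def sort_key(text):
    """Two-pass key: tone marks are ignored first, then break ties."""
    reordered = reorder_leading_vowels(text)
    primary = [ord(c) for c in reordered if c not in TONE_MARKS]
    secondary = [ord(c) for c in reordered if c in TONE_MARKS]
    return (primary, secondary)


if __name__ == "__main__":
    words = ["\u0e40\u0e01", "\u0e02\u0e32", "\u0e01\u0e32"]
    print(sorted(words))                # plain code point order
    print(sorted(words, key=sort_key))  # leading vowel sorted after consonant
```

Sorting by code point alone would push every word that starts with a leading
vowel to the end of the list. With the swap, such a word sorts under its
consonant, which is the point of the preprocessing step.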
Word break/line wrapping
- Yes, this is a problem that is not easily solved with 100% accuracy.
- An Indic-style encoding scheme, however, could give some hints when
phonetic analysis is performed.
Conclusion
- With Indic-style encoding, the complication would just be moved from
the application level to the input method, for which adequate protocol
support is still rare.
- Meanwhile, string collation is another factor. If the Indic scheme were
applied at the phonetic level, it would over-complicate both string
collation and the input method. If it were applied just enough to serve
string collation without rearrangement, it wouldn't help much with word
analysis.
- On the other hand, Thai visual encoding eases string collation and
allows simpler, more intuitive input methods. It does leave some
disambiguation problems to applications, but a complete encoding
scheme that covers all complexities of Thai orthography would be
complicated and costly to implement in other respects.
- Therefore, Thai is encoded as it currently is.
Disclaimer
- All information on this page is just my personal opinion.
- It should not be taken as a reference.
Resources