Thai Script Shaping

Problem Definition

Thai script stacks characters on multiple levels. Consonants and some leading or following vowels sit on the baseline. An upper or lower vowel may then combine above or below the consonant, and the stack may be completed by a tone mark or diacritic.

In summary, a Thai grapheme cluster takes one of the following forms:

  1. LV|FV
  2. C [BV|AV] [T|AD]

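where:

  LV = leading vowel
  FV = following vowel
  C = consonant
  BV = below vowel
  AV = above vowel
  T = tone mark
  AD = above diacritic

The first form is simple: just an individual character. It is the second form that the shaping rules must deal with.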
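The two cluster forms can be sketched as a regular expression. This is a minimal illustration, not a complete segmenter: the code point ranges assigned to each class below are assumptions drawn from the Unicode Thai block rather than from this text, PHINTHU is grouped with the below vowels for convenience, and the names CLUSTER and split_clusters are hypothetical.

```python
import re

# Assumed character classes (not specified in the text above).
LV = "\u0E40-\u0E44"               # leading vowels: SARA E .. SARA AI MAIMALAI
FV = "\u0E30\u0E32\u0E33\u0E45"    # following vowels: SARA A, SARA AA, SARA AM, LAKKHANGYAO
C = "\u0E01-\u0E2E"                # consonants: KO KAI .. HO NOKHUK
BV = "\u0E38-\u0E3A"               # below marks: SARA U, SARA UU, PHINTHU
AV = "\u0E31\u0E34-\u0E37"         # above vowels: MAI HAN-AKAT, SARA I .. SARA UEE
T = "\u0E48-\u0E4B"                # tone marks: MAI EK .. MAI CHATTAWA
AD = "\u0E47\u0E4C-\u0E4E"         # above diacritics: MAITAIKHU, THANTHAKHAT, NIKHAHIT, YAMAKKAN

CLUSTER = re.compile(
    f"[{LV}{FV}]"                     # form 1: LV|FV
    f"|[{C}][{BV}{AV}]?[{T}{AD}]?"    # form 2: C [BV|AV] [T|AD]
    "|.",                             # anything else passes through as a single character
    re.DOTALL,
)

def split_clusters(text):
    """Split text into clusters following the two forms above."""
    return [m.group(0) for m in CLUSTER.finditer(text)]
```

For example, split_clusters applied to KO KAI + SARA I + MAI EK keeps all three characters in one cluster, while a leading vowel such as SARA E always forms a cluster of its own.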

Typewriter Positioning

Primitive typesetting, as used in mechanical typewriters and framebuffer consoles, places characters at fixed vertical levels to prevent overlapping.

These are also the default glyph positions in most Thai fonts, so plain combining without shaping still yields readable, though suboptimal, rendered text.

Consonant Classification for Shaping

Some Thai consonants have an extra ascender or descender that can overlap the combining character, so some rearrangement is needed to avoid this. Based on the rearrangement operations required, Thai consonants can be classified into 4 classes:

  1. Normal consonants (NC), without extra ascender/descender
  2. Consonants with right extra ascender (AC), namely ป (PO PLA, U+0E1B), ฝ (FO FA, U+0E1D), ฟ (FO FAN, U+0E1F) and in some fonts ฬ (LO CHULA, U+0E2C)
  3. Consonants with removable descender (RC), namely ญ (YO YING, U+0E0D) and ฐ (THO THAN, U+0E10)
  4. Consonants with strict descender (DC), namely ฎ (DO CHADA, U+0E0E) and ฏ (TO PATAK, U+0E0F)
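The classification above can be sketched as a small lookup. The consonant lists come from the four classes just described; the function name consonant_class, the lo_chula_is_ac flag, and the assumption that every other letter in U+0E01..U+0E2E counts as a normal consonant are illustrative choices, not part of the original text.

```python
AC = {"\u0E1B", "\u0E1D", "\u0E1F"}   # PO PLA, FO FA, FO FAN
AC_FONT_DEPENDENT = {"\u0E2C"}        # LO CHULA, AC only in some fonts
RC = {"\u0E0D", "\u0E10"}             # YO YING, THO THAN
DC = {"\u0E0E", "\u0E0F"}             # DO CHADA, TO PATAK

def consonant_class(ch, lo_chula_is_ac=False):
    """Return 'NC', 'AC', 'RC' or 'DC' for a Thai consonant, else None."""
    if ch in AC or (lo_chula_is_ac and ch in AC_FONT_DEPENDENT):
        return "AC"
    if ch in RC:
        return "RC"
    if ch in DC:
        return "DC"
    # Assumption: all remaining letters in U+0E01..U+0E2E are normal consonants.
    if "\u0E01" <= ch <= "\u0E2E":
        return "NC"
    return None
```

Because the treatment of LO CHULA depends on the font design, the flag is left to the caller rather than hard-coded.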

Shaping Rules

The general rules are:

  1. SARA AM (U+0E33) must be decomposed into NIKHAHIT (U+0E4D) and SARA AA (U+0E32). If a tone mark (T) is present before it, the NIKHAHIT must be reordered so that it comes before the tone mark.
  2. If a tone mark (T) is present without an upper vowel (AV), it must be lowered, possibly with a scaled-up size.
  3. Any above-base combining mark (T, AV, AD) that combines with a consonant with an extra ascender (AC) must be shifted left.
  4. If a below vowel (BV) combines with a consonant with a strict descender (DC), the vowel must be lowered.
  5. The descender of a consonant with a removable descender (RC) must be removed when the consonant is combined with a below vowel (BV); the BV then need not be shifted.
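Rule 1 is the only rule that changes the character sequence rather than glyph positions, so it is worth a short sketch. The following is a minimal illustration on plain strings; the function name decompose_sara_am is hypothetical, and a real shaper would typically do this on its glyph buffer instead.

```python
SARA_AM = "\u0E33"
NIKHAHIT = "\u0E4D"
SARA_AA = "\u0E32"
TONE_MARKS = "\u0E48\u0E49\u0E4A\u0E4B"  # MAI EK, MAI THO, MAI TRI, MAI CHATTAWA

def decompose_sara_am(text):
    """Replace each SARA AM by NIKHAHIT + SARA AA, moving the NIKHAHIT
    in front of an immediately preceding tone mark."""
    out = []
    for ch in text:
        if ch != SARA_AM:
            out.append(ch)
        elif out and out[-1] in TONE_MARKS:
            tone = out.pop()
            out.extend([NIKHAHIT, tone, SARA_AA])
        else:
            out.extend([NIKHAHIT, SARA_AA])
    return "".join(out)
```

With this reordering, NO NU + MAI THO + SARA AM becomes NO NU + NIKHAHIT + MAI THO + SARA AA, so the tone mark stacks above the NIKHAHIT just as it would above an upper vowel.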

This can be summarized in the following table:

  base \ comb   AV*     BV      T**      {AV}T**
  NC            -       -       SD(c)    -
  AC            SL(c)   -       SDL(c)   SL(c)
  RC            -       RD(b)   SD(c)    -
  DC            -       SD(c)   SD(c)    -

* MAITAIKHU (U+0E47), NIKHAHIT (U+0E4D) and YAMAKKAN (U+0E4E) are treated as AV here.

** THANTHAKHAT (U+0E4C) is treated as T here.

where:

  SD = shift down
  SL = shift left
  SDL = shift down and left
  RD = remove descender

c in the parameter means the combining mark, and b means the base consonant.
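The table translates directly into a nested lookup. In this sketch the operation names are taken from the table, reading SD, SL, SDL and RD as shift down, shift left, shift down-left and remove descender, which is an interpretation based on the shaping rules. The context key "AV+T" (a tone mark that follows an above vowel) and the function name shaping_op are hypothetical.

```python
# (operation, target) where target is "c" (combining mark) or "b" (base consonant).
# None means no adjustment is needed.
SHAPING = {
    "NC": {"AV": None,        "BV": None,        "T": ("SD", "c"),  "AV+T": None},
    "AC": {"AV": ("SL", "c"), "BV": None,        "T": ("SDL", "c"), "AV+T": ("SL", "c")},
    "RC": {"AV": None,        "BV": ("RD", "b"), "T": ("SD", "c"),  "AV+T": None},
    "DC": {"AV": None,        "BV": ("SD", "c"), "T": ("SD", "c"),  "AV+T": None},
}

def shaping_op(base_class, context):
    """Look up the adjustment for a base consonant class and a combining context."""
    return SHAPING[base_class][context]
```

For instance, shaping_op("AC", "T") gives ("SDL", "c"): a tone mark placed directly on a consonant with an extra ascender is both lowered and shifted left.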

PUA Shaping

Before OpenType, vendors developed several ad hoc shaping solutions. All of them used logically the same positioning-by-substitution technique with the same extra glyph sets, but unfortunately with different code point assignments. There were two major extensions, one for Microsoft Windows and the other for Apple Mac OS. This divergence prevented fonts from being interchanged between the two platforms, as fonts were tied to the vendors' rendering engines.

To support those legacy fonts, which still dominate the market at present, rendering engines should know how to access those private glyphs and substitute them properly.

Each vendor extension came in 2 sets, one for 8-bit pre-Unicode fonts and the other for the Unicode Private Use Area (PUA). 8-bit fonts are rare nowadays, but they are listed here for historical reference.

  Glyph                  8-bit Windows  Windows PUA  8-bit Mac  Mac PUA
  Low MAI EK             0x8B           U+F70A       0x88       U+F88B
  Low MAI THO            0x8C           U+F70B       0x89       U+F88E
  Low MAI TRI            0x8D           U+F70C       0x8A       U+F891
  Low MAI CHATTAWA       0x8E           U+F70D       0x8B       U+F894
  Low THANTHAKHAT        0x8F           U+F70E       0x8C       U+F897
  Low-left MAI EK        0x86           U+F705       0x83       U+F88C
  Low-left MAI THO       0x87           U+F706       0x84       U+F88F
  Low-left MAI TRI       0x88           U+F707       0x85       U+F892
  Low-left MAI CHATTAWA  0x89           U+F708       0x86       U+F895
  Low-left THANTHAKHAT   0x8A           U+F709       0x87       U+F898
  Left MAI EK            0x9B           U+F713       0x98       U+F88A
  Left MAI THO           0x9C           U+F714       0x99       U+F88D
  Left MAI TRI           0x9D           U+F715       0x9A       U+F890
  Left MAI CHATTAWA      0x9E           U+F716       0x9B       U+F893
  Left THANTHAKHAT       0x9F           U+F717       0x9C       U+F896
  Left MAI HAN-AKAT      0x98           U+F710       0x92       U+F884
  Left SARA I            0x81           U+F701       0x94       U+F885
  Left SARA II           0x82           U+F702       0x95       U+F886
  Left SARA UE           0x83           U+F703       0x96       U+F887
  Left SARA UEE          0x84           U+F704       0x97       U+F888
  Left MAITAIKHU         0x9A           U+F712       0x93       U+F889
  Left NIKHAHIT          0x99           U+F711       0x8F       U+F899
  Low SARA U             0xFC           U+F718       0xFC       U+F89B
  Low SARA UU            0xFD           U+F719       0xFD       U+F89C
  Low PHINTHU            0xFE           U+F71A       0xFE       U+F89D
  Desc-less YO YING      0x90           U+F70F       0x90       U+F89A
  Desc-less THO THAN     0x80           U+F700       0x80       U+F89E
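As an illustration of positioning by substitution, the Windows PUA column of the table above can be turned into a lookup keyed by operation and character. This is only a sketch: pairing "Low", "Low-left", "Left" and "Desc-less" with the operation names SD, SDL, SL and RD is an interpretation, and the names WINDOWS_PUA and substitute are hypothetical. A Mac PUA table would be built the same way from the last column.

```python
# MAI EK, MAI THO, MAI TRI, MAI CHATTAWA, THANTHAKHAT
TONES = "\u0E48\u0E49\u0E4A\u0E4B\u0E4C"

WINDOWS_PUA = {
    ("SL", "\u0E31"): "\uF710",  # Left MAI HAN-AKAT
    ("SL", "\u0E34"): "\uF701",  # Left SARA I
    ("SL", "\u0E35"): "\uF702",  # Left SARA II
    ("SL", "\u0E36"): "\uF703",  # Left SARA UE
    ("SL", "\u0E37"): "\uF704",  # Left SARA UEE
    ("SL", "\u0E47"): "\uF712",  # Left MAITAIKHU
    ("SL", "\u0E4D"): "\uF711",  # Left NIKHAHIT
    ("SD", "\u0E38"): "\uF718",  # Low SARA U
    ("SD", "\u0E39"): "\uF719",  # Low SARA UU
    ("SD", "\u0E3A"): "\uF71A",  # Low PHINTHU
    ("RD", "\u0E0D"): "\uF70F",  # Desc-less YO YING
    ("RD", "\u0E10"): "\uF700",  # Desc-less THO THAN
}

# The tone marks and THANTHAKHAT occupy consecutive code points in each variant.
for i, ch in enumerate(TONES):
    WINDOWS_PUA[("SD", ch)] = chr(0xF70A + i)   # Low
    WINDOWS_PUA[("SDL", ch)] = chr(0xF705 + i)  # Low-left
    WINDOWS_PUA[("SL", ch)] = chr(0xF713 + i)   # Left

def substitute(op, ch):
    """Return the Windows PUA variant for (op, ch), or ch itself if none exists."""
    return WINDOWS_PUA.get((op, ch), ch)
```

Falling back to the original character when no variant exists keeps the text readable at the default typewriter positions, even with a font that lacks the PUA glyphs for a given combination.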

Notes

OpenType Shaping

Modern fonts should no longer rely on the PUA, as modern software supports OpenType more and more widely.

Please refer to the Spec for Thai OpenType Font Creation for details.

