XFree86 Thai Support

Character Sets
Output Methods
Input Methods
Resource Summary

Character Sets

The native Thai character set is an 8-bit code set (TIS-620), and a large body of existing text files and e-mail is encoded in it. The character set has been supported by the GNU C library's th_TH locale since version 2.1.1.

However, as the world moves toward UCS (ISO/IEC 10646-1), X has begun to support UCS in many respects. Because Thai characters have been covered since the Unicode Consortium's very first effort, Thai is already past the stage of defining a character encoding for multilingual support.

TIS-620

TIS-620 is the Thai national standard character set for use in computers. It is an extension of ISO 646, a 7-bit code set that is very close to ASCII. TIS-620 was first published in 1986 (2529 B.E.) and was amended in some details in 1990 (2533 B.E.) for conformance with ISO/IEC 2022, but the code table remained the same.

TIS-620 is supported by the GNU C library under the th_TH locale, and iconv(3) can be used to convert between TIS-620 and other encodings.
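As a rough illustration of such a conversion, the following Python sketch does the same job as `iconv -f TIS-620 -t UTF-8`, but with the `tis_620` codec from Python's standard library rather than iconv(3) itself. The function names are made up for this example.

```python
def tis620_to_utf8(data: bytes) -> bytes:
    """Re-encode TIS-620 bytes as UTF-8."""
    return data.decode("tis_620").encode("utf-8")


def utf8_to_tis620(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as TIS-620 (raises if a character has no TIS-620 code)."""
    return data.decode("utf-8").encode("tis_620")


if __name__ == "__main__":
    sample = bytes([0xA1, 0xD2])  # KO KAI followed by SARA AA in TIS-620
    converted = tis620_to_utf8(sample)
    print(converted.decode("utf-8"))
    # Converting back should give the original bytes again.
    print(utf8_to_tis620(converted) == sample)
```

Text containing characters outside TIS-620 cannot be converted back; the second function raises an error in that case instead of silently dropping them.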

In X11R6, TIS-620 is defined as TIS620.2533 in the th_TH.TACTIS locale. TACTIS is in fact an extension to TIS-620, proposed by the Thai API Consortium, that adds a word break character (0xDC). That extension has never been adopted anywhere, and no existing application supports the word break character. (ISO/IEC 10646-1 provides an equivalent, the Zero-Width Space [ZWSP, U+200B].) Therefore, we propose to use the th_TH.TIS-620 locale instead. Moreover, the equivalence of TIS620.2529 and TIS620.2533 in the code table has caused confusion among font vendors and applications, so we propose to drop the Buddhist Era suffix and use just TIS620.

Chanop Silpa-Anan has provided a patch against XFree86 4.0.2 for the code set change.

ISO/IEC 8859-11

The 8-bit TIS-620 code table is very similar in nature to those for Western scripts, which are defined in ISO/IEC 8859. TISI has therefore tried to promote the TIS-620 code table internationally by proposing a Latin/Thai part of ISO/IEC 8859 with the same table. The table was assigned as part 11 of the standard. Although it was once rejected because the combining characters in Thai script make it different from the Latin scripts, it was later reactivated and became an international standard in December 2000. I have kept an old link to its FCD, which you may study, but it cannot serve as a reference.

In essence, ISO-646 + TIS-620 + 0xA0 = ISO-8859-11, where 0xA0 is the no-break space. The new standard is becoming known in more internationalized applications. Therefore, we will support both encodings in parallel, and expect ISO-8859-11 to be our future encoding.
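The relationship between the two tables can be checked mechanically. The sketch below compares Python's own `tis_620` and `iso8859_11` codecs byte by byte; it tests those two codec tables, which are assumed here to follow the standards, not the X or glibc data files. The function name is hypothetical.

```python
def differing_bytes(codec_a: str, codec_b: str) -> list:
    """Return the byte values that two single-byte codecs decode differently."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        # errors="replace" turns an undefined byte into U+FFFD instead of raising,
        # so bytes that are undefined in both codecs compare as equal.
        a = raw.decode(codec_a, errors="replace")
        b = raw.decode(codec_b, errors="replace")
        if a != b:
            diffs.append(value)
    return diffs


if __name__ == "__main__":
    for value in differing_bytes("tis_620", "iso8859_11"):
        print(hex(value))
```

If the equation above holds for the codec tables, the only value printed should be the no-break space position.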

GNU C library 2.3 already supports the th_TH.ISO-8859-11 locale. In addition, I have made a patch to add the th_TH.ISO8859-11 locale to XFree86 4.2.1.

ISO/IEC 10646-1

Many applications have now moved toward multilingual support, for which the core character set standard is ISO/IEC 10646 (Universal Multiple-Octet Coded Character Set, UCS). If you are familiar with Unicode, it is essentially the same. The Unicode Consortium was founded before ISO/IEC took up the work and formed a working group to make it an international standard. The Unicode Consortium now sits on the committee, proposing drafts and voting alongside delegates from member countries.

For X, UCS support has strong momentum. LI18NUX has promoted the use of the UTF-8 locales provided by the GNU C library. Within X itself, the X library has multibyte and wide-character support. xterm now supports UTF-8 and some multilingual text rendering. Markus Kuhn has prepared a set of BDF UCS fonts covering characters of as many languages as possible (Thai included), which is included in XFree86 4.0. The Pango rendering engine of the GTK+ toolkit provides multilingual text rendering for many languages based on UCS encoding. Qt has also supported UCS since version 2.0.

For more information about the use of UCS in Linux, please visit Markus Kuhn's UTF-8 and Unicode FAQ for Unix/Linux.

Thai has been allocated in the range of 0x0E00-0x0E7F of Unicode, and hence ISO/IEC 10646-1, with the same layout as the TIS-620 national standard.
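Because the layout is the same, converting a TIS-620 byte to its UCS code point is a fixed offset. The following minimal sketch spells that out; the function and constant names are invented for illustration, and the ranges of defined bytes (0xA1-0xDA and 0xDF-0xFB) are taken from the TIS-620 code table.

```python
THAI_BLOCK_START = 0x0E00   # first code point of the Thai block in UCS
TIS620_HIGH_START = 0xA0    # Thai characters start right after this byte


def tis620_byte_to_ucs(byte: int) -> int:
    """Map one TIS-620 byte to its UCS code point."""
    if 0x00 <= byte < 0x80:
        return byte  # the lower half is the ISO 646 (ASCII-like) part
    if 0xA1 <= byte <= 0xDA or 0xDF <= byte <= 0xFB:
        return THAI_BLOCK_START + (byte - TIS620_HIGH_START)
    raise ValueError(f"byte 0x{byte:02X} is not defined in TIS-620")


if __name__ == "__main__":
    # KO KAI is 0xA1 in TIS-620, so it lands on 0x0E00 + 0x01.
    print(hex(tis620_byte_to_ucs(0xA1)))
```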

For more information on Thai character sets, please visit Trin Tantsetthi's An annotated reference to the Thai implementations.

I have made a patch to add th_TH.UTF-8 locale, as well as UTF-8 support in Thai XIM, to XFree86 4.2.1. Note that this patch requires another XIM patch to be applied first.

Output Methods

Font Encoding Topics

This section discusses the various font encodings that can be used to make X applications display Thai text.

Latin Exploit

Some European-centric applications have no notion of Thai character sets and, in principle, should not be able to display Thai at all. However, we still need to browse Thai web pages on X, and that capability is essential for Thai users migrating from other operating systems. Because Latin and Thai character sets are both 8-bit, we can deceive those applications by presenting a Thai font under the iso8859-1 encoding name. This is done by creating alias entries in the fonts.alias file in your font directory, like this:

  -thai-fixed-medium-r-normal--14-100-100-100-m-70-iso8859-1    -thai-fixed-medium-r-normal--14-100-100-100-m-70-tis620-0
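For a directory with many Thai fonts, alias lines of this shape can be produced mechanically. The helper below is only a sketch with a made-up name; it assumes every input XLFD ends in -tis620-0 and contains no spaces (names with spaces would need quoting in fonts.alias, which is not handled here).

```python
def latin1_alias(xlfd: str) -> str:
    """Build a fonts.alias line that presents a tis620-0 font as iso8859-1."""
    suffix = "-tis620-0"
    if not xlfd.endswith(suffix):
        raise ValueError("expected an XLFD ending in -tis620-0")
    if " " in xlfd:
        raise ValueError("XLFD names with spaces are not handled by this sketch")
    # Alias name first, real font name second, as in the entry above.
    alias = xlfd[: -len(suffix)] + "-iso8859-1"
    return f"{alias}    {xlfd}"


if __name__ == "__main__":
    print(latin1_alias("-thai-fixed-medium-r-normal--14-100-100-100-m-70-tis620-0"))
```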

Applications which can be deceived this way include:

Netscape
- To type Thai text in the (Motif) text box, set the fixed-width font of the Western encoding to an aliased Thai font. (Netscape 4.7x and later does not allow zero-width characters, though, which makes it impossible to type upper/lower vowels. It is recommended that you use the fonts for mule [e.g. -etl-fixed-...] in this case.)
- To make drop-down list boxes and buttons in HTML forms display Thai text, set the variable-width font of the Western encoding to an aliased Thai font.
- However, this recommendation does not imply that web authors should also use iso8859-1 as the encoding of their pages, because some picky web browsers will map the Thai characters to Latin Unicode characters and display Latin instead. The recommended charset for Thai web pages is tis-620, which has been officially registered with IANA. windows-874 is an older alternative that was supported by Internet Explorer only. Internet Explorer 5.0 and later supports tis-620, and Mozilla recognized the charset before that, so there is no longer any reason not to use tis-620. For more information, please visit the tis-620 campaign page, and a comprehensive guide.
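The effect of mislabelling a Thai page as iso8859-1 can be reproduced in a few lines. This is only an illustration using Python's standard codecs; the function name is hypothetical.

```python
def misread_as_latin1(text: str) -> str:
    """Encode Thai text as TIS-620, then decode the same bytes as ISO 8859-1."""
    return text.encode("tis_620").decode("iso8859_1")


if __name__ == "__main__":
    thai = "\u0e01\u0e32"  # KO KAI + SARA AA, bytes 0xA1 0xD2 in TIS-620
    # A browser that trusts the iso8859-1 label shows accented Latin instead.
    print(misread_as_latin1(thai))
```

The bytes are unchanged; only the label differs, which is why declaring the charset as tis-620 is enough to fix the display.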

TIS-620 Family

The following names have been found in use as Thai character encodings:

- TIS620.2533-0: defined in X11 XLC_LOCALE, used by some font creators such as Phaisarn, known by the Pango rendering engine, etc.
- TIS620.2529-1: used by some font creators such as ETL, Phaisarn, and Manop.

However, Thai text needs adjustments for elegant display and typesetting, known as shaping among developers. To support this, a convention has been set for naming the different Thai code sets in XLFD, as follows:

- tis620-0 for fonts that provide glyphs for plain TIS-620 characters
- tis620-1 for fonts that provide the MacThai extension to TIS-620
- tis620-2 for fonts that provide the Windows extension to TIS-620

Please note the omission of the .2533 and .2529 Buddhist Era suffixes. This avoids confusion, because the two versions of the standard are essentially the same.

We are working to get new implementations to adopt this convention for Thai text rendering, although the old font sets may still be supported for backward compatibility. The implementation known to support this convention so far is the Pango engine of the GTK+ project.

I have provided a patch for ttmkfdir to generate tis620-[012] entries for the fonts.scale file from Thai TrueType fonts.

ISO/IEC 10646-1 Fonts

Markus Kuhn has prepared a set of BDF UCS fonts that contains characters of as many languages as possible (Thai included), which has been included in XFree86 4.0.

Modern TrueType fonts for Windows are now encoded in Unicode, although most Thai TrueType fonts for Macintosh are still based on 8-bit mapping. Saying that a TrueType font is Unicode means that its glyphs are indexed with the same code values as their corresponding characters, so that a Unicode string can be rendered directly without further mapping. Macintosh fonts are said to be based on 8-bit mapping, even though their glyphs are internally indexed by 16-bit integers, because those indices are not equal to the Unicode character codes. What the outside world sees is their mapping tables, which map 8-bit characters to ad hoc 16-bit glyph indices. It can therefore be said that Windows TrueType fonts now support ISO/IEC 10646-1 and can be used in XFree86 via its TrueType support, while Macintosh fonts are not necessarily usable until new MacOS versions fully support Unicode.

TrueType Font Support

Before XFree86 4.0, native fonts for X were mainly BDF and PostScript. However, there have been several efforts to use TrueType fonts via the X font server (xfs), for example the xfstt TrueType renderer by Herbert Dürr, xfsft by Mark Leisher and Juliusz Chroboczek, and X-TrueType (or X-TT) by Takuya SHIOZAKI. xfstt renders TrueType fonts by itself, while the other two employ the FreeType library. X-TT was designed especially for East Asian fonts.

XFree86 4.0 has incorporated the xfsft (as the freetype module) and X-TT (as the xtt module) for TrueType font support. So, one can use TrueType fonts either via the XFree86 X server with one of the above modules loaded, or via the X font server.

In most TrueType fonts nowadays, glyphs are indexed using Unicode, so that one font can contain glyphs of many languages and reserve space for the glyph variations some languages need. Therefore, some mechanism is needed to map characters from the local character set in use to the corresponding Unicode glyphs. (The map from a local character set into the ISO/IEC 10646-1 table is called a repertoire map.) This mapping is essential for supporting the so-called legacy character sets in Unicode-based applications, including the TrueType font servers.

xfsft Font Server

xfsft hardwires internal conversion tables for iso10646-1, iso8859-<n> for n = 1 to 10 and 15, koi8-r, koi8-u, koi8-ru, koi8-uni, and koi8-e for all scalable fonts (Type1, Speedo, TrueType); microsoft-symbol for TrueType; and apple-roman for Apple TrueType. Beyond these, you can create a repertoire map for your own character set under the /usr/X11R6/lib/X11/fonts/encodings directory. (See the README file in that directory for a description of the structure of the .enc files, and ISO8859-11.enc there for the TIS-620 repertoire map.) xfsft finds these files via the encodings.dir file in the font directory, in the same way that fonts.dir is used for font listing.

To add a new repertoire map, put your .enc file in the /usr/X11R6/lib/X11/fonts/encodings directory. Then, to have your TrueType fonts mapped when rendering text in the new encoding, go to the directory that contains the TrueType fonts and follow these steps:

Add entries to the fonts.scale file for your fonts, specifying your new encoding in the last two fields of the XLFD. Here's an example of the fonts.scale file:

       4
       angsa.ttf -monotype-Angsana New-medium-r-normal--0-0-0-0-p-0-iso8859-1
       angsa.ttf -monotype-Angsana New-medium-r-normal--0-0-0-0-p-0-iso8859-11
       angsa.ttf -monotype-Angsana New-medium-r-normal--0-0-0-0-p-0-tis620-0
       angsa.ttf -monotype-Angsana New-medium-r-normal--0-0-0-0-p-0-tis620-2

This step may be automated using Jöerg Pomnitz' ttmkfdir utility, if your encoding is supported by it:

       $ ttmkfdir > fonts.scale

Or, as an alternative in XFree86 4.3.0, you can also use the mkfontscale command:

       $ mkfontscale

Generate fonts.dir and encodings.dir using mkfontdir with -e your-encoding-dir option (in general your-encoding-dir is /usr/X11R6/lib/X11/fonts/encodings):
```
       $ mkfontdir -e /usr/X11R6/lib/X11/fonts/encodings
```
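When ttmkfdir or mkfontscale does not know an encoding, the fonts.scale entries from the first step have to be written by hand. The sketch below generates text in the same layout as the example above (an entry count followed by one "file XLFD" line per encoding). The function name and the shape of its input are assumptions made for this illustration.

```python
def fonts_scale(entries) -> str:
    """Build fonts.scale text.

    entries: iterable of (font_file, xlfd_without_charset, encodings), where
    encodings lists the registry-encoding suffixes to advertise for that file.
    """
    lines = []
    for font_file, xlfd_base, encodings in entries:
        for encoding in encodings:
            lines.append(f"{font_file} {xlfd_base}-{encoding}")
    # The first line of the file is the number of entries that follow.
    return "\n".join([str(len(lines))] + lines) + "\n"


if __name__ == "__main__":
    text = fonts_scale([
        ("angsa.ttf",
         "-monotype-Angsana New-medium-r-normal--0-0-0-0-p-0",
         ["iso8859-1", "iso8859-11", "tis620-0", "tis620-2"]),
    ])
    print(text, end="")
```

The output still has to be saved as fonts.scale in the font directory before mkfontdir is run as in the second step.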

To use TrueType fonts in your XFree86 4.x, you can use one of these methods:

Using font server
This method can also be used with XFree86 3.3.x.
1. Edit your font server's configuration file, /etc/X11/fs/config. In general, if your distribution already has xfs pre-configured, all you need to do is add the new TrueType font path to the catalogue variable. For example:
```
    clone-self = on
    use-syslog = off
    catalogue = /usr/X11R6/lib/X11/fonts/TrueType/,/usr/share/fonts/TrueType/
    error-file = /usr/X11R6/lib/X11/fs/fs-errors
    # in decipoints
    default-point-size = 120
    default-resolutions = 75,75,100,100
```
2. Start your font server, either by command line, like this:
```
    # xfs -port 7100 &
```
  or by using your distribution's init.d script, like this for Debian:
```
    # /etc/init.d/xfs restart
```
  or for RedHat:
```
    # /etc/rc.d/init.d/xfs restart
```
3. Add the xfs service to your X font path and refresh it:
```
    $ xset +fp tcp/localhost:7100
    $ xset fp rehash
```
  or to add it permanently, add this line to your /etc/X11/XF86Config file:
```
    Section "Files"
        ...
        FontPath   "tcp/localhost:7100"
        ...
    EndSection
    
```
Using freetype module
This cannot be applied to XFree86 3.3.x.

X-TrueType Font Server

Shaping

XOM

Toolkits

Input Methods

XKB

XKB provides the keyboard map for X. It converts keystrokes to the key symbols (keysyms) defined in <X11/keysymdef.h>, with language switching capability. Another X routine then converts the keysyms into the corresponding character codes.
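To give an idea of what that second conversion involves for Thai, here is a small sketch. It assumes, as in the keysym header files I have seen, that Thai keysyms carry 0x0D in the high byte and the TIS-620 code in the low byte (for example XK_Thai_kokai = 0x0da1). It is not the routine the X library uses, and the function name is invented.

```python
def thai_keysym_to_char(keysym: int) -> str:
    """Convert a Thai keysym (0x0dXX) to the corresponding character."""
    if keysym >> 8 != 0x0D:
        raise ValueError(f"0x{keysym:04x} is not in the Thai keysym range")
    tis620_byte = keysym & 0xFF  # assumption: low byte is the TIS-620 code
    # Decoding fails for low bytes that TIS-620 leaves undefined.
    return bytes([tis620_byte]).decode("tis_620")


if __name__ == "__main__":
    XK_Thai_kokai = 0x0DA1
    print(thai_keysym_to_char(XK_Thai_kokai))
```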

You can set the XKB map on the fly using the setxkbmap command. For example, to use the Thai keymap, type this in an X terminal:

    $ setxkbmap th

To set it permanently, you can set this line in your /etc/X11/XF86Config file:

    Section "Keyboard"
       ...
       Option  "XkbLayout"     "th"
       ...
    EndSection

or use the XF86Setup or xf86config utility, in its keyboard section.

However, beginning with XFree86 4.3.0, the default XKB rule allows a keymap to be composed from 2-4 maps, each with a single group. Therefore, to use a Thai-English bilingual keyboard, you need to compose the us and th keymaps together, and may define your group toggle key using an XKB option:

    $ setxkbmap us,th -option grp:alt_shift_toggle

or, in /etc/X11/XF86Config:

    Section "Keyboard"
       ...
       Option  "XkbLayout"     "us,th"
       Option  "XkbOptions"    "grp:alt_shift_toggle"
       ...
    EndSection

A Thai XKB map has been part of X Window since X11R6. The map (/usr/X11R6/lib/X11/xkb/symbols/th) defined two maps, one for English and the other using Latin1 keysyms with the same numerical values as TIS-620. This made Thai keyboard input in X work for a while, although it was something of a Latin hack and the keysyms defined for Thai were not used at all.

Pablo Saratxaga contributed a Thai XKB map using the Thai keysyms in XFree86 4.0.1d. This triggered the later fixes that made the Thai input system more complete. When the new XKB map came into use, many uninternationalized applications failed to accept Thai keys. So ]d provided an XKB map with three maps: one for English, another for the Latin1 equivalent of TIS-620, and the last for genuine Thai keysyms. This map served Thai keyboard input well during the transition, although it was never incorporated into XFree86.

In addition, I have provided a patch to add th_tis (TIS-820.2538) XKB map to XFree86 4.3.0 as well.

Technically speaking, Thai keyboard translation using the XKB map with Thai keysyms requires the locale to be set to th_TH (both through the LANG/LC_CTYPE environment setting and through the application's setlocale() call), and requires the application to use XmbLookupString() for translation instead of the old XLookupString(). With the new Thai XIM in XFree86 4.0.1g, the translation is done in the XFilterEvent() call. (See XIM below.)

XIM

Thai XIM was implemented in the X library using the library model, but it didn't work with features introduced later in X. When the Thai POSIX locale was introduced in the GNU C library, the XIM became active and turned out to have defects in key translation. Pruet Boonma drew attention to the Thai XIM support in the X library and the need to fix it. However, it went unfixed for a long time, and in the meantime the workaround was to set LC_CTYPE to C.

The issue was later raised again in the Thai Linux Working Group. ]d did an initial analysis before asking me whether I was interested in working it out, and I said yes.

The result was a patch against XFree86 4.0.1f, which was checked in and became active in XFree86 4.0.2. To use Thai XIM, you need to do the following:

Set locale to th_TH via the LC_CTYPE environment:
```
    $ export LC_CTYPE=th_TH
```
(Note that LC_ALL, if set, takes precedence over LC_CTYPE, and LANG is used when neither is set.)
Set the im X locale modifier to select the input sequence check mode you want, by setting the XMODIFIERS environment variable to "@im=mode", where mode is one of the following:
- Passthrough for no sequence check (WTT level 0)
- BasicCheck for basic sequence check (WTT level 1)
- Strict for strict sequence check (WTT level 2)
- Thaicat for THAICAT sequence check (no info yet)
For example:
```
    $ export XMODIFIERS="@im=BasicCheck"
```
selects the basic sequence checking method.
If XMODIFIERS is not set, the Thai XIM defaults to BasicCheck.
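The two settings above can be summarised in a short sketch that works out which locale and which check mode a given environment would select. It uses the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG) and the BasicCheck default described above. The function names are hypothetical, the parsing of XMODIFIERS is simplified, and none of this is the X library's actual code.

```python
import os

DEFAULT_MODE = "BasicCheck"


def effective_ctype(env) -> str:
    """Return the locale name that LC_CTYPE resolves to for this environment."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # unset or empty variables are skipped
            return value
    return "C"


def im_mode(env) -> str:
    """Return the mode named by the @im= modifier, or the default."""
    for part in env.get("XMODIFIERS", "").split("@"):
        if part.startswith("im="):
            return part[len("im="):]
    return DEFAULT_MODE


if __name__ == "__main__":
    print(effective_ctype(os.environ), im_mode(os.environ))
```

The sketch returns whatever mode name it finds; how the Thai XIM treats a name outside the four listed above is not covered here.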

Note that the two steps above succeed only with the cooperation of the application and the system settings. The first step needs the application to call setlocale(LC_CTYPE, ""), the C library to provide the th_TH locale, X Window to provide the th_TH X locale with the proper Thai font set (both in the XLC_LOCALE description and in the hard-wired code, which needs Chanop's patch if you want to use the TIS620 font set), and (optionally) Thai fonts with the corresponding font set to be installed. The second step needs the application to call XSetLocaleModifiers("") before doing any XIM work. (Some applications just call XSetLocaleModifiers("@im=none") and ignore the XMODIFIERS setting, which activates a European XIM instead of the Thai one.) Most importantly, the application must support XIM.

However, a bug slipped through: the Shift key cleared the input sequence check state, causing every shifted key to be rejected. I have fixed this with another patch against XFree86 4.0.2, and a corresponding patch for XFree86 4.0.99.1.

The input method works fine for continuous typing, but it loses track of the context when the cursor is moved. A buffer retrieval mechanism is therefore required to determine the validity of a key correctly. This can be done through the XNStringConversionCallback value of the X Input Context (XIC). The aforementioned patch also adds experimental code for such retrieval, and a later patch, plus another bug fix, added the capability to correct input sequences as well. However, toolkit and application developers also need to be pushed to support the callback. As an experiment, I have added this to the xiterm+thai 1.04pre2 terminal emulator.

However, since there are a great many applications whose developers would have to be persuaded to provide the callback, the input method must also do its best to fall back gracefully when the callback is absent.

The good news is that GTK+ 2 defines an API and signals for the callback to communicate with GTK+ widgets, and handles those signals in its text entry widgets. It took me about two years to wade through other tasks and get back to this issue. I have proposed a patch (plus a polishing patch) against GTK+ 2.1.3 to add the String Conversion Callback to its imxim module. This bridges the gap between Thai XIM in the X library and the text entry widgets in many GTK+ 2 applications. However, applications that define their own widgets or text entries, such as Gnumeric 1.1.13, still need to be patched to handle the retrieve_surrounding and delete_surrounding signals properly.

For GTK+, I have also made a patch that adds the imthai module, a platform-independent IM module, to GTK+ 2.2.0.

Input Schemes

As mentioned in the XIM section above, the original X library supports four modes of input sequence checking, three of which are standardized by the Thai API Consortium in the WTT 2.0 draft. I have not found a reference for Thaicat yet, and would be grateful if somebody could tell me what it is.

WTT 2.0 defines only the input sequence "filtering", but not "correction". So, I have enhanced it with a set of rules on top for the correction capability.

Given:

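CP(a,b): tests if a can be composed by b in the same display cell
AC(a,b): tests if b can follow a in a new display cell
x, y, z: judging from the examples, x and y are the last two characters in the buffer (y being the most recent) and z is the newly typed character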

The rules are as follows:


if CP(x,z) then
  if CP(z,y) then
    reorder(y -> zy)  // e.g. ก่ + -ี -> กี่
  elif CP(x,y) then
    replace(y -> z)   // e.g. กิ + -ี -> กี ; ธ์ + -ู -> ธู
  elif y is FV1 and z is TONE then
    reorder(y -> zy)  // e.g. นำ + -้ -> น้ำ ; ทา + -่ -> ท่า ;
                      //      นะ + -่ -> น่ะ
  else
    reject(z)  // e.g. กา + -ี -> กา ; กเ + -่ -> กเ
  endif
elif AC(x,z) then
  replace(y -> z)  // e.g. เ + แ -> แ ; กแ + ฤ -> กฤ ; ฤก + ๅ -> ฤๅ
else
  reject(z)  // e.g. ธุ์ + -ู -> ธุ์
endif
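The control flow of these rules can be tried out with a small sketch. Here x and y are taken to be the last two characters in the buffer and z the new key, which is my reading of the examples. The CP and AC predicates below are toy stand-ins covering only a handful of characters, not the WTT 2.0 tables, and all names are invented for illustration.

```python
CONSONANTS = set("\u0e01\u0e17\u0e19\u0e18")   # KO KAI, THO THAHAN, NO NU, THO THONG
UPPER_VOWELS = set("\u0e34\u0e35")             # SARA I, SARA II
TONES = set("\u0e48\u0e49")                    # MAI EK, MAI THO
FV1 = set("\u0e30\u0e32\u0e33")                # SARA A, SARA AA, SARA AM


def toy_cp(a: str, b: str) -> bool:
    """Toy CP: can b be composed onto a in the same display cell?"""
    if a in CONSONANTS:
        return b in UPPER_VOWELS or b in TONES
    return a in UPPER_VOWELS and b in TONES


def toy_ac(a: str, b: str) -> bool:
    """Toy AC: always False, so the AC branch is never taken in this sketch."""
    return False


def correct(x, y, z, cp=toy_cp, ac=toy_ac):
    """Return the text that replaces y, or None when z is rejected."""
    if cp(x, z):
        if cp(z, y):
            return z + y              # reorder(y -> zy)
        if cp(x, y):
            return z                  # replace(y -> z)
        if y in FV1 and z in TONES:
            return z + y              # reorder(y -> zy)
        return None                   # reject(z)
    if ac(x, z):
        return z                      # replace(y -> z)
    return None                       # reject(z)


if __name__ == "__main__":
    # KO KAI + MAI EK in the buffer, SARA II typed: vowel is moved before the tone.
    print(correct("\u0e01", "\u0e48", "\u0e35") == "\u0e35\u0e48")
```

Passing the real WTT tables as cp and ac would be needed for anything beyond this demonstration.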

The behaviors of the modes are as follows:

- Passthrough (WTT level 0) doesn't screen any input sequence. Users may input any illegal sequence at their own risk, and it is up to the rendering engine how to handle such sequences.
- BasicCheck (WTT level 1) ensures that the input sequence conforms to canonical order, which will render in general renderers and will not be ambiguous in terms of text processing. For example, `ที่' will always be input as `ท' ` ี' ` ่', not `ท' ` ่' ` ี'. A keystroke that would begin an invalid sequence is absorbed, or corrected if possible, by the input method.
- Strict (WTT level 2) adds grammatical restrictions to the BasicCheck mode, which can help screen out a number of typos and improve the syntactic correctness of documents (as far as an input method can, at least). For example, LAKKHANG YAO (ๅ) is allowed only after RO RU (ฤ) or LO LU (ฦ); upper/lower vowels are not allowed after leading vowels, etc. A keystroke that would begin an invalid sequence is absorbed, or corrected if possible, by the input method.

Resource Summary

Resources for XFree86 4.0.1d

Alternative Thai XKB map with three keysym sets for some uninternationalized apps
by ]d

Patches against XFree86 4.0.1f

XIM Patch #1 to fix keyboard translation problem
by Theppitak Karoonboonyanan

Patches against XFree86 4.0.2

TIS620 Patch #1 to change character set encoding from TIS620.2533 to TIS620
by Chanop Silpa-Anan
XIM Patch #1 to fix <Shift> key problem and to experiment with XNStringConversionCallback feature (Cancelled)
by Theppitak Karoonboonyanan
XIM Patch #2 to fix <Shift> key problem and to experiment with XNStringConversionCallback feature
by Theppitak Karoonboonyanan

Patches against XFree86 4.0.99.1

XIM Patch #1 to fix <Shift> key problem and to experiment with XNStringConversionCallback feature
by Theppitak Karoonboonyanan

Patches for xfsft

ttmkfdir patch to add Windows and MacThai extension check
by Theppitak Karoonboonyanan
tis620-2.enc for your /usr/X11R6/lib/X11/fonts/encodings/tis620-2.enc

Patches for XFree86 CVS snapshots between 4.2.1-4.3.0

ISO-8859-11 XLC to add th_TH.ISO-8859-11 X locale
XIM Patch #1 to fix a bug that breaks <Ctrl-key> keys
UTF-8 XLC to add th_TH.UTF-8 X locale, as well as UTF-8 XIM
XIM Patch #2 to fix String Conversion Callback protocol, and to add input sequence correction capability
XIM Patch #3 to fix a bug in the previous patch which prevents apps without String Conversion Callback from falling back gracefully
TIS XKB to add th_tis (TIS-820.2538) keyboard map

Patches for GTK+ 2.2.0

imxim to add String Conversion Callback in GTK+ imxim module
additional polishing patch for imxim module
imthai to add imthai module

Patches for Gnumeric 1.1.13

surrounding signals handling to add handles for retrieve_surrounding and delete_surrounding signals in GnmCanvas