The native Thai character set is an 8-bit code set (TIS-620), and a large body of existing text files and exchanged e-mail is encoded in TIS-620. The character set has been supported by the th_TH locale of the GNU C library since version 2.1.1.
However, as the world moves toward the UCS (ISO/IEC 10646-1), X has begun to support UCS in many respects. Since Thai characters have been covered from the very first effort of the Unicode Consortium, Thai has already passed the stage of character encoding definition for multilingual support.
TIS-620 is the national standard for Thai character set for use in computers. It is an extension to ISO 646, a 7-bit code set which is very close to ASCII. TIS-620 was first published in 1986 (2529 B.E.) and was amended in 1990 (2533 B.E.) in some details for the conformance with ISO/IEC 2022, but the code table was still the same.
TIS-620 has been supported by GNU C library under the th_TH locale. The iconv(3) can be used to convert between TIS-620 and other encodings.
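As a rough illustration of the conversion mentioned above, the sketch below re-encodes TIS-620 bytes as UTF-8. It uses Python's bundled "tis-620" codec rather than iconv(3) itself, and the function name tis620_to_utf8 is made up for this example.

```python
def tis620_to_utf8(data: bytes) -> bytes:
    """Re-encode TIS-620 bytes as UTF-8 (comparable to `iconv -f TIS-620 -t UTF-8`)."""
    # Decoding with Python's "tis-620" codec yields Unicode text,
    # which is then serialised as UTF-8.
    return data.decode("tis-620").encode("utf-8")


# The word "Thai" in Thai script (SARA AI MAIMALAI, THO THAHAN, YO YAK)
# as three TIS-620 bytes.
sample = bytes([0xE4, 0xB7, 0xC2])
utf8_bytes = tis620_to_utf8(sample)
print(len(sample), "TIS-620 bytes ->", len(utf8_bytes), "UTF-8 bytes:", utf8_bytes.hex())
```

Each Thai character takes one byte in TIS-620 and three bytes in UTF-8, which is worth keeping in mind when converting fixed-width legacy files.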
In X11R6, TIS-620 has been defined as TIS620.2533 in the th_TH.TACTIS locale. In fact, TACTIS is an extension to TIS-620 with the word break character (0xDC) proposed by the Thai API Consortium. But that has never been adopted anywhere, and no existing application supports the word break character. (We have its equivalent, the Zero-Width Space [ZWSP U+200B], in the ISO/IEC 10646-1 table.) Therefore, we propose to use the th_TH.TIS-620 locale instead. Moreover, the equivalence of TIS620.2529 and TIS620.2533 in the code table has caused confusion among font vendors and applications. So, we propose to remove the Buddhist Era extension and use just TIS620 instead.
Chanop Silpa-Anan has provided a patch against XFree86 4.0.2 for the code set change.
The 8-bit nature of the TIS-620 code table is very similar to that of the tables for western scripts, which are defined in ISO/IEC 8859. Therefore, TISI has tried to promote the TIS-620 code table to the international industry by proposing a Latin/Thai part of ISO/IEC 8859 with the same table. The table was assigned as part 11 of the standard. Although it was once rejected because of the combining characters in Thai script, which make it different from the scripts in the other parts, it was later reactivated and proclaimed an international standard in December 2000. I have kept an old link to its FCD, which you may study, but it cannot be used as a reference.
In essence, ISO-646 + TIS-620 + 0xA0 = ISO-8859-11, and the new standard has become known in more internationalized applications. Therefore, we will support both encodings in parallel, and expect ISO-8859-11 to be our future encoding.
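The relation between the two code tables can be checked with a short script. This is only a sketch that compares Python's bundled "tis-620" and "iso8859-11" codec tables, on the assumption that they follow the respective standards; decode_byte is a helper invented for this example.

```python
def decode_byte(value: int, codec: str):
    """Return the character a single byte decodes to, or None if the codec leaves it undefined."""
    try:
        return bytes([value]).decode(codec)
    except UnicodeDecodeError:
        return None


# Byte values on which the two codec tables disagree.
differing = [b for b in range(0x100)
             if decode_byte(b, "tis-620") != decode_byte(b, "iso8859-11")]

print("0xA0 in ISO-8859-11:", repr(decode_byte(0xA0, "iso8859-11")))
print("0xA0 in TIS-620:    ", repr(decode_byte(0xA0, "tis-620")))
print("byte values that differ:", [hex(b) for b in differing])
```

The Thai letters themselves (0xA1 and above) decode identically in both codecs, so text without 0xA0 can be labelled with either name.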
GNU C library 2.3 has already supported th_TH.ISO-8859-11 locale. In addition, I have made a patch to add th_TH.ISO8859-11 locale to XFree86 4.2.1.
Many applications have now moved toward multilingual support, in which the core character set standard is ISO/IEC 10646 (Universal Multiple-Octet Coded Character Set, UCS). If you are familiar with Unicode, it is essentially the same. The Unicode Consortium was founded before ISO/IEC took up its work and formed a working group to make it an international standard. Now the Unicode Consortium sits on the committee to propose drafts and vote with other delegates from member countries.
For X, UCS support has strong momentum. The LI18NUX has promoted the use of the UTF-8 locales provided by the GNU C library. In X itself, the X library has multibyte and wide-char support. xterm now supports UTF-8 and some multilingual text rendering. Markus Kuhn has prepared a set of BDF UCS fonts that contains characters of as many languages as possible (Thai included), which has been included in XFree86 4.0. The Pango rendering engine of the GTK+ toolkit provides multilingual text rendering for many languages based on UCS encoding. Qt has also supported UCS since version 2.0.
For more information about the use of UCS in Linux, please visit Markus Kuhn's UTF-8 and Unicode FAQ for Unix/Linux.
Thai has been allocated in the range of 0x0E00-0x0E7F of Unicode, and hence ISO/IEC 10646-1, with the same layout as the TIS-620 national standard.
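Because the layout is the same, a TIS-620 byte can be turned into its UCS code point with a constant offset. The following is a minimal sketch of that arithmetic; the function and constant names are invented here, and the assigned ranges (0xA1-0xDA and 0xDF-0xFB) are taken from the TIS-620 table as I understand it.

```python
THAI_BLOCK_START = 0x0E00   # first code point of the Thai block in UCS
TIS620_HIGH_BASE = 0xA0     # TIS-620 byte that lines up with U+0E00


def tis620_byte_to_codepoint(b: int) -> int:
    """Map one TIS-620 byte to its UCS code point using the fixed offset."""
    if b < 0x80:
        return b  # the lower half is plain ISO 646 / ASCII
    if 0xA1 <= b <= 0xDA or 0xDF <= b <= 0xFB:
        return THAI_BLOCK_START + (b - TIS620_HIGH_BASE)
    raise ValueError(f"byte 0x{b:02X} is not assigned in TIS-620")


# KO KAI: byte 0xA1 lands on U+0E01.
print(hex(tis620_byte_to_codepoint(0xA1)))
```

A table-free conversion like this is convenient in small tools, but a real converter would normally rely on iconv or a codec library instead.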
For more information on Thai character sets, please visit Trin Tantsetthi's An annotated reference to the Thai implementations.
I have made a patch to add th_TH.UTF-8 locale, as well as UTF-8 support in Thai XIM, to XFree86 4.2.1. Note that this patch requires another XIM patch to be applied first.
This section discusses the various font encodings that can be used to make X applications display Thai text.
Some European-centric applications have no idea about Thai character
sets and, in principle, should not be able to display Thai at all. However,
we still need to browse Thai web pages on X, and that capability is
essential for Thai users who migrate from other operating systems. To achieve
this, thanks to the similar 8-bit nature of the Latin and Thai character sets,
we can deceive those applications by offering a Thai font under the name of
the iso8859-1 encoding. This can be done by creating alias entries in the
fonts.alias file in your font directory, like this:
-thai-fixed-medium-r-normal--14-100-100-100-m-70-iso8859-1 -thai-fixed-medium-r-normal--14-100-100-100-m-70-tis620-0
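When many fonts need the same treatment, alias lines like the one above can be generated rather than typed. The sketch below is one possible way to do it; latin1_alias_line is a hypothetical helper, and it assumes a well-formed 14-field XLFD name without embedded spaces (names with spaces would need quoting in fonts.alias).

```python
def latin1_alias_line(real_xlfd: str) -> str:
    """Build a fonts.alias line that offers a Thai font under an iso8859-1 name."""
    fields = real_xlfd.split("-")
    # An XLFD name starts with '-', so splitting gives one empty string
    # followed by the 14 XLFD fields.
    if len(fields) != 15 or fields[0] != "":
        raise ValueError("not a 14-field XLFD name: " + real_xlfd)
    # The last two fields are the charset registry and encoding; swap them.
    alias = "-".join(fields[:-2] + ["iso8859", "1"])
    return f"{alias} {real_xlfd}"


print(latin1_alias_line("-thai-fixed-medium-r-normal--14-100-100-100-m-70-tis620-0"))
```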
Applications which can be deceived this way include:
There have been names found to be used as Thai character encodings:
However, because Thai text needs glyph adjustments for elegant display and typesetting (called "shaping" in developers' parlance), a convention has been set for naming the different Thai code sets in XLFD, as follows:
Please note the omission of .2533 and .2529 Buddhist Era extensions to prevent the confusion, because the two versions of standard are essentially the same.
We are working to have new implementations adopt this convention for Thai text rendering, although the old font sets might still be supported for backward compatibility. The implementation known to support this convention for now is the Pango engine of the GTK+ project.
I have provided a patch for ttmkfdir to generate tis620-[012] entries for the fonts.scale file from Thai TrueType fonts.
Markus Kuhn has prepared a set of BDF UCS fonts that contains characters of as many languages as possible (Thai included), which has been included in XFree86 4.0.
Modern TrueType fonts for Windows are now encoded in Unicode, although most Thai TrueType fonts for Macintosh are still based on 8-bit mapping. Saying that a TrueType font is Unicode means that its glyphs are indexed with the same code values as their corresponding characters, so that a Unicode string can be directly rendered without further mapping. Macintosh fonts are said to be based on 8-bit mapping, even though their glyphs are internally indexed by 16-bit integers, because those indices are not equal to the Unicode character codes. What the external world sees is their mapping tables, which map 8-bit characters to ad hoc 16-bit glyph indices. Therefore, it can be said that Windows TrueType fonts now support ISO/IEC 10646-1 and can be used in XFree86 via its TrueType support, while Macintosh fonts are not necessarily usable until new MacOS versions fully support Unicode.
Before XFree86 4.0, native fonts for X were mainly BDF and PostScript. However, many efforts have been made to use TrueType fonts via the X font server (xfs), for example, the xfstt TrueType renderer by Herbert Dürr, xfsft by Mark Leisher and Juliusz Chroboczek, and X-TrueType (or X-TT) by Takuya SHIOZAKI. xfstt renders TrueType fonts by itself, while the other two employ the FreeType library. X-TT was designed especially for East Asian fonts.
XFree86 4.0 has incorporated the xfsft (as the freetype module) and X-TT (as the xtt module) for TrueType font support. So, one can use TrueType fonts either via the XFree86 X server with one of the above modules loaded, or via the X font server.
In most TrueType fonts nowadays, glyphs are indexed using Unicode, so as to be able to contain glyphs of many languages in one font, and to reserve area for glyph variations for some languages. Therefore, there needs to be some character mapping mechanism from the local character set in use to the corresponding Unicode glyphs. (The map from local character set into the ISO/IEC 10646-1 table is called repertoire map.) This mapping is essential for the support of the so-called legacy character sets in Unicode-based applications, including the TrueType font servers.
xfsft hardwires internal conversion tables for iso10646-1, iso8859-<n> for n = 1 to 10 and 15, koi8-r, koi8-u, koi8-ru, koi8-uni, and koi8-e for all scalable fonts (Type1, Speedo, TrueType); microsoft-symbol for TrueType; and apple-roman for Apple TrueType. Apart from these, one can create a repertoire map for their own character set under the /usr/X11R6/lib/X11/fonts/encodings directory. (See the README file in the directory for the description on the structure of the .enc files, and ISO8859-11.enc there for TIS-620 repertoire map.) xfsft will find these files via the encodings.dir in the font directory (just similar to fonts.dir in font listing).
To add a new repertoire map, you put your .enc in the /usr/X11R6/lib/X11/fonts/encodings directory. Then, to make your TrueType fonts mapped when rendering text of your new encoding, go to the directory that contains your TrueType fonts and do the following steps:
4
angsa.ttf -monotype-Angsana New-medium-r-normal--0-0-0-0-p-0-iso8859-1
angsa.ttf -monotype-Angsana New-medium-r-normal--0-0-0-0-p-0-iso8859-11
angsa.ttf -monotype-Angsana New-medium-r-normal--0-0-0-0-p-0-tis620-0
angsa.ttf -monotype-Angsana New-medium-r-normal--0-0-0-0-p-0-tis620-2
This step may be automated using Joerg Pommnitz's ttmkfdir utility, if your encoding is supported by it:
$ ttmkfdir > fonts.scale
Or, as an alternative in XFree86 4.3.0, you can also use the mkfontscale command:
$ mkfontscale
$ mkfontdir -e /usr/X11R6/lib/X11/fonts/encodings
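If neither ttmkfdir nor mkfontscale knows your encoding, the fonts.scale file can also be written by a small script. The sketch below only reproduces the layout of the listing shown earlier in this section (an entry count on the first line, then one line per font file and XLFD name); fonts_scale_text is a hypothetical helper, not part of any X utility.

```python
def fonts_scale_text(font_file: str, xlfd_prefix: str, encodings: list) -> str:
    """Return fonts.scale content offering one font file under several encodings."""
    lines = [f"{font_file} {xlfd_prefix}-{enc}" for enc in encodings]
    # The first line of fonts.scale is the number of entries that follow.
    return "\n".join([str(len(lines)), *lines]) + "\n"


text = fonts_scale_text(
    "angsa.ttf",
    "-monotype-Angsana New-medium-r-normal--0-0-0-0-p-0",
    ["iso8859-1", "iso8859-11", "tis620-0", "tis620-2"],
)
print(text, end="")
```

The output would be saved as fonts.scale in the font directory before running mkfontdir.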
To use TrueType fonts in your XFree86 4.x, you can use one of these methods:
clone-self = on
use-syslog = off
catalogue = /usr/X11R6/lib/X11/fonts/TrueType/,/usr/share/fonts/TrueType/
error-file = /usr/X11R6/lib/X11/fs/fs-errors
# in decipoints
default-point-size = 120
default-resolutions = 75,75,100,100

# xfs -port 7100 &
or by using your distribution's init.d script, like this for Debian:
# /etc/init.d/xfs restart
or for RedHat:
# /etc/rc.d/init.d/xfs restart
$ xset +fp tcp/localhost:7100
$ xset fp rehash
or to add it permanently, add this line to your /etc/X11/XF86Config file:
Section "Files"
    ...
    FontPath "tcp/localhost:7100"
    ...
EndSection
XKB is the keyboard map for X. It is used in converting key strokes to key symbols defined in <X11/keysymdef.h>, with language switching capability. Another X routine converts the key symbols into corresponding character codes.
You can set the XKB map on the fly using the setxkbmap command. For example, to use Thai keymap, you type this in an X terminal:
$ setxkbmap th
To set it permanently, you can set this line in your /etc/X11/XF86Config file:
Section "Keyboard"
    ...
    Option "XkbLayout" "th"
    ...
EndSection
or use XF86Setup or xf86config utility, in the keyboard section.
However, beginning in XFree86 4.3.0, the default XKB rule allows keymap composing from 2-4 maps, each with a single group. Therefore, to use Thai-English bilingual keyboard, you need to compose the us keymap and th together, and may define your group toggle key using XKB option:
$ setxkbmap us,th -option grp:alt_shift_toggle
or, in /etc/X11/XF86Config:
Section "Keyboard"
    ...
    Option "XkbLayout" "us,th"
    Option "XkbOptions" "grp:alt_shift_toggle"
    ...
EndSection
The Thai XKB map was introduced into X Window in X11R6. The map (/usr/X11R6/lib/X11/xkb/symbols/th) defined two maps, one for English, and the other for Latin1 with the same numerical values as TIS-620. This made Thai keyboard input in X work fine for a while, although it was a kind of Latin hack and the keysyms defined for Thai were not used at all.
Pablo Saratxaga contributed the Thai XKB map using the Thai keysyms in XFree86 4.0.1d. This is the point that triggered the later fixes which made the Thai input system more complete. When the new XKB map came into use, many uninternationalized applications failed to accept Thai keys. So, ]d provided an XKB map with 3 maps, one for English, another for the Latin1 equivalent of TIS-620, and the last one for genuine Thai keysyms. This map served Thai keyboard input well during the transition stage, although it was not actually incorporated in XFree86.
In addition, I have provided a patch to add th_tis (TIS-820.2538) XKB map to XFree86 4.3.0 as well.
Technically speaking, Thai keyboard translation using the XKB map with Thai keysyms needs the locale to be set to th_TH (by both the LANG/LC_CTYPE environment setting and the application's setlocale() call), and the XmbLookupString() call in translation (instead of the old XLookupString()). With the new Thai XIM in XFree86 4.0.1g, the translation will be done in the XFilterEvent() call. (See XIM below.)
Thai XIM has been implemented using the library model in the X library. But it didn't work with the new features later introduced in X. With the Thai POSIX locale introduced in the GNU C library, the XIM became activated and turned out to have some defects in the key translation. Pruet Boonma raised the issue of Thai XIM support in the X library and the need for a fix. However, it remained unfixed for a long time, and a workaround of setting LC_CTYPE to C was used in the meantime.
The issue was later brought up for discussion again in the Thai Linux Working Group. ]d had done an initial analysis before asking me if I was interested in working it out. And I said yes.
The result was a patch against XFree86 4.0.1f which was checked in and became active in XFree86 4.0.2. To use Thai XIM, you need to do the following:
$ export LC_CTYPE=th_TH
(Note that LC_ALL, if set, overrides LC_CTYPE, and LANG is used only when neither of them is set.)
$ export XMODIFIERS="@im=BasicCheck"
selects the basic sequence checking method.
Note that the above two steps succeed only with the cooperation of the application and the system settings. The first step needs the application to call setlocale(LC_CTYPE, ""), the C library to provide the th_TH locale, X to provide the th_TH X locale with the proper Thai font set (both in the XLC_LOCALE description and in the hard-wired code, which needs Chanop's patch to be applied if you want to use the TIS620 font set), and (optionally) the Thai fonts with the corresponding font set to be installed. The second step needs the application to call XSetLocaleModifiers("") before doing any XIM work. (Some applications just call XSetLocaleModifiers("@im=none") and ignore the XMODIFIERS setting, so that a European XIM is activated instead of the Thai one.) And most importantly, the application must support XIM.
However, a bug slipped through regarding the Shift key, which cleared the input sequence check state and caused all shifted keys to be rejected. I have fixed this with another patch against XFree86 4.0.2, and the corresponding patch for XFree86 4.0.99.1.
The input method works fine for continuous text typing. But it still fails to keep up with the context when the cursor is moved. Therefore, a buffer retrieval mechanism is required for correctly determining the validity of a key. This can be done through the XNStringConversionCallback value of the X Input Context (XIC). The aforementioned patch also adds experimental code for such retrieval. A later patch, plus another bug fix, has added the capability to correct input sequences as well. However, it is also necessary to push the requirement for callback support to toolkit and application developers. As an experiment, I have added this to the xiterm+thai 1.04pre2 terminal emulator.
However, as there are tons of applications whose developers would have to be persuaded to provide the callback, it may also be necessary for the input method to do its best to fall back gracefully in the absence of such a callback.
The good news is that GTK+ 2 has defined an API and signals for the callback to communicate with GTK+ widgets, as well as the signal handling in GTK+ text entry widgets. It took me about two years to wade through other tasks and get back to this issue. I've proposed a patch (plus a polishing patch) against GTK+ 2.1.3 to add the String Conversion Callback to its imxim module. This bridges the gap between Thai XIM in the X library and the text entry widgets in many GTK+ 2 applications. However, some applications that define their own widgets or text entries, such as Gnumeric 1.1.13, still need to be patched so that they handle the retrieve_surrounding and delete_surrounding signals properly.
For GTK+, I have also made a patch against GTK+ 2.2.0 to add the imthai module, which is a platform-independent IM module for GTK+.
As mentioned above in the XIM section, there are four modes of input sequence check supported in the original X library, three of which are standardized by the Thai API Consortium in the WTT 2.0 draft. I can't find the reference for Thaicat yet, and I would be grateful if somebody could tell me what it is.
WTT 2.0 defines only the input sequence "filtering", but not "correction". So, I have enhanced it with a set of rules on top for the correction capability.
Given:
The rules are as follows:
    if CP(x,z) then
        if CP(z,y) then
            reorder(y -> zy)    // e.g. ก่ + -ี -> กี่
        elif CP(x,y) then
            replace(y -> z)     // e.g. กิ + -ี -> กี ; ธ์ + -ู -> ธู
        elif y is FV1 and z is TONE then
            reorder(y -> zy)    // e.g. นำ + -้ -> น้ำ ; ทา + -่ -> ท่า ;
                                //      นะ + -่ -> น่ะ
        else
            reject(z)           // e.g. กา + -ี -> กา ; กเ + -่ -> กเ
        endif
    elif AC(x,z) then
        replace(y -> z)         // e.g. เ + แ -> แ ; กแ + ฤ -> กฤ ; ฤก + ๅ -> ฤๅ
    else
        reject(z)               // e.g. ธุ์ + -ู -> ธุ์
    endif
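The rules above can be transcribed into a small runnable sketch. Several things here are assumptions of mine: x and y are taken to be the two characters before the cursor (y being the nearer one) and z a key that the sequence check has just rejected after y; toy_cp and toy_ac are simplified stand-ins for the CP and AC checks, invented for illustration, and do not reproduce the real WTT 2.0 tables.

```python
from typing import Callable, Tuple

# Toy character classes, invented for this sketch (NOT the WTT 2.0 tables).
DEPENDENT_VOWELS = set("\u0e31\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39")  # above/below vowels
TONES = set("\u0e48\u0e49\u0e4a\u0e4b")   # MAI EK .. MAI CHATTAWA
FV1 = set("\u0e30\u0e32\u0e33")           # SARA A, SARA AA, SARA AM


def is_consonant(ch: str) -> bool:
    return "\u0e01" <= ch <= "\u0e2e"


def toy_cp(a: str, b: str) -> bool:
    """Simplified 'b may compose on a' check."""
    if is_consonant(a):
        return b in DEPENDENT_VOWELS or b in TONES
    return a in DEPENDENT_VOWELS and b in TONES


def toy_ac(a: str, b: str) -> bool:
    """Simplified 'b is acceptable after a' check: any non-combining character."""
    return b not in DEPENDENT_VOWELS and b not in TONES


Check = Callable[[str, str], bool]


def correct(x: str, y: str, z: str,
            cp: Check = toy_cp, ac: Check = toy_ac) -> Tuple[str, str]:
    """Return (action, text that should stand in place of y)."""
    if cp(x, z):
        if cp(z, y):
            return "reorder", z + y       # z slips in before y
        if cp(x, y):
            return "replace", z           # z takes the place of y
        if y in FV1 and z in TONES:
            return "reorder", z + y       # tone mark goes before the following vowel
        return "reject", y
    if ac(x, z):
        return "replace", z
    return "reject", y


# KO KAI + MAI EK, then SARA II is typed.
print(correct("\u0e01", "\u0e48", "\u0e35"))
```

Passing the two checks as parameters keeps the rule logic separate from the character tables, so a real implementation could plug in the proper WTT 2.0 classes without touching correct().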
The behaviors of the modes are as follows:
Copyright © 2001 by Theppitak Karoonboonyanan. All rights reserved.