Thai Locale

Contents

  1. Locale and Internationalization
  2. Thai POSIX Locale Definition
  3. Building Thai POSIX Locale
  4. Acknowledgements
  5. Disclaimer
  6. Related Sites

1 Locale and Internationalization

Internationalization, also known as I18N, is the provision within a computer program of the capability to adapt to the requirements of different native languages, local customs and coded character sets. Internationalized software can adapt to any local culture without recompilation by relying on an abstract set of conventions whose details are left to the localization process.

A locale definition fills that slot by describing the cultural conventions of a particular community. Several kinds of locale have been defined on different systems, such as the POSIX locale for basic cultural conventions, ISO/IEC 14652 as a POSIX enhancement, the X locale for X Window applications, and the CTL locale for complex text layout.

1.1 POSIX Locale and Standard C Interface

POSIX (Portable Operating System Interface for Computing Environments) is an international standard that defines the interface between applications and the operating system. It improves software portability, since the interface is guaranteed to be available on any conforming OS.

One area covered by POSIX is internationalization, where the term locale denotes a set of language and cultural rules.

Locale Naming Conventions

Each locale is named for the language, territory, and character code set it describes. The following format is used for naming a locale:

language[_territory][.codeset]

The language part must be one of the language codes defined in ISO 639:1988, and the territory part is one of the Alpha-2 code elements defined in ISO 3166-1. For example, the French language spoken in Canada using the ISO-8859-1 code set is fr_CA.ISO-8859-1. The fr stands for the French language and the CA stands for Canada. If the codeset is omitted, an implementation-dependent default encoding for the locale is implied.

The locale describing the Thai language spoken in Thailand using the TIS-620 code set, which is explained in this article, is called th_TH.TIS-620, or in short, th_TH.
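The naming format above can be illustrated with a small parser. This is only a sketch: the struct and function names are hypothetical, the buffer sizes are arbitrary, and modifiers such as "@euro", which some implementations append, are not handled.

```c
#include <string.h>

/* Hypothetical helper type: the three parts of a POSIX locale name. */
struct locale_name {
    char language[16];
    char territory[16];
    char codeset[32];
};

/* Split "language[_territory][.codeset]" into its parts.
   Missing optional parts are left as empty strings.
   Returns 0 on success, -1 if a part is empty or too long. */
static int parse_locale_name(const char *name, struct locale_name *out)
{
    size_t lang_len = strcspn(name, "_.");   /* language ends at '_' or '.' */
    const char *p = name + lang_len;

    memset(out, 0, sizeof *out);
    if (lang_len == 0 || lang_len >= sizeof out->language)
        return -1;
    memcpy(out->language, name, lang_len);

    if (*p == '_') {                         /* optional territory */
        size_t terr_len = strcspn(++p, ".");
        if (terr_len == 0 || terr_len >= sizeof out->territory)
            return -1;
        memcpy(out->territory, p, terr_len);
        p += terr_len;
    }
    if (*p == '.') {                         /* optional codeset */
        size_t cs_len = strlen(++p);
        if (cs_len == 0 || cs_len >= sizeof out->codeset)
            return -1;
        memcpy(out->codeset, p, cs_len);
    }
    return 0;
}
```

Parsing "fr_CA.ISO-8859-1" with this helper yields the parts fr, CA and ISO-8859-1, while "th_TH" leaves the codeset empty.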

Categories in POSIX Locale

  1. LC_CTYPE - character classification
  2. LC_COLLATE - string collation
  3. LC_TIME - date/time format
  4. LC_NUMERIC - number format
  5. LC_MONETARY - currency format
  6. LC_MESSAGES - locale messages

Standard C Programming with Locale

A number of Standard C library functions adapt to the local culture, so a program written in C can be internationalized to some degree. If no locale is specified, it defaults to POSIX (also called C), in which everything works as defined in the C standard. Otherwise, the setlocale() function call determines the locale to use. In general, an empty string is passed to the function, and the locale is then taken from the environment variables LANG and LC_* as follows:

LANG Default value for unspecified categories
LC_ALL If defined, override all locale categories
LC_COLLATE The name of the locale for collation information
LC_CTYPE The name of the locale for character classification
LC_MONETARY The name of the locale for money related information
LC_NUMERIC The name of the locale for numeric editing
LC_TIME The name of the locale for date- and time-formatting information
LC_MESSAGES The name of the locale for messages
Locale Setting
    <locale.h>
    char* setlocale(int category, const char* locale);

    Example :

    const char* pLocale = setlocale(LC_ALL, "");

      sets the current locale to the one specified by the LANG or LC_*
      environment variables, and returns its name (or NULL on failure).
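Since setlocale() returns NULL when the requested locale cannot be loaded, a program may want to fall back explicitly. The following is a minimal sketch of that pattern; the helper name init_locale is hypothetical.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: adopt the locale named by the environment,
   falling back to the portable "C" locale when it is not installed.
   Returns the name of the locale now in effect for LC_ALL. */
static const char *init_locale(void)
{
    /* "" asks the C library to consult LC_ALL, LC_* and LANG. */
    if (setlocale(LC_ALL, "") == NULL) {
        /* The environment names a locale this system cannot load. */
        setlocale(LC_ALL, "C");
    }
    /* A NULL locale argument queries the setting without changing it. */
    return setlocale(LC_ALL, NULL);
}
```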
LC_CTYPE
    <ctype.h>
    int iscntrl(int c);
    int isgraph(int c);
    int isprint(int c);
    int isspace(int c);
    int ispunct(int c);
    int isalnum(int c);
    int isalpha(int c);
    int isdigit(int c);
    int isxdigit(int c);
    int islower(int c);
    int isupper(int c);
    int tolower(int c);
    int toupper(int c);
LC_COLLATE
    <string.h>
    int strcoll(const char* s1, const char* s2);
    size_t strxfrm(char* dest, const char* src, size_t n);
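As an illustration of strcoll(), the sketch below sorts an array of strings by the collation rules of the current locale. The function names are hypothetical, and the caller is assumed to have selected a locale with setlocale() beforehand; in the default C locale the result is plain byte order.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: order two strings by the current LC_COLLATE rules. */
static int compare_by_locale(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcoll(*sa, *sb);
}

/* Sort an array of string pointers in place using locale collation. */
static void sort_words(const char **words, size_t count)
{
    qsort(words, count, sizeof *words, compare_by_locale);
}
```

When the same strings are compared many times, transforming them once with strxfrm() and comparing the results with strcmp() may be cheaper.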
LC_TIME
    <time.h>
    size_t strftime(char* s, size_t maxsize,
                    const char* format, const struct tm* tp);
LC_NUMERIC, LC_MONETARY
    <locale.h>

    struct lconv
    {
      /* Numeric (non-monetary) information.  */
    
      char *decimal_point;		/* Decimal point character.  */
      char *thousands_sep;		/* Thousands separator.  */
      /* Each element is the number of digits in each group;
         elements with higher indices are farther left.
         An element with value CHAR_MAX means that no further grouping is done.
         An element with value 0 means that the previous element is used
         for all groups farther left.  */
      char *grouping;
    
      /* Monetary information.  */
    
      /* First three chars are a currency symbol from ISO 4217.
         Fourth char is the separator.  Fifth char is '\0'.  */
      char *int_curr_symbol;
      char *currency_symbol;	/* Local currency symbol.  */
      char *mon_decimal_point;	/* Decimal point character.  */
      char *mon_thousands_sep;	/* Thousands separator.  */
      char *mon_grouping;		/* Like `grouping' element (above).  */
      char *positive_sign;		/* Sign for positive values.  */
      char *negative_sign;		/* Sign for negative values.  */
      char int_frac_digits;		/* Int'l fractional digits.  */
      char frac_digits;		/* Local fractional digits.  */
      /* 1 if currency_symbol precedes a positive value, 0 if succeeds.  */
      char p_cs_precedes;
      /* 1 iff a space separates currency_symbol from a positive value.  */
      char p_sep_by_space;
      /* 1 if currency_symbol precedes a negative value, 0 if succeeds.  */
      char n_cs_precedes;
      /* 1 iff a space separates currency_symbol from a negative value.  */
      char n_sep_by_space;
      /* Positive and negative sign positions:
         0 Parentheses surround the quantity and currency_symbol.
         1 The sign string precedes the quantity and currency_symbol.
         2 The sign string follows the quantity and currency_symbol.
         3 The sign string immediately precedes the currency_symbol.
         4 The sign string immediately follows the currency_symbol.  */
      char p_sign_posn;
      char n_sign_posn;
    };

    struct lconv* localeconv(void);
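A short sketch of reading these fields through localeconv() follows. The helper names are hypothetical; the interpretation of the grouping string follows the comments in the structure above.

```c
#include <limits.h>
#include <locale.h>
#include <stdio.h>

/* Return the size of the rightmost digit group in the current locale,
   or 0 when the locale does not group digits. */
static int first_group_size(void)
{
    const struct lconv *lc = localeconv();   /* storage owned by the C library */
    char g = lc->grouping[0];

    if (g == '\0' || g == CHAR_MAX)          /* empty string or "no grouping" */
        return 0;
    return g;
}

/* Print a few numeric and monetary conventions of the current locale. */
static void print_conventions(FILE *out)
{
    const struct lconv *lc = localeconv();

    fprintf(out, "decimal point      : \"%s\"\n", lc->decimal_point);
    fprintf(out, "thousands separator: \"%s\"\n", lc->thousands_sep);
    fprintf(out, "first group size   : %d\n", first_group_size());
    fprintf(out, "int'l currency     : \"%s\"\n", lc->int_curr_symbol);
}
```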
LC_MESSAGES

No standard function is defined for this category yet, although there have been several proposals.

POSIX Locale Definition

In POSIX, locales can be defined using three definition files:

  1. charmap describing the character set to be used, with symbolic names (in ASCII text) for reference to the characters
  2. repertoiremap describing the mapping from the symbolic names into UCS4 (ISO/IEC 10646)
  3. locale definition file describing the specification of the locale categories

The definition files must be translated into binary data to be used by the standard libraries of programming languages that conform to the POSIX interface, such as C, Ada and Fortran. The utility program for this purpose is called localedef.

The Thai locale definition serves as an example of a POSIX locale definition. It is known to work with GNU libc 2.1.1 or later.

POSIX Commands

locale    - get current locale information
            Options:
              -a   write names of available locales
              -m   write names of available charmaps
              -c   write names of selected categories
              -k   write names of selected keywords

localedef - generate and install locale definition data
            Options:
              -f FILE  symbolic character names defined in FILE
              -i FILE  source definitions in FILE
              -u FILE  FILE contains mapping from symbolic names to UCS4
                       (ISO/IEC 10646 elements)

1.2 ISO/IEC 14652 - Specifications for Cultural Conventions

ISO/IEC 14652 is an extension of the POSIX locale. Besides specifying each of the six POSIX categories in more detail, it adds six more categories:

  1. LC_PAPER - paper size
  2. LC_NAME - personal name format
  3. LC_ADDRESS - address codes and format
  4. LC_TELEPHONE - telephone number
  5. LC_MEASUREMENT - measurement units
  6. LC_VERSIONS - locale version

All locale information can be retrieved with the nl_langinfo() function.
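As a sketch of nl_langinfo() in use, the helper below looks up weekday names in the current locale. It uses only the DAY_1 to DAY_7 items from <langinfo.h>; the helper name is hypothetical, and items for the additional ISO/IEC 14652 categories are implementation-specific and not shown.

```c
#include <langinfo.h>

/* Return the full weekday name in the current locale.
   wday follows the tm_wday convention: 0 = Sunday ... 6 = Saturday.
   The returned string is owned by the C library. */
static const char *weekday_name(int wday)
{
    /* Listed explicitly rather than computed as DAY_1 + wday, since the
       item constants are not required to be consecutive. */
    static const nl_item days[7] = {
        DAY_1, DAY_2, DAY_3, DAY_4, DAY_5, DAY_6, DAY_7
    };

    if (wday < 0 || wday > 6)
        return "";
    return nl_langinfo(days[wday]);
}
```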

1.3 ISO C++ Locale Interface

The ISO C++ <locale> standard library is another extensible internationalization framework. In ISO C++, locale is closely tied to iostream. A locale contains a set of facets, which are customizable by means of class derivation. The system locale categories, however, are created as the default facets by the standard library.

Setting Locale for iostream

To set a locale on an iostream object, use the ios_base::imbue() member function:

    #include <iostream>
    #include <locale>

    void f()
    {
        std::locale  loc("");             // create a locale object according
                                          //   to the LANG or LC_* environment
        std::cin.imbue(loc);              // let cin use loc
        // ...
        std::cin.imbue(std::locale());    // reset cin to use the global locale
    }

Locale and Facet

The key to understanding C++ locale is the facet. A facet is an interface to a service provided by a locale object. For example, num_put<> is a facet for formatting numeric values (according to LC_NUMERIC), and the collate<> facet provides string ordering (according to LC_COLLATE). A locale object in fact contains a vector of locale::facet objects.

Using Facets

use_facet<> ...

    Example:
    ...

Creating a New Facet

...

2 Thai POSIX Locale Definition

2.1 Charmap

The primary standard for Thai information interchange is TIS 620-2533 (1990 A.D.). Most international standards regarding Thai character codes are based on this national standard. This includes ISO-IR-166, an entry in the international register of coded character sets to be used with escape sequences. For more information, please see Mr. Trin Tantsetthi's An Annotated Reference to the Thai implementations.

In the TIS 620-2533 code table, the lower half (0x00-0x7F) is duplicated from ISO 646, namely the ASCII 7-bit code. The upper half (0x80-0xFF) is defined in conformance to ISO/IEC 2022. No character codes are defined in the CR area (0x80-0x9F). Thai character codes therefore begin at 0xA1 (the first Thai consonant, Ko Kai).

2.2 Repertoiremap

Thai characters are assigned codes in the range U+0E00 to U+0E5F in Unicode and ISO/IEC 10646-1. The code table is laid out in the same order as the range 0xA0-0xFF in TIS 620-2533.
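Because the two tables share the same layout, the mapping reduces to a constant offset. The sketch below assumes that 0xA0, 0xDB-0xDE and 0xFC-0xFF are the unassigned positions in the upper half of TIS 620-2533; the function and macro names are hypothetical.

```c
#include <stdint.h>

/* Sentinel for bytes that TIS 620-2533 leaves undefined (hypothetical name). */
#define TIS620_INVALID UINT32_C(0xFFFFFFFF)

/* Map one TIS 620-2533 byte to its UCS code point. */
static uint32_t tis620_to_ucs(unsigned char byte)
{
    if (byte < 0x80)
        return byte;                        /* lower half: ISO 646 / ASCII */
    if ((byte >= 0xA1 && byte <= 0xDA) || (byte >= 0xDF && byte <= 0xFB))
        return 0x0E00u + (byte - 0xA0u);    /* Thai block, same layout */
    return TIS620_INVALID;                  /* CR area and unassigned codes */
}
```

For example, 0xA1 (Ko Kai) maps to U+0E01 and 0xDF (the Baht sign) maps to U+0E3F.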

2.3 LC_CTYPE

The Thai character set is composed of Thai consonants, Thai vowels, Thai diacritics, Thai tone marks, Thai punctuation marks, and Thai digits. The Thai alphabet has no letter case. Therefore, the subcategories regarding case are the same as those of POSIX. This is also true for hexadecimal digits, space characters and control characters.

upper same as POSIX
lower same as POSIX
alpha Roman upper/lower cases + Thai consonants + Thai vowels + LAKKHANGYAO + MAITAIKHU + Thai tone marks + THANTHAKHAT
digit Arabic digits 0-9
xdigit same as POSIX
space same as POSIX
cntrl same as POSIX
punct ASCII punctuations + PAIYANNOI + BAHT + MAIYAMOK + FONGMAN + ANGKHANKHU + KHOMUT
graph upper + lower + alpha + digit + xdigit + punct + PAIYANNOI + BAHT + MAIYAMOK + FONGMAN + ANGKHANKHU + KHOMUT + PINTHU + NIKHAHIT + YAMAKKAN
print 0x20 + upper + lower + alpha + digit + xdigit + punct + PAIYANNOI + BAHT + MAIYAMOK + FONGMAN + ANGKHANKHU + KHOMUT + PINTHU + NIKHAHIT + YAMAKKAN
blank same as POSIX
toupper same as POSIX
tolower same as POSIX

2.4 LC_COLLATE

The principles of Thai string collation can be found in the Royal Institute Dictionary, 2525 B.E. Edition (1982 A.D.), the official standard dictionary for the Thai language. TIS 620-2533 was defined so that sorting according to these principles is easy. The principles are:

  1. Words are ordered alphabetically, not phonetically. (Consonant order is reflected by the range 0xA1-0xCE in TIS 620-2533, or U+0E01-U+0E2E in ISO/IEC 10646-1.)
  2. Vowels are also ordered by written forms, not by sounds. (Vowel order is reflected by the ranges 0xD0-0xD9 and 0xE0-0xE4 in TIS 620-2533, or U+0E30-U+0E39 and U+0E40-U+0E44 in ISO/IEC 10646-1.)
  3. Consonants always precede vowels. String comparison is performed from left to right, considering initial consonants before vowels in the same syllable.
  4. Tones and diacritics are normally ignored, unless all other parts are equal (in which case the order is reflected by the range 0xE7-0xEB in TIS 620-2533, or U+0E47-U+0E4B in ISO/IEC 10646-1).

The ordering in this locale definition is based on these principles, plus some additional rules for issues that the dictionary does not define. Following the Royal Institute dictionary principles, there are two basic problems to solve:

  1. Leading vowels, which are written before consonants, must be considered after the initial consonant. Rearrangement is therefore needed before the actual comparison. This is accomplished by forming collating elements: every possible pair of leading vowel and consonant is defined as a collating-element whose weight equals that of the rearranged substring.
  2. Diacritics and tone marks must be ignored in the first pass and considered in a later pass only if the first pass yields equality. This is accomplished by the multiple levels of weights defined by the LC_COLLATE specification. Weights are designed so that diacritics and tone marks are ignored at the first level, and weigh more than all consonants and vowels at the second level.
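The locale definition expresses the leading-vowel rearrangement declaratively, through collating elements. The effect on a TIS 620-2533 string can be sketched procedurally as below. This is an illustration only, not how the locale is implemented: the function names are hypothetical, and the byte ranges assume that 0xE0-0xE4 are the leading vowels and 0xA1-0xCE the consonants.

```c
#include <stddef.h>

/* Leading vowels (written before the consonant) in TIS 620-2533. */
static int is_leading_vowel(unsigned char c) { return c >= 0xE0 && c <= 0xE4; }

/* Thai consonants in TIS 620-2533. */
static int is_consonant(unsigned char c)     { return c >= 0xA1 && c <= 0xCE; }

/* Swap each leading vowel with the consonant that follows it, in place,
   so that the initial consonant is compared before the vowel. */
static void rearrange_leading_vowels(unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i + 1 < len) {
        if (is_leading_vowel(s[i]) && is_consonant(s[i + 1])) {
            unsigned char vowel = s[i];
            s[i] = s[i + 1];
            s[i + 1] = vowel;
            i += 2;             /* skip the pair just swapped */
        } else {
            i += 1;
        }
    }
}
```

A byte-wise comparison of two rearranged strings then approximates the first-level ordering; handling tone marks and diacritics would need the second pass described in item 2.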

However, some topics are not covered by the Royal Institute principles when the TIS 620-2533 code table is used. These include:

These have been discussed in the Thai-English Bilingual Sorting, and the conclusions have been adopted in this locale definition.

2.5 LC_TIME

Thai names and abbreviations for the months and the days of the week are defined in this category. The official calendar used in Thailand is a solar calendar that follows the Gregorian calendar, except that the Buddhist Era (B.E. = A.D. + 543) is used instead of Anno Domini. Thais always express dates in day, month, year order, in both full and abbreviated forms. For times, the 24-hour format is usual in Thailand; AM/PM is normally used only in displays with no surrounding text.

Unfortunately, the era definition is not a part of LC_TIME in POSIX, although some implementations have taken such enhancement from ISO/IEC 14652. Hence, we cannot define Buddhist Era in the context of POSIX yet.

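abbreviated days (%a) อา, จ, อ, พ, พฤ, ศ, ส
days (%A) อาทิตย์, จันทร์, อังคาร, พุธ, พฤหัสบดี, ศุกร์, เสาร์
abbreviated months (%b) ม.ค., ก.พ., มี.ค., เม.ย., พ.ค., มิ.ย., ก.ค., ส.ค., ก.ย., ต.ค., พ.ย., ธ.ค.
months (%B) มกราคม, กุมภาพันธ์, มีนาคม, เมษายน, พฤษภาคม, มิถุนายน, กรกฎาคม, สิงหาคม, กันยายน, ตุลาคม, พฤศจิกายน, ธันวาคม
appropriate date & time format (%c) %a %e %b %Y, %H:%M:%S
e.g. อา. 13 มิ.ย. 2542, 22:48:32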
appropriate date format (%x) %d/%m/%Y
e.g. 13/06/2542
appropriate time format (%X) %H:%M:%S
e.g. 22:48:32
AM/PM sign (%p) AM; PM
appropriate 12-hour clock format (%r) %I:%M:%S %p
e.g. 10:48:32 PM
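An application that must display Buddhist Era years can add the offset itself. The sketch below produces the dd/mm/yyyy form shown for %x, assuming the usual offset of 543 years between A.D. and B.E.; the function name is hypothetical.

```c
#include <stdio.h>
#include <time.h>

/* Assumed offset between Buddhist Era and Anno Domini year numbers. */
enum { BE_OFFSET = 543 };

/* Write the date as dd/mm/yyyy with the year in B.E.
   Returns the snprintf() result: the length of the formatted text. */
static int format_thai_date(char *buf, size_t size, const struct tm *tp)
{
    int be_year = tp->tm_year + 1900 + BE_OFFSET;   /* tm_year counts from 1900 */

    return snprintf(buf, size, "%02d/%02d/%d",
                    tp->tm_mday, tp->tm_mon + 1, be_year);
}
```

For 13 June 1999 this writes 13/06/2542.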

2.6 LC_NUMERIC

In Thailand, numbers are written with a period separating the integral part from the fractional part. The integral part is grouped by three digits, separated by commas. For example, the speed of light in vacuum is 299,792.458 km/s.

decimal point <period> ( . )
thousands separator <comma> ( , )
grouping 3
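The grouping rule can be sketched as a small function that hard-codes the values from the table above (groups of three, comma separator) for the integral part. The function name is hypothetical; a locale-aware program would read these values from localeconv() instead.

```c
#include <stdio.h>

/* Insert a comma between every group of three digits of an unsigned
   integer, e.g. 299792 becomes "299,792".
   Returns 0 on success, -1 if the buffer is too small. */
static int group_thousands(unsigned long value, char *buf, size_t size)
{
    char digits[32];
    size_t len = (size_t)snprintf(digits, sizeof digits, "%lu", value);
    size_t commas = (len - 1) / 3;
    size_t i, out = 0;

    if (len + commas + 1 > size)
        return -1;
    for (i = 0; i < len; i++) {
        size_t remaining = len - 1 - i;   /* digits still to be copied */

        buf[out++] = digits[i];
        /* A separator follows whenever a multiple of 3 digits remains. */
        if (remaining > 0 && remaining % 3 == 0)
            buf[out++] = ',';
    }
    buf[out] = '\0';
    return 0;
}
```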

2.7 LC_MONETARY

The Thai currency is the Baht. The negative sign is minus, while the positive sign is not shown. For example, THB 1,234.00 is the international form for ฿1,234.00.

international currency symbol "THB " (as per ISO 4217:1981)
currency symbol ฿ (U0E3F)
monetary decimal point <period> ( . )
monetary thousands separator <comma> ( , )
monetary grouping 3
positive sign none
negative sign <hyphen> ( - )
international fractional digits 2
fractional digits 2
positive currency sign precedes 1 (the currency sign precedes the nonnegative monetary quantity)
positive separated by space 2 (space separates currency symbol and the positive sign --which is null)
negative currency sign precedes 1 (the currency sign precedes the negative monetary quantity)
negative separated by space 2 (space separates currency symbol and the negative sign)
positive sign position 4 (immediately follows the currency symbol)
negative sign position 4 (immediately follows the currency symbol)

2.8 LC_MESSAGES

All that has been settled for LC_MESSAGES in the POSIX standard is the yes/no strings and responses. Internationalization of other messages is mostly done per application, not in the system default locale definition.

In this th_TH locale definition, the text strings for yes and no have been translated into Thai, and the initial consonants of the first syllables of the Thai translations are used to define the positive and negative responses from users. However, because English is commonly used for such responses, Y and N are also accepted.

yes expression ^[Yyช] (begins with 'Y', 'y' or Cho Chang)
no expression ^[Nnม] (begins with 'N', 'n' or Mo Ma)
yes string ""
no string ""
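The yes and no expressions are regular expressions, so a program can test a user's answer with the POSIX <regex.h> functions. The sketch below passes the expression in directly and, in its usage, sticks to the ASCII alternatives, because matching the Thai letters depends on the encoding of the active locale. The helper name is hypothetical; a real program would typically obtain the expression from nl_langinfo(YESEXPR).

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if `answer` matches the extended regular expression `expr`,
   0 if it does not, and -1 if the expression fails to compile. */
static int matches_response(const char *expr, const char *answer)
{
    regex_t re;
    int rc;

    if (regcomp(&re, expr, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, answer, 0, NULL, 0);   /* 0 means a match was found */
    regfree(&re);
    return rc == 0;
}
```

With the expression "^[Yy]", the answers "yes" and "Y" are accepted and "no" is rejected.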

3 Building Thai POSIX Locale

The th_TH locale has already been incorporated into glibc, and is normally pre-built in some Linux distributions, such as RedHat and Mandrake. Debian has its own configuration method: edit the /etc/locale.gen file and invoke the locale-gen command; only the listed (i.e. uncommented) locales will be generated.

At the lowest level, however, the command to generate the th_TH locale is

    localedef -f TIS-620 -i th_TH th_TH

The TIS-620 charmap can be found at /usr/share/i18n/charmaps and the th_TH locale definition at /usr/share/i18n/locales. The generated locale files are located at /usr/lib/locale/th_TH.

4 Acknowledgements

Mr. Trin Tantsetthi launched the Thai Locale Project with the cooperation of Mr. Samphan Raruenrom, Mr. Pruet Boonma and other Thai developers over the internet. I happened to hear their conversation and was motivated to draft a definition for Thai LC_COLLATE, for it is a formal way to describe how Thai strings are ordered. Having tried to describe the ordering algorithm in words in a few articles, I found the LC_COLLATE specification to be what I had longed for. It is more precise, clearer, and easier to apply and test.

The documents on standards are supported by the Thai Locale Project web site. The TIS-620 charmap, and subsequently the mnemonic.th repertoiremap, were supplied by Mr. Trin and Mr. Pruet. The LC_MONETARY category was prepared by Mr. Trin.

The formal description, like any other formulation, then needed to cover all aspects of the problem. Urged on by Mr. Pruet's project on Thai sorting support for database servers, and lacking a Thai sorting standard to rely on, Mr. Samphan, Mr. Pruet and I brainstormed to create a specification for it. After a heated discussion, we ended up with a sorting principle for all strings encoded in TIS 620-2533. I merely bear the duty of creating the web site for the summary; the ideas are not all mine.

After the LC_COLLATE draft, Dr. Thaweesak Koanantakool and Dr. Virach Sornlertlamvanich, my chiefs at NECTEC, both inspired me to draft the remaining categories of the POSIX locale. This has been much supported by Mr. Pattara Kiatisevi, who included it in the Thai Linux Working Group project.

Last, but not least, I am indebted to Mr. Ulrich Drepper of the GNU libc project for his help. He was kind enough to accept my th_TH locale definition for inclusion in GNU libc. Before that, he had to work out the problems posed by the uncommon requirements of the Thai locale, and at last made GNU libc accept the th_TH locale.

5 Disclaimer

This locale definition is BY NO MEANS A STANDARD. It is provided in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

6 Related Sites

