Internationalization, also known as I18N, is the provision within a computer program of the capability to adapt itself to the requirements of different native languages, local customs and coded character sets. Through internationalization, software can adapt to any local culture without recompilation by relying on an abstract set of conventions whose details are left to the localization process.
A locale definition fills this slot by describing the cultural conventions of a particular locale. Several kinds of locale have been defined on different systems, such as the POSIX locale for basic cultural conventions, ISO/IEC 14652 as a POSIX enhancement, the X locale for X Window applications, the CTL locale for complex text layout, and so on.
POSIX (Portable Operating System Interface for Computing Environments) is an international standard that defines the interface between applications and the operating system. It improves software portability because the interface is guaranteed to be available on any conforming OS.
One area covered by POSIX is internationalization, where the term locale refers to the set of language and cultural rules.
Each locale is named for the language, territory, and character code set it describes. The following format is used for naming a locale:
language[_territory][.codeset]
The language part must be one of the language names defined in ISO 639:1988, and the territory part is one of the Alpha-2 code elements defined in ISO 3166-1. For example, the French language spoken in Canada using the ISO-8859-1 code set is fr_CA.ISO-8859-1. The fr stands for the French language and the CA stands for Canada. If the codeset is omitted, an implementation-dependent default encoding is implied.
The locale describing the Thai language spoken in Thailand using the TIS-620 code set, which is explained in this article, is called th_TH.TIS-620, or in short, th_TH.
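The naming format above can be taken apart mechanically. The following is a minimal sketch in C, not part of any standard library: the names struct locale_name and parse_locale_name() are hypothetical, the buffer sizes are arbitrary, and the "@modifier" suffix accepted by some implementations is not handled.

```c
#include <assert.h>
#include <string.h>

/* Parts of a locale name of the form language[_territory][.codeset].
   Missing parts are left as empty strings. (Hypothetical helper type.) */
struct locale_name {
    char language[16];
    char territory[16];
    char codeset[32];
};

/* Copy at most dstsize-1 bytes of src[0..len) into dst and terminate it. */
static void copy_part(char *dst, size_t dstsize, const char *src, size_t len)
{
    if (len >= dstsize)
        len = dstsize - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Split a name such as "fr_CA.ISO-8859-1" into its three parts. */
void parse_locale_name(const char *name, struct locale_name *out)
{
    assert(name != NULL && out != NULL);

    const char *dot = strchr(name, '.');                 /* start of codeset */
    const char *end = dot ? dot : name + strlen(name);   /* end of language/territory */
    const char *us = memchr(name, '_', (size_t)(end - name));
    const char *lang_end = us ? us : end;

    copy_part(out->language, sizeof out->language, name,
              (size_t)(lang_end - name));
    if (us)
        copy_part(out->territory, sizeof out->territory, us + 1,
                  (size_t)(end - (us + 1)));
    else
        out->territory[0] = '\0';
    if (dot)
        copy_part(out->codeset, sizeof out->codeset, dot + 1, strlen(dot + 1));
    else
        out->codeset[0] = '\0';
}
```

Called with "fr_CA.ISO-8859-1", the sketch fills in "fr", "CA" and "ISO-8859-1"; called with "th_TH", the codeset is left empty, which corresponds to the implementation-dependent default mentioned above.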
A number of Standard C library functions adapt to the local culture, which means that a program written in C can be internationalized to some degree. If no locale is specified, it defaults to POSIX or C, in which everything works as defined in the C standard. Otherwise, the setlocale() function call determines the locale to use. In general, an empty string is passed to the function, and the locale is then taken from the environment variables LANG and LC_* as follows:
LANG | Default value for unspecified categories |
LC_ALL | If defined, overrides all locale categories |
LC_COLLATE | The name of the locale for collation information |
LC_CTYPE | The name of the locale for character classification |
LC_MONETARY | The name of the locale for money related information |
LC_NUMERIC | The name of the locale for numeric editing |
LC_TIME | The name of the locale for date- and time-formatting information |
LC_MESSAGES | The name of the locale for messages |
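<locale.h>
    char* setlocale(int category, const char* locale);

Example:
    const char* pLocale = setlocale(LC_ALL, "");
sets the current locale to the one specified by the LANG or LC_* environment variables, and returns the name of the locale that was selected (or a null pointer if the request cannot be honored).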
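The precedence among the environment variables can be sketched as a small function. This is only an illustration of the rule in the table, assuming that an empty variable counts as unset; effective_locale() and init_locale() are hypothetical names, and the fallback to "C" mirrors the default described above.

```c
#include <locale.h>
#include <stddef.h>

/* Pick the locale name for one category from the three candidate values,
   in order of precedence: LC_ALL, then the category's own variable,
   then LANG. NULL or empty values are treated as unset. */
const char *effective_locale(const char *lc_all, const char *lc_category,
                             const char *lang)
{
    if (lc_all != NULL && lc_all[0] != '\0')
        return lc_all;
    if (lc_category != NULL && lc_category[0] != '\0')
        return lc_category;
    if (lang != NULL && lang[0] != '\0')
        return lang;
    return "C";     /* nothing set: fall back to the default locale */
}

/* Let the C library apply the environment itself, and fall back to the
   "C" locale if the environment names a locale that is not installed. */
const char *init_locale(void)
{
    const char *name = setlocale(LC_ALL, "");
    if (name == NULL)
        name = setlocale(LC_ALL, "C");
    return name;
}
```

In a real program only init_locale() is needed; effective_locale() merely spells out what setlocale(LC_ALL, "") is expected to do for each category.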
<ctype.h>
    int iscntrl(int c);
    int isgraph(int c);
    int isprint(int c);
    int isspace(int c);
    int ispunct(int c);
    int isalnum(int c);
    int isalpha(int c);
    int isdigit(int c);
    int isxdigit(int c);
    int islower(int c);
    int isupper(int c);
    int tolower(int c);
    int toupper(int c);
<string.h>
    int strcoll(const char* s1, const char* s2);
    size_t strxfrm(char* dest, const char* src, size_t n);
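As a sketch of how these two functions are typically used, the following sorts an array of strings with strcoll() and builds a reusable sort key with strxfrm(). The helper names sort_collated() and make_sort_key() are hypothetical; the caller is assumed to have selected a locale with setlocale() beforehand, otherwise the "C" locale ordering applies.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: orders two C strings by the current LC_COLLATE rules. */
static int compare_collated(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcoll(*sa, *sb);
}

/* Sort an array of strings in place according to the current locale. */
void sort_collated(const char **strings, size_t count)
{
    qsort(strings, count, sizeof *strings, compare_collated);
}

/* Build a key such that strcmp() on two keys orders them the same way
   strcoll() orders the original strings. The caller frees the result. */
char *make_sort_key(const char *s)
{
    size_t need = strxfrm(NULL, s, 0) + 1;  /* length query, then room for '\0' */
    char *key = malloc(need);
    if (key != NULL)
        strxfrm(key, s, need);
    return key;
}
```

Precomputed keys pay off when the same strings are compared many times, for example when sorting a large word list.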
<time.h>
    size_t strftime(char* s, size_t maxsize, const char* format, const struct tm* tp);
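A minimal sketch of strftime() in use follows. The wrapper format_date() is a hypothetical helper; it relies on mktime() to fill in the day of the week, and its output depends on whatever LC_TIME locale is currently in effect.

```c
#include <string.h>
#include <time.h>

/* Format a calendar date with strftime() under the current LC_TIME locale.
   Returns the number of bytes written, or 0 on failure. */
size_t format_date(char *buf, size_t size, const char *format,
                   int year, int month, int day)
{
    struct tm tm;
    memset(&tm, 0, sizeof tm);
    tm.tm_year = year - 1900;   /* struct tm counts years from 1900 */
    tm.tm_mon = month - 1;      /* and months from 0 */
    tm.tm_mday = day;
    tm.tm_hour = 12;            /* midday keeps clear of daylight-saving shifts */
    tm.tm_isdst = -1;           /* let mktime() decide */
    if (mktime(&tm) == (time_t)-1)  /* normalizes and fills tm_wday, tm_yday */
        return 0;
    return strftime(buf, size, format, &tm);
}
```

In the default "C" locale, the format "%A %d %B %Y" applied to 13 June 1999 yields English names; under a locale with translated names, the same call yields the local ones.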
<locale.h>
    struct lconv
    {
      /* Numeric (non-monetary) information.  */
      char *decimal_point;      /* Decimal point character.  */
      char *thousands_sep;      /* Thousands separator.  */
      /* Each element is the number of digits in each group;
         elements with higher indices are farther left.
         An element with value CHAR_MAX means that no further grouping is done.
         An element with value 0 means that the previous element is used
         for all groups farther left.  */
      char *grouping;

      /* Monetary information.  */
      /* First three chars are a currency symbol from ISO 4217.
         Fourth char is the separator.  Fifth char is '\0'.  */
      char *int_curr_symbol;
      char *currency_symbol;    /* Local currency symbol.  */
      char *mon_decimal_point;  /* Decimal point character.  */
      char *mon_thousands_sep;  /* Thousands separator.  */
      char *mon_grouping;       /* Like `grouping' element (above).  */
      char *positive_sign;      /* Sign for positive values.  */
      char *negative_sign;      /* Sign for negative values.  */
      char int_frac_digits;     /* Int'l fractional digits.  */
      char frac_digits;         /* Local fractional digits.  */
      /* 1 if currency_symbol precedes a positive value, 0 if succeeds.  */
      char p_cs_precedes;
      /* 1 iff a space separates currency_symbol from a positive value.  */
      char p_sep_by_space;
      /* 1 if currency_symbol precedes a negative value, 0 if succeeds.  */
      char n_cs_precedes;
      /* 1 iff a space separates currency_symbol from a negative value.  */
      char n_sep_by_space;
      /* Positive and negative sign positions:
         0 Parentheses surround the quantity and currency_symbol.
         1 The sign string precedes the quantity and currency_symbol.
         2 The sign string follows the quantity and currency_symbol.
         3 The sign string immediately precedes the currency_symbol.
         4 The sign string immediately follows the currency_symbol.  */
      char p_sign_posn;
      char n_sign_posn;
    };

    struct lconv* localeconv(void);
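The structure is read through localeconv(). The sketch below, with hypothetical helper names, queries two of the numeric fields; it assumes nothing beyond the fields shown above.

```c
#include <limits.h>
#include <locale.h>

/* Decimal point string of the current locale, e.g. "." */
const char *decimal_point(void)
{
    return localeconv()->decimal_point;
}

/* Size of the rightmost digit group in the current locale,
   or 0 if the locale does not group digits at all. */
int first_group_size(void)
{
    const struct lconv *lc = localeconv();
    if (lc->thousands_sep[0] == '\0')   /* no separator defined */
        return 0;
    if (lc->grouping[0] == '\0' || lc->grouping[0] == CHAR_MAX)
        return 0;                       /* no grouping rule defined */
    return lc->grouping[0];
}
```

In the default "C" locale the decimal point is a period and no grouping is performed, so first_group_size() returns 0 there.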
No standard function is defined for the LC_MESSAGES category yet, although there have been several proposals, such as:
In POSIX, locales can be defined using three definition files:
The definition files must be translated into binary data to be used by the standard libraries of programming languages that conform to the POSIX interface, such as C, Ada and Fortran. The utility program for this purpose is called localedef.
An example of a POSIX locale definition can be seen in the Thai locale definition, which is known to work with GNU libc 2.1.1 or later.
locale - get current locale information
Options:
    -a    write names of available locales
    -m    write names of available charmaps
    -c    write names of selected categories
    -k    write names of selected keywords

localedef - generate and install locale definition data
Options:
    -f FILE    symbolic character names defined in FILE
    -i FILE    source definitions in FILE
    -u FILE    FILE contains mapping from symbolic names to UCS4 (ISO/IEC 10646 elements)
ISO/IEC 14652 is an extension of the POSIX locale. In addition to providing more detail in each of the six categories, it adds six more categories:
All locale information can be retrieved with the nl_langinfo() function.
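As a small illustration of nl_langinfo(), the sketch below looks up weekday names. The function comes from <langinfo.h>, which is a POSIX header rather than part of ISO C; weekday_name() is a hypothetical wrapper, and only the long-standing DAY_1 to DAY_7 items are used.

```c
#include <langinfo.h>
#include <stddef.h>

/* Name of weekday `wday` (0 = Sunday ... 6 = Saturday) in the current
   LC_TIME locale, or NULL if `wday` is out of range. */
const char *weekday_name(int wday)
{
    static const nl_item items[] = {
        DAY_1, DAY_2, DAY_3, DAY_4, DAY_5, DAY_6, DAY_7
    };
    if (wday < 0 || wday > 6)
        return NULL;
    return nl_langinfo(items[wday]);
}
```

The returned string belongs to the library and may be overwritten by later calls, so it should be copied if it needs to be kept.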
The ISO C++ <locale> standard library is another extensible internationalization framework. In ISO C++, a locale is closely tied to iostreams. A locale contains a set of facets, which can be customized by class derivation. The system locale categories, however, are provided as default facets by the standard library.
To set the locale of an iostream object, use the ios_base::imbue() member function:
    #include <iostream>
    #include <locale>

    void f()
    {
        std::locale loc("");            // create a locale object according
                                        // to LANG or LC_ALL environment
        std::cin.imbue(loc);            // let cin use loc
        // ...
        std::cin.imbue(std::locale());  // reset cin to use the global locale
    }
The key to understanding C++ locales is the facet. A facet is an interface to a service provided by a locale object. For example, num_put<> is a facet for formatting numeric values (according to LC_NUMERIC), and the collate<> facet provides string ordering (according to LC_COLLATE). A locale object in fact contains a vector of locale::facet objects.
use_facet<> ...
Example: ...
...
The primary standard for Thai information interchange is TIS 620-2533 (1990 A.D.). Most international standards regarding Thai character codes are based on this national standard. This includes ISO-IR-166, an international register of coded character set to be used with escape sequences. For more information, please see Mr. Trin Tantsetthi's An Annotated Reference to the Thai implementations.
In TIS 620-2533 code table, the lower half (0x00-0x7F) is duplicated from ISO 646, namely ASCII 7-bit code. The upper half (0x80-0xFF) is defined in conformance to ISO/IEC 2022. No character codes are defined in the CR area (0x80-0x9F). Thai character code then begins at 0xA1 (= the first Thai consonant Ko Kai).
Thai characters are assigned codes in the range U+0E00 to U+0E5F in Unicode and ISO/IEC 10646-1. The code table is equivalent to the range 0xA0-0xFF in TIS 620-2533.
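Because the two code tables are laid out identically, converting a byte to a code point is a fixed offset. The sketch below assumes the usual assignment of TIS 620-2533, in which 0xA0, 0xDB-0xDE and 0xFC-0xFF carry no character; tis620_to_unicode() is a hypothetical name.

```c
/* Map one TIS 620-2533 byte to its Unicode code point.
   Returns -1 for bytes that have no assigned character. */
int tis620_to_unicode(unsigned char byte)
{
    if (byte < 0x80)
        return byte;                        /* lower half is plain ASCII */
    if ((byte >= 0xA1 && byte <= 0xDA) ||   /* KO KAI .. PHINTHU */
        (byte >= 0xDF && byte <= 0xFB))     /* BAHT .. KHOMUT */
        return 0x0E00 + (byte - 0xA0);      /* 0xA0 lines up with U+0E00 */
    return -1;                              /* CR area and unassigned codes */
}
```

For example, the first consonant Ko Kai at 0xA1 maps to U+0E01, and the Baht sign at 0xDF maps to U+0E3F.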
The Thai character set is composed of Thai consonants, Thai vowels, Thai diacritics, Thai tone marks, Thai punctuation marks, and Thai digits. Thai letters have no case. Therefore, the subcategories regarding case are the same as those of POSIX. This is also true for hexadecimal digits, space characters and control characters.
upper | same as POSIX |
lower | same as POSIX |
alpha | Roman upper/lower cases + Thai consonants + Thai vowels + LAKKHANGYAO + MAITAIKHU + Thai tone marks + THANTHAKHAT |
digit | Arabic digits 0-9 |
xdigit | same as POSIX |
space | same as POSIX |
cntrl | same as POSIX |
punct | ASCII punctuations + PAIYANNOI + BAHT + MAIYAMOK + FONGMAN + ANGKHANKHU + KHOMUT |
graph | upper + lower + alpha + digit + xdigit + punct + PAIYANNOI + BAHT + MAIYAMOK + FONGMAN + ANGKHANKHU + KHOMUT + PINTHU + NIKHAHIT + YAMAKKAN |
print | 0x20 + upper + lower + alpha + digit + xdigit + punct + PAIYANNOI + BAHT + MAIYAMOK + FONGMAN + ANGKHANKHU + KHOMUT + PINTHU + NIKHAHIT + YAMAKKAN |
blank | same as POSIX |
toupper | same as POSIX |
tolower | same as POSIX |
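To make the alpha row of the table concrete, here is a rough sketch of that class for a single TIS 620-2533 byte. It is not the code used by any C library; thai_isalpha() is a hypothetical name, and the byte ranges are assumptions read off the TIS 620-2533 code table.

```c
/* Sketch of the "alpha" class from the table for one TIS 620-2533 byte.
   Returns 1 if the byte is alphabetic, 0 otherwise. */
int thai_isalpha(unsigned char c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
        return 1;                   /* Roman upper and lower case */
    if (c >= 0xA1 && c <= 0xCE)
        return 1;                   /* Thai consonants, KO KAI .. HO NOKHUK */
    if ((c >= 0xD0 && c <= 0xD9) || (c >= 0xE0 && c <= 0xE4))
        return 1;                   /* Thai vowels (assumed ranges) */
    if (c == 0xE5 || c == 0xE7)
        return 1;                   /* LAKKHANGYAO, MAITAIKHU */
    if (c >= 0xE8 && c <= 0xEC)
        return 1;                   /* tone marks and THANTHAKHAT */
    return 0;                       /* digits, punctuation, everything else */
}
```

With the th_TH locale installed and selected, the standard isalpha() is expected to give the same answers, so a helper like this is only useful for illustration or testing.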
The Thai string collation principle can be found in the Royal Institute Dictionary, 2525 B.E. Edition (1982 A.D.), the official standard dictionary of the Thai language. TIS 620-2533 was designed so that sorting according to this principle is easy. The principles are:
The ordering in this locale definition is based on this principle, plus some additional issues which are not defined in the dictionary. According to the Royal Institute dictionary principle, there are two basic problems to solve:
However, there are topics missing from the Royal Institute principle, under the circumstance that TIS 620-2533 code table is being used. These include:
These have been discussed in the Thai-English Bilingual Sorting, and have been adopted in this locale definition.
2.5 LC_TIME
Thai names, as well as abbreviations, for the months and the days of the week are defined in this category. The official calendar used in Thailand is a solar calendar in accordance with the Gregorian calendar, except that the Buddhist Era (B.E. = A.D. + 543) is used instead of Anno Domini. Thais always express dates in day, month, year order, in both full and abbreviated forms. For times, the 24-hour format is usually used in Thailand; AM/PM is normally used only in displays where there is no surrounding text.
Unfortunately, the era definition is not a part of LC_TIME in POSIX, although some implementations have taken such enhancement from ISO/IEC 14652. Hence, we cannot define Buddhist Era in the context of POSIX yet.
abbreviated days (%a) | อา, จ, อ, พ, พฤ, ศ, ส |
days (%A) | อาทิตย์, จันทร์, อังคาร, พุธ, พฤหัสบดี, ศุกร์, เสาร์ |
abbreviated months (%b) | ม.ค., ก.พ., มี.ค., เม.ย., พ.ค., มิ.ย., ก.ค., ส.ค., ก.ย., ต.ค., พ.ย., ธ.ค. |
months (%B) | มกราคม, กุมภาพันธ์, มีนาคม, เมษายน, พฤษภาคม, มิถุนายน, กรกฎาคม, สิงหาคม, กันยายน, ตุลาคม, พฤศจิกายน, ธันวาคม |
appropriate date & time format (%c) | %a %e %b %Y, %H:%M:%S e.g. อา. 13 มิ.ย. 2542, 22:48:32 |
appropriate date format (%x) | %d/%m/%Y e.g. 13/06/2542 |
appropriate time format (%X) | %H:%M:%S e.g. 22:48:32 |
AM/PM sign (%p) | AM; PM |
appropriate 12-hour clock format (%r) | %I:%M:%S %p e.g. 10:48:32 PM |
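Since the era cannot be expressed in a POSIX LC_TIME definition, an application that wants Buddhist Era years has to convert them itself. The following is a minimal sketch of the day/month/year form from the table; format_thai_date() is a hypothetical helper, and the offset of 543 is taken from the example above, where 1999 A.D. appears as 2542.

```c
#include <stdio.h>

/* Offset between Buddhist Era and Anno Domini year numbers. */
enum { BE_OFFSET = 543 };

/* Write a Gregorian date as DD/MM/YYYY with the year in the Buddhist Era.
   Returns the snprintf() result, i.e. the length the text needs. */
int format_thai_date(char *buf, size_t size, int day, int month, int ad_year)
{
    return snprintf(buf, size, "%02d/%02d/%04d",
                    day, month, ad_year + BE_OFFSET);
}
```

Applied to 13 June 1999, the sketch produces 13/06/2542, matching the %x example in the table.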
In Thailand, numbers are written with a period separating the integral part from the fractional part. The integral part is grouped by three digits, separated by commas. For example, the speed of light in vacuum is 299,792.458 km/s.
decimal point | <period> ( . ) |
thousands separator | <comma> ( , ) |
grouping | 3 |
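The grouping rule in the table can be sketched as a small routine. This is an illustration only: group_digits() is a hypothetical helper that handles a single, uniform group size, whereas the grouping field of struct lconv can describe more elaborate schemes.

```c
#include <assert.h>
#include <string.h>

/* Insert `sep` between groups of `group` digits, counting from the right.
   `digits` holds the integral part only, e.g. "299792".
   Returns 0 on success, -1 if `out` is too small. */
int group_digits(const char *digits, char sep, int group,
                 char *out, size_t size)
{
    assert(digits != NULL && out != NULL && group > 0);

    size_t len = strlen(digits);
    size_t seps = len > 0 ? (len - 1) / (size_t)group : 0;
    if (len + seps + 1 > size)
        return -1;

    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        /* A separator goes before digit i when the digits still to be
           written, including this one, form a whole number of groups. */
        if (i > 0 && (len - i) % (size_t)group == 0)
            out[o++] = sep;
        out[o++] = digits[i];
    }
    out[o] = '\0';
    return 0;
}
```

With a comma and a group size of 3, the integral part of the speed of light example, 299792, comes out as 299,792.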
The Thai currency is the Baht. The negative sign is a minus, while the positive sign is not shown. For example, THB 1,234.00 is the international form of ฿ 1,234.00.
international currency symbol | "THB " (as per ISO 4217:1981) |
currency symbol | ฿ (U+0E3F) |
monetary decimal point | <period> ( . ) |
monetary thousands separator | <comma> ( , ) |
monetary grouping | 3 |
positive sign | none |
negative sign | <hyphen> ( - ) |
international fractional digits | 2 |
fractional digits | 2 |
positive currency sign precedes | 1 (the currency sign precedes the nonnegative monetary quantity) |
positive separated by space | 2 (space separates currency symbol and the positive sign --which is null) |
negative currency sign precedes | 1 (the currency sign precedes the negative monetary quantity) |
negative separated by space | 2 (space separates currency symbol and the negative sign) |
positive sign position | 4 (immediately follows the currency symbol) |
negative sign position | 4 (immediately follows the currency symbol) |
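Read together, the last six rows fix the layout of a monetary value: the currency symbol comes first, then a space, then the sign, then the quantity. The sketch below hard-codes that one combination (symbol precedes, separated-by-space 2, sign position 4); format_thai_money() is a hypothetical helper, the magnitude is assumed to be already grouped, and other combinations of these fields are not handled.

```c
#include <stdio.h>

/* Lay out a monetary value for the combination described in the table:
   the symbol precedes, a space follows it, and the sign (empty for
   positive values) sits directly in front of the quantity.
   Returns the snprintf() result. */
int format_thai_money(char *buf, size_t size, const char *symbol,
                      const char *magnitude, int negative)
{
    return snprintf(buf, size, "%s %s%s",
                    symbol, negative ? "-" : "", magnitude);
}
```

With the symbol THB and the magnitude 1,234.00 this gives "THB 1,234.00" for a positive amount and "THB -1,234.00" for a negative one. Where the locale is installed, the strfmon() function available on POSIX systems performs this layout from the locale data itself.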
All that has been settled for LC_MESSAGES in the POSIX standard is the yes/no strings and response expressions. Internationalization of other messages is mostly handled by individual applications, not by the system default locale definition.
In this th_TH locale definition, the text strings for yes and no have been translated into Thai, and the initial consonants of the first syllables of the Thai translations are used to define the positive and negative responses from users. However, due to the popularity of English usage in this case, Y and N are also accepted.
yes expression | ^[Yyช] (begins with 'Y', 'y' or Cho Chang) |
no expression | ^[Nnม] (begins with 'N', 'n' or Mo Ma) |
yes string | "ใช่" |
no string | "ไม่ใช่" |
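The two expressions only look at the first character of the reply, so they can be imitated without a regular expression engine. The sketch below assumes the reply is encoded in TIS 620-2533, where Cho Chang is byte 0xAA and Mo Ma is byte 0xC1; classify_response() is a hypothetical name.

```c
/* TIS 620-2533 byte values of the two consonants named in the table. */
enum { THAI_CHO_CHANG = 0xAA, THAI_MO_MA = 0xC1 };

/* Classify a user reply: 1 for yes, 0 for no, -1 if it is neither.
   Only the first byte is examined, as in the yes/no expressions. */
int classify_response(const char *response)
{
    unsigned char first = (unsigned char)response[0];

    if (first == 'Y' || first == 'y' || first == THAI_CHO_CHANG)
        return 1;
    if (first == 'N' || first == 'n' || first == THAI_MO_MA)
        return 0;
    return -1;
}
```

A program running under the th_TH locale would normally obtain the expressions from the locale instead of hard-coding them as done here.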
The th_TH locale has already been incorporated into glibc, and is normally pre-built in some Linux distributions, such as RedHat and Mandrake. Debian has its own configuration method: edit the /etc/locale.gen file and invoke the locale-gen command; only the listed (i.e. uncommented) locales will be generated.
However, at the lowest level, the command to generate the th_TH locale is
localedef -f TIS-620 -i th_TH th_TH
The TIS-620 charmap can be found at /usr/share/i18n/charmaps and the th_TH locale definition at /usr/share/i18n/locales. The generated locale files are located at /usr/lib/locale/th_TH.
Mr. Trin Tantsetthi launched the Thai Locale Project in cooperation with Mr. Samphan Raruenrom, Mr. Pruet Boonma and other Thai developers over the internet. I happened to hear their conversation and was motivated to draft a definition for Thai LC_COLLATE, as it is a formal way to describe how Thai strings are ordered. Having tried to describe the ordering algorithm in words in a few articles, I found the LC_COLLATE specification to be what I had longed for. It is more precise, clearer, and easier to apply and test.
The documents on standards are supported by the Thai Locale Project web site. The TIS-620 charmap, and subsequently the mnemonic.th repertoiremap, were supplied by Mr. Trin and Mr. Pruet. The LC_MONETARY category was prepared by Mr. Trin.
The formal description, like any formulation, then had to cover all aspects of the problem. Urged on by Mr. Pruet's project on Thai sorting support for database servers, and lacking a Thai sorting standard to rely on, Mr. Samphan, Mr. Pruet and I brainstormed to create a specification for it. After heated discussion, we ended up with a sorting principle for all strings encoded in TIS 620-2533. I merely bore the duty of creating the web site for the summary; the ideas are not all mine.
After the LC_COLLATE draft, Dr. Thaweesak Koanantakool and Dr. Virach Sornlertlamvanich, my chiefs at NECTEC, both inspired me to draft the remaining categories of the POSIX locale. This work has been greatly supported by Mr. Pattara Kiatisevi, who included it in the Thai Linux Working Group project.
Last, but not least, I am indebted to Mr. Ulrich Drepper of the GNU libc project for his help. He was kind enough to accept my th_TH locale definition for inclusion in GNU libc. Before that, he had to work out the problems posed by the uncommon requirements of the Thai locale, and finally made GNU libc accept the th_TH locale.
This locale definition is BY NO MEANS A STANDARD. It is provided in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Copyright © 1999 by Theppitak Karoonboonyanan, Software and Language Engineering Laboratory, National Electronics and Computer Technology Center. All rights reserved.