A Thai Soundex System for Spelling Correction


Theppitak Karoonboonyanan1 , Virach Sornlertlamvanich1,2 , Surapant Meknavin1
1Linguistics and Knowledge Science Laboratory
National Electronics and Computer Technology Center
Ministry of Science Technology and Environment, Thailand
2Department of Computer Science
Tokyo Institute of Technology, Japan

[ NLPRS'97 conference paper (.ps) is available ] [ See also a demonstration ]

Contents


Motivation

  1. Requirement of Tolerant Search in the Royal Institute Dictionary (CD-ROM)
  2. Other Applications
[ Up | Next ]

Previous Works

Common Characteristics

[ Up | Prev | Next ]

The Problems

Some problems in the previous schemes cause wrong grouping in many cases:
  1. No difference between initial and final consonants. All but the first consonants are treated as syllable enders.
  2. Different spellings of vowel sound are not recognized. Thus, some homophones are separated.
  3. Propagation on final consonants is not covered.
  4. Variably pronounced consonants are not always determined.
[ Up | Prev | Next ]

The Proposed System

  1. The Soundex Code will keep common pronunciation features:
    1. Initial Consonant Sound (20 groups)
    2. Vowel Sound (12 groups)
    3. Final Consonant Sound (9 groups)
    and eliminate variations:
    1. Clustering
    2. Vowel Length
    3. Tone
  2. The Encoding Machine
  3. The Entry Matching Strategy
[ Up | Prev | Next ]

A Model for Soundex Encoding Machine

Suppose, for example, that the only valid syllables are:
  1. ini-cons fnl-cons
  2. ini cons 'Ò' [fnl-cons]
  3. 'à' ini-cons 'Ò'
  4. 'à' ini-cons [fnl-cons]
where
ini-con ::= cons | cons cons
cons ::= 'Ê' | '¹'
fnl-cons ::= 'Ê' | '¹'
To construct a converter, we add the outputs to the rules:
  1. ini-cons (e/'o') fnl-cons
  2. ini-cons ('Ò'/'a') (e/'Î')
  3. ini-cons ('Ò'/'a') fnl-cons
  4. ('à'/e) ini-cons ('Ò'/'a') (e/'Ç')
  5. ('à'/e) ini-cons (e/'e') (e/'Î')
  6. ('à'/e) ini-cons (e/'e') fnl-cons
where
ini-con ::= cons | cons (e/'a') (e/'Î') cons
cons ::= ('Ê'/'«') | ('¹'/'¹')
fnl-cons ::= ('Ê'/'´') | ('¹'/'¹')
Then, we construct a converter machine from the union set of all rules:

And, by substituting macros:

Here are the only two possible executions on a input transcription "àʹÒ":
(1) -à/e-> (5) -Ê/«-> (c) -e/a-> (d) -e/Î-> (e) -¹/¹-> (6) -Ò/a-> (7) -e/Ç-> (8)*
(1) -à/e-> (5) -Ê/«-> (c) -e/e-> (6) -e/e-> (4) -e/Î-> (8) -e/e-> (1) -¹/¹-> (9) -e/e-> (2) -Ò/a-> (4) -e/Î-> (8)*
which yield soundex codes "«aιaÇ" and "«eιaÎ", respectively.
[ Up | Prev | Next ]

Handling Final Consonant Propagation

In Thai, a final consonant can sometimes function as initial consonant of the next syllable.
ÍѵÃÒ = Íѵà + µÃÒ
One quick solution is to provide a short-circuit machanism: To express this mechanism in the rules, we introduce two new notations:
TO(label; in/out) at current position, branch an (in/out) edge to label state.
TI(label) current position is reachable from label state by an (e/e) edge.
For example, suppose that the original rules are:
  1. ini-cons (' Ñ'/'a') fnl-cons
  2. ini-cons ('Ò'/'a') (e/'Î')
where
ini-cons ::= ('Í'/'Í') | ('µ'/'µ') | ('µÃ'/'µ')
fnl-cons ::= ('µ'/'´') | ('µÃ'/'´')
Here the word 'ÍѵµÒ' is valid, but 'ÍѵÃÒ' is still not. However, with the short-circuit notation, we modify the rules like this:
  1. ini-cons TI(P) (' Ñ'/'a') fnl-cons
  2. ini-cons TI(P) ('Ò'/'a') (e/'Î')
where
ini-cons ::= ('Í'/'Í') | ('µ'/'µ') | ('µÃ'/'µ')
fnl-cons ::= ('µ'/'´') TO(P; e/'µ') | ('µÃ'/'´') TO(P; e/'µ')
which yields the following machine:

The execution of 'ÍѵÃÒ', which generates the corresponding code 'Ía´µaÎ', is:

(1) -Í/Í-> (2) - Ñ/a-> (3) -µÃ/´-> (6) -e/µ-> (P) -e/e-> (2) -Ò/a-> (4) -e/Î-> (7)*
[ Up | Prev | Next ]

Construction of the Conversion Rules

Thai syllables are composed of five components:
  1. Initial Consonants
  2. Vowel
  3. Final Consonants
  4. Tone
  5. Cancelled Letters

Initial Consonants

Final Consonants

[ Up | Prev | Next ]

Entry Matching

Two phases for a soundex system:
  1. Preparation.
  2. Query Processing.
The new key could be indexed by using trie. Then, the encoding machine could be directly cascaded to the trie. There are two advantages:
[ Up | Prev | Next ]

Evaluation

An Experimental System

The Test

The Results

Comparison on Similar Words Grouping:
Words [Udom83] [LK82] [Arun91] Our Model
ÊÃÃ¤ì  Ê300000  «0000  «¹  «a¹ 
Êѹ Ê300000 «0000 «¹ «a¹, «a¹¹aÎ
ºØ­­Ò 
 
º330000* 
 
ºE490 
 
ºÙ¹Ò* 
 
ºu¹ÂaÎ, ºuÎÂo¹ÂaÎ, ºu¹ÂaÎÂaÎ, ºuÎÂcÎÂaÎ, ºuÎÂaÎÂaÎ, ºu¹ÂaÎ
ºØ³ÂÒ º320000* ºE490 ºÙ¹ÂÒ* ºu¹ÂaÎ, ºu¹¹aÎÂaÎ, ºuιcÎÂaÎ, ºuιaÎÂaÎ
ÊѺ»Ðô Ê559400* «5430* «ºº¹´* «aº»aÎÅo´, «aººaλaÎÅo´´aÎ, «aº»aÎÅo´´aÎ, «aººaλaÎÅcδcÎ, «aººaλaÎÅaδcÎ, «aº»aÎÅcδcÎ, «aº»aÎÅaδcÎ, «aººaλaÎÅcδaÎ, «aººaλaÎÅaδaÎ, «aº»aÎÅcδaÎ, «aº»aÎÅaδaÎ, «aººaλaÎÅo´
ÊÑ»ÃÊ Ê594000* «5300* «º¹´* «aº»aÎÅo´, «aº»aÎÅo´«aÎ, «aºÅo´«aÎ, «aº»aΫcÎ, «aº«cÎ, «aº»aÎÅcΫcÎ, «aº»aÎÅaΫcÎ, «aºÅcΫcÎ, «aºÅaΫcÎ, «aº»aΫaÎ, «aº«aÎ, «aº»aÎÅcΫaÎ, «aº»aÎÅaΫaÎ, «aºÅcΫaÎ, «aºÅaΫaÎ, «aºÅo´
ä´é ´000000* ´7000* ´ä* ´aÂ
´éÒÂ ´200000* ´9700* ´ÒÂ* ´aÂ, ´aÂÂaÎ, ´aÎÂcÎ, ´aÎÂaÎ
¡éÒÇ  ¡000000  ¡9700*  ¡ÒÇ* ¡aÇ, ¡aÇÇaÎ, ¡aÎÇcÎ, ¡aÎÇaÎ
à¡éÒ ¡000000 ¡B900* ¡Òà* ¡aÇ
¨Ã ¨900000* ¨0000* ¨¹* ¨c¹, ¨cÎÅcÎ, ¨aÎÅcÎ, ¨cÎÅaÎ, ¨aÎÅaÎ
¨Í¹ ¨630000* ¨4000* ¨Í¹* ¨c¹, ¨c¹¹aÎ, ¨cÎÍo¹¹aÎ, ¨aÎÍo¹¹aÎ, ¨cιcÎ, ¨cÎÍcιcÎ, ¨aÎÍcιcÎ, ¨cÎÍaιcÎ, ¨aÎÍaιcÎ,  
¨cιaÎ, ¨cÎÍcιaÎ, ¨aÎÍcιaÎ, ¨cÎÍaιaÎ, ¨aÎÍaιaÎ, ¨cÎÍo¹, ¨aÎÍo¹
¸ÃÃÁÐ ·300000* ·6000 ·¹Á* ·aÁÁaÎ, ·a¹ÁaÎ, ·c¹ÅoÁÁaÎ, ·c¹ÅcÎÁaÎ, ·c¹ÅaÎÁaÎ, ·cÎÅcÎÅoÁÁaÎ, ·aÎÅcÎÅoÁÁaÎ, ·cÎÅaÎÅoÁÁaÎ, ·aÎÅaÎÅoÁÁaÎ, ·cÎÅcÎÅcÎÁaÎ, ·aÎÅcÎÅcÎÁaÎ, ·cÎÅaÎÅcÎÁaÎ, ·cÎÅcÎÅaÎÁaÎ, ·aÎÅaÎÅcÎÁaÎ, ·aÎÅcÎÅaÎÁaÎ, ·cÎÅaÎÅaÎÁaÎ, ·aÎÅaÎÅaÎÁaÎ
¸ÑÁÁÐ ·000000* ·6000 ·Á* ·aÁÁaÎ, ·aÁÁaÎÁaÎ

Comparison on Different Words Separation:
 
Words [Udom83] [LK82] [Arun91] Our Model
»Ñ­­Ò »330000* »4900* »¹Ò* »a¹ÂaÎ, »a¹ÂaÎÂaÎ
»Ñ¹¹Ò »330000* »4900* »¹Ò* »a¹¹aÎ, »a¹¹aιaÎ
ÊÕÊéÁ Ê400000* «3600* «Õ´Á* «iΫoÁ, «iΫoÁÁaÎ, «iΫcÎÁcÎ, «iΫcÎÁaÎ
ÊÕ´Ó Ê400000* «3600* «Õ´Á* «iδaÁ
¡éÁ ¡000000* ¡6000 ¡Á ¡oÁ, ¡oÁÁaÎ, ¡cÎÁcÎ, ¡cÎÁaÎ
à¡Á ¡000000* ¡B600 ¡Áà ¡eÁ, ¡eÁÁaÎ, ¡eÎÁcÎ, ¡eÎÁaÎ, ¡aÎÁeÎ
¡ÔèÇ ¡000000* ¡7000 ¡Ç ¡iÇ, ¡iÇÇaÎ, ¡iÎÇcÎ, ¡iÎÇaÎ
ËÔÇ Ë000000* Ë7000* ËÇ* ÎiÇ, ÎiÇÇaÎ, ÎiÎÇcÎ, ÎiÎÇaÎ
ËÑÇ Ë000000* Ë7000* ËÇ* Î$Î
¡ÃÃä¡Ã ¡319000 ¡7100* ¡¹¡ä ¡a¹¡aÂ, ¡a¹¡aÎÅaÂ, ¡a¹ÅaΡaÂ, ¡c¹ÅaΡaÂ, ¡aÎÅaÎÅaΡaÂ, ¡c¹ÅaΡaÂÅaÎ, ¡cÎÅcÎÅcΡaÎÅaÂ, ...
¡ÃÐÂÔ¡ ¡921000 ¡7100* ¡Â¡ ¡aÎÂi¡, ¡aÎÂi¡¡aÎ, ¡cÎÅaÎÂi¡¡aÎ, ¡aÎÅaÎÂi¡¡aÎ, ¡aÎÂiΡcÎ, ¡cÎÅaÎÂiΡcÎ, ¡aÎÅaÎÂiΡcÎ, ¡aÎÂiΡaÎ, ¡cÎÅaÎÂiΡaÎ, ¡aÎÅaÎÂiΡaÎ, ¡cÎÅaÎÂi¡, ¡aÎÅaÎÂi¡
¡ÇÕÂÑ¡Éì ¡021000 ¡7100* ¡Õ¡ ¡aÎÇiÎÂa¡, ¡iÎÂa¡, ¡cÎÇiÎÂa¡
¨ÕÇà ¨090000 ¨7000* ¨Õǹ ¨iÎÇc¹, ¨iÎÇcÎÅcÎ, ¨iÎÇaÎÅcÎ, ¨iÎÇcÎÅaÎ, ¨iÎÇaÎÅaÎ
㨠¨000000 ¨7000* ¨ã ¨aÂ
ÇԹѠÇ320000 Ç4700* ǹä* ÇiιaÂ, Çi¹¹aÂÂaÎ, ÇiιaÂÂaÎ, Çi¹¹aÂ
ÇÅÑ Ç920000 Ç4700* ǹä* ÇaÎÅaÂ, Ço¹ÅaÂÂaÎ, ÇcÎÅaÂÂaÎ, ÇaÎÅaÂÂaÎ, Ço¹ÅaÂ, ÇcÎÅaÂ
 

[ Up | Prev | Next ]

Conclusions


A Demonstration

We have created an immediate demonstration page, providing a soundex search within the Thai Royal Institute Dictionary 2525 B.E. Edition. Try it with some words you're sure of their pronunciations but not of their spellings.


Last Updated: 1997-10-28
Theppitak Karoonboonyanan