A Thai Soundex System for Spelling Correction


Theppitak Karoonboonyanan1 , Virach Sornlertlamvanich1,2 , Surapant Meknavin1
1Linguistics and Knowledge Science Laboratory
National Electronics and Computer Technology Center
Ministry of Science Technology and Environment, Thailand
2Department of Computer Science
Tokyo Institute of Technology, Japan

[ NLPRS'97 conference paper (.ps) is available ] [ See also a demonstration ]

Contents


Motivation

  1. Requirement of Tolerant Search in the Royal Institute Dictionary (CD-ROM)
  2. Other Applications
[ Up | Next ]

Previous Works

Common Characteristics

[ Up | Prev | Next ]

The Problems

Some problems in the previous schemes cause wrong grouping in many cases:
  1. No difference between initial and final consonants. All but the first consonants are treated as syllable enders.
  2. Different spellings of vowel sound are not recognized. Thus, some homophones are separated.
  3. Propagation on final consonants is not covered.
  4. Variably pronounced consonants are not always determined.
[ Up | Prev | Next ]

The Proposed System

  1. The Soundex Code will keep common pronunciation features:
    1. Initial Consonant Sound (20 groups)
    2. Vowel Sound (12 groups)
    3. Final Consonant Sound (9 groups)
    and eliminate variations:
    1. Clustering
    2. Vowel Length
    3. Tone
  2. The Encoding Machine
  3. The Entry Matching Strategy
[ Up | Prev | Next ]

A Model for Soundex Encoding Machine

Suppose, for example, that the only valid syllables are:
  1. ini-cons fnl-cons
  2. ini cons '' [fnl-cons]
  3. '' ini-cons ''
  4. '' ini-cons [fnl-cons]
where
ini-con ::= cons | cons cons
cons ::= '' | ''
fnl-cons ::= '' | ''
To construct a converter, we add the outputs to the rules:
  1. ini-cons (e/'o') fnl-cons
  2. ini-cons (''/'a') (e/'')
  3. ini-cons (''/'a') fnl-cons
  4. (''/e) ini-cons (''/'a') (e/'')
  5. (''/e) ini-cons (e/'e') (e/'')
  6. (''/e) ini-cons (e/'e') fnl-cons
where
ini-con ::= cons | cons (e/'a') (e/'') cons
cons ::= (''/'') | (''/'')
fnl-cons ::= (''/'') | (''/'')
Then, we construct a converter machine from the union set of all rules:

And, by substituting macros:

Here are the only two possible executions on a input transcription "ʹ":
(1) -/e-> (5) -/-> (c) -e/a-> (d) -e/-> (e) -/-> (6) -/a-> (7) -e/-> (8)*
(1) -/e-> (5) -/-> (c) -e/e-> (6) -e/e-> (4) -e/-> (8) -e/e-> (1) -/-> (9) -e/e-> (2) -/a-> (4) -e/-> (8)*
which yield soundex codes "aιa" and "eιa", respectively.
[ Up | Prev | Next ]

Handling Final Consonant Propagation

In Thai, a final consonant can sometimes function as initial consonant of the next syllable.
ѵ = +
One quick solution is to provide a short-circuit machanism: To express this mechanism in the rules, we introduce two new notations:
TO(label; in/out) at current position, branch an (in/out) edge to label state.
TI(label) current position is reachable from label state by an (e/e) edge.
For example, suppose that the original rules are:
  1. ini-cons (' '/'a') fnl-cons
  2. ini-cons (''/'a') (e/'')
where
ini-cons ::= (''/'') | (''/'') | (''/'')
fnl-cons ::= (''/'') | (''/'')
Here the word 'ѵ' is valid, but 'ѵ' is still not. However, with the short-circuit notation, we modify the rules like this:
  1. ini-cons TI(P) (' '/'a') fnl-cons
  2. ini-cons TI(P) (''/'a') (e/'')
where
ini-cons ::= (''/'') | (''/'') | (''/'')
fnl-cons ::= (''/'') TO(P; e/'') | (''/'') TO(P; e/'')
which yields the following machine:

The execution of 'ѵ', which generates the corresponding code 'aa', is:

(1) -/-> (2) - /a-> (3) -/-> (6) -e/-> (P) -e/e-> (2) -/a-> (4) -e/-> (7)*
[ Up | Prev | Next ]

Construction of the Conversion Rules

Thai syllables are composed of five components:
  1. Initial Consonants
  2. Vowel
  3. Final Consonants
  4. Tone
  5. Cancelled Letters

Initial Consonants

Final Consonants

[ Up | Prev | Next ]

Entry Matching

Two phases for a soundex system:
  1. Preparation.
  2. Query Processing.
The new key could be indexed by using trie. Then, the encoding machine could be directly cascaded to the trie. There are two advantages:
[ Up | Prev | Next ]

Evaluation

An Experimental System

The Test

The Results

Comparison on Similar Words Grouping:
Words [Udom83] [LK82] [Arun91] Our Model
ä  300000  0000    a 
ѹ 300000 0000 a, aa
ح 
 
330000* 
 
E490 
 
ٹ* 
 
ua, uoa, uaa, uca, uaa, ua
س 320000* E490 ٹ* ua, uaa, uιca, uιaa
Ѻô 559400* 5430* * aao, aaλaoa, aaoa, aaλacδc, aaλaaδc, aacδc, aaaδc, aaλacδa, aaλaaδa, aacδa, aaaδa, aaλao
ѻ 594000* 5300* * aao, aaoa, aoa, aaΫc, ac, aacΫc, aaaΫc, acΫc, aaΫc, aaΫa, aa, aacΫa, aaaΫa, acΫa, aaΫa, ao
000000* 7000* * a
200000* 9700* * a, aa, ac, aa
  000000  9700*  * a, aa, ac, aa
000000 B900* * a
900000* 0000* * c, cc, ac, ca, aa
͹ 630000* 4000* ͹* c, ca, coa, aoa, cιc, ccιc, acιc, caιc, aaιc,  
cιa, ccιa, acιa, caιa, aaιa, co, ao
300000* 6000 * aa, aa, coa, cca, caa, ccoa, acoa, caoa, aaoa, ccca, acca, caca, ccaa, aaca, acaa, caaa, aaaa
000000* 6000 * aa, aaa

Comparison on Different Words Separation:
 
Words [Udom83] [LK82] [Arun91] Our Model
ѭ 330000* 4900* * aa, aaa
ѹ 330000* 4900* * aa, aaιa
400000* 3600* մ* iΫo, iΫoa, iΫcc, iΫca
մ 400000* 3600* մ* iδa
000000* 6000 o, oa, cc, ca
000000* B600 e, ea, ec, ea, ae
000000* 7000 i, ia, ic, ia
000000* 7000* * i, ia, ic, ia
000000* 7000* * $
319000 7100* aa, aaa, aaΡa, caΡa, aaaΡa, caΡaa, cccΡaa, ...
ԡ 921000 7100* ¡ ai, aia, caia, aaia, aiΡc, caiΡc, aaiΡc, aiΡa, caiΡa, aaiΡa, cai, aai
ѡ 021000 7100* ¡ aia, ia, cia
090000 7000* ǹ ic, icc, iac, ica, iaa
000000 7000* a
Թ 320000 4700* ǹ* iιa, iaa, iιaa, ia
920000 4700* ǹ* aa, oaa, caa, aaa, oa, ca
 

[ Up | Prev | Next ]

Conclusions


A Demonstration

We have created an immediate demonstration page, providing a soundex search within the Thai Royal Institute Dictionary 2525 B.E. Edition. Try it with some words you're sure of their pronunciations but not of their spellings.


Last Updated: 1997-10-28
Theppitak Karoonboonyanan