Non-dictionary-based Thai word segmentation using decision trees

Authors:
Thanaruk Theeramunkong;Sasiporn Usanavasin
Affiliations:
Thammasat University, Pathumthani, Thailand;Thammasat University, Pathumthani, Thailand
Venue:
HLT '01 Proceedings of the first international conference on Human language technology research
Year:
2001

Abstract

For languages without word boundary delimiters, dictionaries are needed for segmenting running text. This makes segmentation accuracy depend heavily on the quality of the dictionary used for analysis. If the dictionary is not sufficiently good, a great number of words will be unknown or unrecognized, and these unrecognized words reduce segmentation accuracy. To solve this problem, we propose a method based on decision tree models. Without the use of a dictionary, specific information, called syntactic attributes, is applied to identify the structure of Thai words. C4.5 is used as the tool for this purpose. Using a Thai corpus, experimental results show that our method outperforms two well-known dictionary-dependent techniques, the maximum matching and longest matching methods, when no dictionary is available.
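
To make the dictionary dependence concrete, the following is a minimal sketch of a greedy longest matching segmenter, one of the two dictionary-dependent baselines named in the abstract. It is not the authors' implementation: the function name `longest_matching`, the toy English word list, and the fallback of emitting a single character when no entry matches are all assumptions made for illustration.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation: at each position, take the
    longest dictionary entry that starts there (illustrative sketch)."""
    max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink one character at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            # Assumed fallback: no entry starts here, so emit one unknown character.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionary (English letters stand in for Thai script).
toy_dictionary = {"the", "cat", "sat", "on", "mat"}

full = longest_matching("thecatsatonthemat", toy_dictionary)
reduced = longest_matching("thecatsatonthemat", toy_dictionary - {"cat"})

print(full)     # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(reduced)  # ['the', 'c', 'a', 't', 'sat', 'on', 'the', 'mat']
```

Removing a single entry from the toy dictionary breaks the word "cat" into three unknown fragments, which is the kind of degradation the abstract attributes to an insufficient dictionary.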