For languages without word boundary delimiters, dictionaries are needed to segment running text. This makes segmentation accuracy depend heavily on the quality of the dictionary used for analysis. If the dictionary is inadequate, many words will be unknown or unrecognized, and these unrecognized words reduce segmentation accuracy. To solve this problem, we propose a method based on decision tree models. Without using a dictionary, specific information, called syntactic attributes, is applied to identify the structure of Thai words. C4.5 is used as the tool for this purpose. Using a Thai corpus, experimental results show that our method outperforms well-known dictionary-dependent techniques, namely the maximum matching and longest matching methods, when no dictionary is available.
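To make the dictionary dependence concrete, here is a minimal sketch of a greedy longest matching segmenter, one of the baseline techniques named above. It is not the authors' implementation: the function name `longest_matching`, the fallback of emitting an unmatched character as a one-character token, and the toy Latin-script input are all assumptions made for illustration. The decision tree method itself is not sketched, because the abstract does not specify its syntactic attributes.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation (illustrative sketch).

    At each position, take the longest dictionary entry that starts there.
    If nothing matches, emit one character as an unknown token (an assumed
    fallback, not taken from the paper).
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try candidate lengths from longest to shortest.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
        else:
            # No dictionary entry starts here: treat the character as unknown.
            tokens.append(text[i])
            i += 1
    return tokens


# Toy delimiter-free input; a Thai sentence would be handled the same way.
full_dictionary = {"the", "cat", "sat"}
partial_dictionary = {"the", "sat"}  # "cat" is missing

print(longest_matching("thecatsat", full_dictionary))
print(longest_matching("thecatsat", partial_dictionary))
```

With the full dictionary the input is split into its three words. Once "cat" is missing, the same input breaks into single-character fragments around the gap, which is the loss of accuracy from unknown words that the abstract describes.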