LearnLexTo: a machine-learning based word segmentation for indexing Thai texts

Authors:
Choochart Haruechaiyasak;Sarawoot Kongyoung;Chaianun Damrongrat
Affiliations:
National Electronics and Computer Technology Center (NECTEC), Pathumthani, Thailand;National Electronics and Computer Technology Center (NECTEC), Pathumthani, Thailand;National Electronics and Computer Technology Center (NECTEC), Pathumthani, Thailand
Venue:
Proceedings of the 2nd ACM workshop on Improving non english web searching
Year:
2008




Abstract

Thai is an unsegmented language: words are written continuously, without word delimiters. To index Thai texts with an inverted index, a word segmentation algorithm is usually required to tokenize a text into a series of terms. Recent work on word segmentation has reported Conditional Random Fields (CRFs) as the best-performing machine learning algorithm, outperforming both the dictionary-based approach and other machine learning algorithms. Our main contribution is a new hybrid approach, LearnLexTo, which further improves the CRF model by integrating the dictionary-based approach. The key idea is to resolve ambiguity in the CRF model's output by using the dictionary-based approach, which relies on a set of valid words. Experimental results show that the proposed hybrid approach yields the highest F1 value, 88.46%, compared to 82.07% for the dictionary-based approach and 85.71% for the CRF model.
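
The dictionary-based baseline and the indexing step mentioned in the abstract can be illustrated with a minimal sketch. It assumes greedy longest matching against a word set, which is one common dictionary-based strategy and is not necessarily the one used in the paper. It does not implement the CRF model or the LearnLexTo hybrid. The function names, the four-word lexicon, and the two toy documents are all hypothetical.

```python
from collections import defaultdict


def longest_match_segment(text, lexicon):
    """Greedy left-to-right longest matching against a set of valid words."""
    max_len = max(map(len, lexicon))
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when no dictionary word starts at position i (an unknown word).
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1
        tokens.append(text[i:j])
        i = j
    return tokens


def build_inverted_index(docs, lexicon):
    """Map each segmented term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in longest_match_segment(text, lexicon):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical toy lexicon and documents (not taken from the paper).
lexicon = {"ตา", "กลม", "ตาก", "ลม"}
docs = {1: "ตากลม", 2: "ลมตาก"}

tokens = longest_match_segment(docs[1], lexicon)
index = build_inverted_index(docs, lexicon)
print(tokens)
print(index)
```

The string "ตากลม" can be read as either "ตา" + "กลม" or "ตาก" + "ลม". Greedy longest matching always commits to the second reading, whatever the context. This is the kind of segmentation ambiguity that context-aware models such as CRFs are meant to handle.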