Using tagged and untagged corpora to improve Thai morphological analysis with unknown word boundary detections

Authors:
Wimvipa Luangpiensamut;Kanako Komiya;Yoshiyuki Kotani
Affiliations:
Tokyo University of Agriculture and Technology, Koganei, Tokyo, Japan;Tokyo University of Agriculture and Technology, Koganei, Tokyo, Japan;Tokyo University of Agriculture and Technology, Koganei, Tokyo, Japan
Venue:
PRICAI'12 Proceedings of the 12th Pacific Rim international conference on Trends in Artificial Intelligence
Year:
2012

Citing 3
Cited 0

Maximum Entropy Markov Models for Information Extraction and Segmentation
ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning

BEST Corpus Development and Analysis
IALP '09 Proceedings of the 2009 International Conference on Asian Language Processing

Error bounds for convolutional codes and an asymptotically optimum decoding algorithm
IEEE Transactions on Information Theory

Abstract

Word boundary ambiguity is a major problem in Thai morphological analysis because Thai words are written consecutively with no word delimiters. However, the part-of-speech (POS) tagged corpus used so far was constructed from academic papers, and no prior research has addressed documents written in informal language. This paper presents Thai morphological analysis with unknown word boundary detection using both POS-tagged and untagged corpora. The Viterbi algorithm and the Maximum Entropy (ME)-Viterbi algorithm are each employed separately to evaluate our methods. The unknown word problem is handled by using string length to estimate word boundaries. The experiments are performed on documents written in formal language and on documents written in informal language. They show that the proposed method, which uses an untagged corpus in addition to the tagged corpus, is effective for text written in informal language.
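
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea: Viterbi-style dynamic programming over candidate word boundaries in undelimited text, with a length-dependent score for strings not found in the lexicon. Everything in it is an assumption made for illustration and is not the authors' method: the function name `segment`, the toy lexicon, the unigram log-probabilities, and the penalty parameters `unk_base` and `unk_per_char` are all hypothetical, and POS tags and the ME model are left out entirely. ASCII strings stand in for Thai text.

```python
import math

# Hypothetical toy lexicon with unigram log-probabilities (illustrative values only).
TOY_LEXICON = {
    "the": math.log(0.4),
    "cat": math.log(0.3),
    "sat": math.log(0.3),
}

def segment(text, word_logprob, max_len=6, unk_base=-10.0, unk_per_char=-2.0):
    """Return the highest-scoring segmentation of `text` into words.

    Known words are scored by their log-probability. Any other substring is
    treated as an unknown word and scored by an assumed length-based penalty.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score of any segmentation of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in that segmentation
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            score = word_logprob.get(piece)
            if score is None:
                # Assumed unknown-word model: fixed cost plus a per-character cost.
                score = unk_base + unk_per_char * len(piece)
            total = best[start] + score
            if total > best[end]:
                best[end] = total
                back[end] = start

    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return words[::-1]

print(segment("thecatsat", TOY_LEXICON))
print(segment("thezzzcat", TOY_LEXICON))
```

In this sketch the fixed cost `unk_base` makes one longer unknown word cheaper than several one-character unknown words, which is one simple way string length can influence where an unknown word's boundaries are placed.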