Maximum Entropy Markov Models for Information Extraction and Segmentation
ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
BEST Corpus Development and Analysis
IALP '09 Proceedings of the 2009 International Conference on Asian Language Processing
Error bounds for convolutional codes and an asymptotically optimum decoding algorithm
IEEE Transactions on Information Theory
Word boundary ambiguity is a major problem in Thai morphological analysis because Thai words are written consecutively with no word delimiters. However, the part-of-speech (POS) tagged corpus used so far was built from academic papers, and no prior research has addressed documents written in informal language. This paper presents Thai morphological analysis with unknown word boundary detection using both POS-tagged and untagged corpora. The Viterbi algorithm and the Maximum Entropy (ME)-Viterbi algorithm are employed separately to evaluate our methods. The unknown word problem is handled by using string length to estimate word boundaries. The experiments are performed on documents written in formal language and on documents written in informal language. They show that the proposed method, which uses an untagged corpus in addition to the tagged corpus, is effective for text written in informal language.
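
To make the idea of Viterbi-style boundary detection concrete, the following is a minimal sketch and not the method of the paper. It assumes a plain unigram word model rather than a POS-based or ME-based one, uses a toy Latin-script lexicon instead of Thai data, and scores any out-of-lexicon string with a hypothetical length-based penalty (`unk_base` and `unk_per_char` are illustrative parameters, not values from the paper).

```python
import math


def segment(text, lexicon, max_word_len=10, unk_base=-10.0, unk_per_char=-2.0):
    """Split unspaced text into words with a Viterbi-style dynamic program.

    lexicon maps known words to probabilities (assumed unigram model).
    Strings not in the lexicon get a log-score that worsens with length;
    this penalty is an assumption made for illustration only.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-score of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            piece = text[start:end]
            if piece in lexicon:
                score = math.log(lexicon[piece])
            else:
                score = unk_base + unk_per_char * len(piece)
            candidate = best[start] + score
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start

    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return words[::-1]


# Toy lexicon (hypothetical probabilities).
toy_lexicon = {"the": 0.4, "cat": 0.3, "sat": 0.3}
print(segment("thecatsat", toy_lexicon))
print(segment("thecatxyzsat", toy_lexicon))
```

In this sketch the fixed cost `unk_base` discourages splitting an unknown string into many short pieces, while the per-character cost keeps an unknown span from absorbing neighbouring known words. A POS-aware variant would track tag states and transition scores in addition to character positions.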