Using tagged and untagged corpora to improve Thai morphological analysis with unknown word boundary detections

  • Authors:
  • Wimvipa Luangpiensamut;Kanako Komiya;Yoshiyuki Kotani

  • Affiliations:
  • Tokyo University of Agriculture and Technology, Koganei, Tokyo, Japan;Tokyo University of Agriculture and Technology, Koganei, Tokyo, Japan;Tokyo University of Agriculture and Technology, Koganei, Tokyo, Japan

  • Venue:
  • PRICAI'12 Proceedings of the 12th Pacific Rim international conference on Trends in Artificial Intelligence
  • Year:
  • 2012

Abstract

Word boundary ambiguity is a major problem in Thai morphological analysis because Thai words are written consecutively with no word delimiters. However, the part-of-speech (POS) tagged corpus used so far was constructed from academic papers, and no prior research has addressed documents written in informal language. This paper presents Thai morphological analysis with unknown word boundary detection using both POS-tagged and untagged corpora. The Viterbi algorithm and the Maximum Entropy (ME)-Viterbi algorithm are employed separately to evaluate our methods. The unknown word problem is handled by using string length to estimate word boundaries. The experiments are performed on documents written in formal language and on documents written in informal language. They show that the proposed method, which uses an untagged corpus in addition to the tagged corpus, is effective for text written in informal language.
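
The general idea of Viterbi-style segmentation with a length-based treatment of unknown strings can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain unigram word model in place of the POS-tagged models named in the abstract, and the function name `segment`, the toy lexicon, and the penalty parameters `unk_base` and `unk_per_char` are all hypothetical choices made for illustration. ASCII strings stand in for unsegmented Thai text.

```python
import math


def segment(text, word_logprob, max_len=6, unk_base=-8.0, unk_per_char=-2.0):
    """Return the highest-scoring segmentation of `text` (Viterbi search).

    word_logprob maps known words to log-probabilities (assumed unigram model).
    Unknown substrings get a hypothetical penalty that grows with their length.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in word_logprob:
                score = word_logprob[piece]
            else:
                # Assumption: unknown-word cost depends only on string length.
                score = unk_base + unk_per_char * len(piece)
            candidate = best[start] + score
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start

    # Follow back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Toy lexicon with made-up probabilities.
lexicon = {"the": math.log(0.4), "cat": math.log(0.3), "sat": math.log(0.3)}
print(segment("thecatsat", lexicon))
print(segment("thezzcat", lexicon))
```

Because each unknown piece pays a fixed base cost in this sketch, a run of unknown characters tends to be kept together as one candidate word rather than split into single characters. How the paper actually maps string length to boundary estimates is not stated in the abstract.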