A minimum cluster-based trigram statistical model for Thai syllabification

  • Authors:
  • Chonlasith Jucksriporn;Ohm Sornil

  • Affiliations:
  • Department of Computer Science, National Institute of Development Administration, Bangkok, Thailand;Department of Computer Science, National Institute of Development Administration, Bangkok, Thailand

  • Venue:
  • CICLing'11 Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part II
  • Year:
  • 2011

Abstract

Syllabification is the process of extracting syllables from a word. Syllabification errors are caused mainly by unknown and ambiguous words. This research aims to resolve these problems for the Thai language by exploiting relationships among the characters in a word. A character clustering scheme is proposed to generate units smaller than a syllable, called Thai Minimum Clusters (TMCs), from a word. TMCs are then merged into syllables using a trigram statistical model. Experimental evaluations are performed to assess the effectiveness of the proposed technique on a standard data set of 77,303 words. The results show that the technique yields 97.61% accuracy.
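
The abstract does not give the clustering rules or the exact form of the trigram model, so the following is only a minimal sketch of one possible reading of the merging step, not the authors' implementation. It assumes that a word has already been split into minimum clusters, that every syllable is a concatenation of consecutive clusters, and that candidate groupings are ranked by a syllable-level trigram model with add-one style smoothing. The function names (`train`, `log_prob`, `segmentations`, `syllabify`) and the romanized toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

BOS, EOS = "<s>", "</s>"  # hypothetical padding symbols


def train(corpus):
    """Count syllable trigrams and their two-syllable contexts.

    corpus: list of words, each given as a list of syllable strings.
    """
    tri, bi, vocab = Counter(), Counter(), set()
    for syllables in corpus:
        seq = [BOS, BOS] + list(syllables) + [EOS]
        vocab.update(seq)
        for i in range(2, len(seq)):
            tri[tuple(seq[i - 2:i + 1])] += 1
            bi[tuple(seq[i - 2:i])] += 1
    return tri, bi, len(vocab)


def log_prob(syllables, model):
    """Log-score of a syllable sequence under the smoothed trigram counts."""
    tri, bi, vocab_size = model
    seq = [BOS, BOS] + list(syllables) + [EOS]
    total = 0.0
    for i in range(2, len(seq)):
        num = tri[tuple(seq[i - 2:i + 1])] + 1        # add-one smoothing (assumption)
        den = bi[tuple(seq[i - 2:i])] + vocab_size
        total += math.log(num / den)
    return total


def segmentations(units):
    """Yield every way to merge consecutive units into syllables."""
    n = len(units)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            yield ["".join(units[a:b]) for a, b in zip(bounds, bounds[1:])]


def syllabify(units, model):
    """Return the highest-scoring grouping of the units (units must be non-empty)."""
    return max(segmentations(units), key=lambda s: log_prob(s, model))


if __name__ == "__main__":
    # Toy romanized data, purely illustrative; not taken from the paper.
    model = train([["sa", "wat", "di"]] * 10)
    print(syllabify(["sa", "wa", "t", "di"], model))  # ['sa', 'wat', 'di']
```

Enumerating all groupings is exponential in the number of clusters, which is tolerable for single words but would normally be replaced by dynamic programming. With very little training data, the add-one smoothing used here can favor groupings with fewer syllables, so the toy corpus repeats its one word several times.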