Syllabification is the process of extracting syllables from a word. Syllabification errors are mainly caused by unknown and ambiguous words. This research aims to resolve these problems for the Thai language by exploiting relationships among the characters in a word. A character clustering scheme is proposed to generate units smaller than a syllable, called Thai Minimum Clusters (TMCs), from a word. The TMCs are then merged into syllables using a trigram statistical model. The technique is evaluated on a standard data set of 77,303 words, on which it achieves 97.61% accuracy.
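The merging step can be illustrated with a minimal sketch. The abstract does not say which units the trigram model is defined over, how it is smoothed, or how candidate merges are searched, so everything below is an assumption: the trigram is taken over syllables, add-one smoothing is used, and all ways of merging adjacent TMCs are enumerated exhaustively, which is feasible only because a word contains few TMCs. The function names and the Latin placeholder units standing in for Thai TMCs are hypothetical.

```python
import math
from collections import Counter
from itertools import product

BOS, EOS = "<s>", "</s>"  # padding symbols (an assumption of this sketch)

def train_trigram(segmented_words):
    """Count syllable trigrams and their two-syllable contexts."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for syllables in segmented_words:
        seq = [BOS, BOS] + list(syllables) + [EOS]
        vocab.update(seq)
        for i in range(2, len(seq)):
            tri[tuple(seq[i - 2:i + 1])] += 1
            ctx[tuple(seq[i - 2:i])] += 1
    return tri, ctx, len(vocab)

def log_prob(syllables, model):
    """Add-one smoothed trigram log-probability of a syllable sequence."""
    tri, ctx, vocab_size = model
    seq = [BOS, BOS] + list(syllables) + [EOS]
    total = 0.0
    for i in range(2, len(seq)):
        num = tri[tuple(seq[i - 2:i + 1])] + 1
        den = ctx[tuple(seq[i - 2:i])] + vocab_size
        total += math.log(num / den)
    return total

def candidate_merges(tmcs):
    """Yield every way of merging adjacent TMCs into syllables."""
    for cuts in product([False, True], repeat=len(tmcs) - 1):
        syllables, current = [], tmcs[0]
        for tmc, cut in zip(tmcs[1:], cuts):
            if cut:  # syllable boundary before this TMC
                syllables.append(current)
                current = tmc
            else:    # this TMC joins the current syllable
                current += tmc
        syllables.append(current)
        yield syllables

def merge_tmcs(tmcs, model):
    """Return the candidate merge with the highest trigram score."""
    if not tmcs:
        return []
    return max(candidate_merges(tmcs), key=lambda s: log_prob(s, model))

# Toy training data: words already split into (placeholder) syllables.
toy_corpus = [["kan", "ra"]] * 3 + [["kan", "da"], ["ma", "ra"]]
model = train_trigram(toy_corpus)
print(merge_tmcs(["ka", "n", "ra"], model))
```

With this toy corpus, the three placeholder units are merged into the two syllables seen most often in training. Exhaustive enumeration grows as 2^(n-1) in the number of TMCs, so a dynamic-programming search would be the natural replacement for longer inputs.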