Automatic corpus-based Thai word extraction with the C4.5 learning algorithm

  • Authors:
  • Virach Sornlertlamvanich;Tanapong Potipiti;Thatsanee Charoenporn

  • Affiliations:
  • Ministry of Science and Technology Environment, Bangkok, Thailand;Ministry of Science and Technology Environment, Bangkok, Thailand;Ministry of Science and Technology Environment, Bangkok, Thailand

  • Venue:
  • COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 2
  • Year:
  • 2000

Abstract

"Word" is difficult to define in languages that do not exhibit explicit word boundaries, such as Thai. Traditional methods of defining words for such languages depend on human judgement, which is based on unclear criteria or procedures, and they have several limitations. This paper proposes an algorithm for word extraction from Thai texts that does not rely on word segmentation. We employ the C4.5 learning algorithm for this task. Several attributes, such as string length, frequency, mutual information and entropy, are chosen for word/non-word determination. Our experiment yields high precision, about 85%, on both the training and test corpora.
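
The abstract names the attributes but not their formulas, so the following is only a minimal sketch of how such attributes might be computed for one candidate string in an unsegmented text. The function names (`count_occurrences`, `entropy`, `candidate_features`), the use of left/right neighbour-character entropy, and the particular pointwise mutual information variant are all assumptions made for illustration, not the authors' definitions.

```python
import math
from collections import Counter


def count_occurrences(text, s):
    """Count (possibly overlapping) occurrences of s in text."""
    return sum(1 for i in range(len(text) - len(s) + 1) if text.startswith(s, i))


def entropy(counts):
    """Shannon entropy (in bits) of a Counter of observed symbols."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def candidate_features(text, candidate):
    """Illustrative attributes for one candidate substring (hypothetical helper)."""
    n = len(candidate)
    positions = [i for i in range(len(text) - n + 1) if text.startswith(candidate, i)]

    # Distribution of the characters seen just before and just after the candidate.
    left = Counter(text[i - 1] for i in positions if i > 0)
    right = Counter(text[i + n] for i in positions if i + n < len(text))

    # Assumed PMI variant: association between the candidate's prefix and its
    # last character, with probabilities estimated from raw substring counts.
    pmi = None
    if n >= 2 and positions:
        p_whole = len(positions) / (len(text) - n + 1)
        p_prefix = count_occurrences(text, candidate[:-1]) / (len(text) - n + 2)
        p_last = count_occurrences(text, candidate[-1]) / len(text)
        pmi = math.log2(p_whole / (p_prefix * p_last))

    return {
        "length": n,
        "frequency": len(positions),
        "left_entropy": entropy(left),
        "right_entropy": entropy(right),
        "pmi": pmi,
    }


features = candidate_features("abcabcabd", "ab")
```

In a setup like the one the abstract describes, a feature vector of this kind would be computed for each candidate string and passed, together with a word/non-word label, to a decision-tree learner such as C4.5.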