Simultaneous character-cluster-based word segmentation and named entity recognition in Thai language

  • Authors:
  • Nattapong Tongtep;Thanaruk Theeramunkong

  • Affiliations:
  • School of Information, Computer, and Communication Technology, Sirindhorn International Institute of Technology, Thammasat University, Pathum Thani, Thailand;School of Information, Computer, and Communication Technology, Sirindhorn International Institute of Technology, Thammasat University, Pathum Thani, Thailand

  • Venue:
  • KICSS'10 Proceedings of the 5th international conference on Knowledge, information, and creativity support systems
  • Year:
  • 2010

Abstract

Named entity recognition in inherent-vowel alphabetic languages such as Burmese, Khmer, Lao, Tamil, Telugu, Bali, and Thai is difficult because there are no explicit boundaries between words or sentences. This paper presents a novel method that exploits the concept of character clusters, i.e., sequences of inseparable characters. The method groups characters into clusters and uses statistics over characters and their clusters to extract Thai words and recognize named entities simultaneously. It integrates two phases, a word-segmentation model and a named-entity-recognition model. Context features are used to learn the parameters of these two discriminative probabilistic models, both conditional random fields (CRFs), which rank the generated word and named-entity candidates. The experimental results show that our method significantly increases the performance of word segmentation and named entity recognition, with F-measures of 96.14% and 83.68%, respectively.
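To make the notion of a character cluster concrete, the following is a minimal sketch of grouping Thai characters into inseparable units. The abstract does not give the paper's actual cluster rules, so the two rules used here are simplifying assumptions: a leading vowel (U+0E40 to U+0E44) binds to the character that follows it, and an above/below vowel or tone mark binds to the character that precedes it. The function name `char_clusters` and both character sets are hypothetical and are not taken from the paper.

```python
# Simplified character clustering for Thai text (illustrative assumptions only).

# Leading vowels that are written before the consonant they belong to.
LEADING_VOWELS = {chr(c) for c in range(0x0E40, 0x0E45)}

# Marks drawn above or below a base character: mai han-akat, sara i..phinthu,
# and maitaikhu..yamakkan (tone marks and related signs).
COMBINING_MARKS = (
    {"\u0e31"}
    | {chr(c) for c in range(0x0E34, 0x0E3B)}
    | {chr(c) for c in range(0x0E47, 0x0E4F)}
)


def char_clusters(text):
    """Split text into clusters of characters that should not be separated."""
    clusters = []
    attach_next = False  # True when the previous character was a leading vowel
    for ch in text:
        if clusters and (attach_next or ch in COMBINING_MARKS):
            clusters[-1] += ch  # extend the current cluster
        else:
            clusters.append(ch)  # start a new cluster
        attach_next = ch in LEADING_VOWELS
    return clusters


if __name__ == "__main__":
    # The leading vowel, consonant, and mark at the start form one cluster.
    print(char_clusters("\u0e40\u0e1b\u0e47\u0e19\u0e04\u0e19"))
```

Under these assumptions, a word boundary can only fall between clusters, so a downstream segmentation model needs to consider fewer candidate split points than a character-level model would. A full rule set would cover more cases, such as following vowels and final consonants.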