Simultaneous character-cluster-based word segmentation and named entity recognition in Thai language

Authors:
Nattapong Tongtep;Thanaruk Theeramunkong
Affiliations:
School of Information, Computer, and Communication Technology, Sirindhorn International Institute of Technology, Thammasat University, Pathum Thani, Thailand;School of Information, Computer, and Communication Technology, Sirindhorn International Institute of Technology, Thammasat University, Pathum Thani, Thailand
Venue:
KICSS'10 Proceedings of the 5th international conference on Knowledge, information, and creativity support systems
Year:
2010

Abstract

Named entity recognition in inherent-vowel alphabetic languages such as Burmese, Khmer, Lao, Tamil, Telugu, Balinese, and Thai is difficult because there are no explicit boundaries between words or sentences. This paper presents a novel method that exploits the concept of character clusters, sequences of inseparable characters, to group characters into clusters and uses statistics among characters and their clusters to extract Thai words and recognize named entities simultaneously. The method integrates two phases, a word-segmentation model and a named-entity-recognition model; context features are exploited to learn the parameters of these two discriminative probabilistic models, i.e., CRFs, which rank the generated set of word and named-entity candidates. Experimental results show that the method significantly improves the performance of word segmentation and named entity recognition, with F-measures of 96.14% and 83.68%, respectively.
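
As an illustration of the character-cluster idea only, the following minimal Python sketch groups Thai characters into inseparable units. It is not the authors' rule set: the function name `character_clusters`, the two vowel sets, and the three simplified rules (combining marks and a few following vowels attach to the preceding character; a leading vowel attaches to the character after it) are assumptions made for this example, and real character-cluster grammars for Thai contain more rules.

```python
import unicodedata

# Assumption: simplified rules chosen for illustration, not the paper's grammar.
# Leading vowels (SARA E, AE, O, AI MAIMUAN, AI MAIMALAI) are written before
# the consonant they belong to, so they bind to the NEXT character.
LEADING_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")
# Following vowels (SARA A, SARA AA, SARA AM) bind to the PREVIOUS character.
FOLLOWING_VOWELS = set("\u0e30\u0e32\u0e33")


def is_dependent(ch):
    """Return True if ch cannot be separated from the character before it."""
    # Unicode category Mn covers Thai above/below vowels and tone marks.
    return unicodedata.category(ch) == "Mn" or ch in FOLLOWING_VOWELS


def character_clusters(text):
    """Split text into a list of character clusters using the rules above."""
    clusters = []
    for ch in text:
        previous_needs_more = bool(clusters) and clusters[-1][-1] in LEADING_VOWELS
        if clusters and (is_dependent(ch) or previous_needs_more):
            clusters[-1] += ch  # extend the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters
```

Under these assumptions, `character_clusters("ไทย")` yields the two clusters `ไท` and `ย`. In a pipeline of the kind the abstract describes, clusters like these, rather than single characters, would be the units over which a sequence model such as a CRF is trained.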