A Corpus-Based Approach for Automatic Thai Unknown Word Recognition using Ensemble Learning Techniques

  • Authors:
  • Jakkrit Techo; Cholwich Nattee; Thanaruk Theeramunkong

  • Affiliations:
  • School of Information, Computer and Communication Technology, Sirindhorn International Institute of Technology, Thammasat University, Pathumthani, Thailand 12000 (all three authors)

  • Venue:
  • PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
  • Year:
  • 2009

Abstract

This paper presents a corpus-based approach to automatic unknown word recognition in Thai. The approach applies an ensemble learning technique to build a model that classifies unknown word candidates using features obtained from a corpus. We propose a technique called "group-based evaluation by ranking". It clusters the unknown word candidates into groups based on the locations where they occur. Within each group, the highest-ranked candidate is then identified as an unknown word. In this task, positive instances are far fewer than negative instances, so the data set is unbalanced. To improve prediction accuracy, we apply a boosting technique with "voting under group-based evaluation by ranking". We have conducted experiments on real-world data to evaluate the performance of the proposed approach. The experiments compared the accuracy of our technique with that of an ordinary naïve Bayes technique. Our technique achieves an accuracy of 90.93±0.50% when only the first-ranked candidate is selected and 97.90±0.26% when candidates up to the tenth rank are considered. This is an improvement of 6.79% to 8.45%.
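The grouping-and-ranking idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no feature set, scoring function, or voting rule, so everything here is an assumption. Each candidate is a hypothetical `(group_id, text, score)` tuple, where `group_id` stands in for the location at which the candidate occurs and `score` stands in for whatever confidence a trained classifier assigns. The `vote_scores` helper is a plain weighted average, used only as a stand-in for voting across ensemble members.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical candidate record: (group_id, candidate_text, score).
# group_id is an assumed identifier for the location where the candidate occurs;
# score is an assumed classifier confidence (higher means more likely a word).
Candidate = Tuple[int, str, float]


def rank_within_groups(candidates: List[Candidate], top_k: int = 1) -> Dict[int, List[str]]:
    """Group candidates by location and keep the top_k highest-scoring ones per group."""
    groups: Dict[int, List[Tuple[str, float]]] = defaultdict(list)
    for group_id, text, score in candidates:
        groups[group_id].append((text, score))

    selected: Dict[int, List[str]] = {}
    for group_id, members in groups.items():
        # Sort by descending score; Python's sort is stable, so ties keep input order.
        ranked = sorted(members, key=lambda m: m[1], reverse=True)
        selected[group_id] = [text for text, _ in ranked[:top_k]]
    return selected


def vote_scores(per_model_scores: List[List[float]], weights: List[float]) -> List[float]:
    """Combine per-candidate scores from several models by weighted averaging.

    This is an assumed, simplified voting rule, not the scheme used in the paper.
    """
    total = sum(weights)
    n = len(per_model_scores[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, per_model_scores)) / total
        for i in range(n)
    ]


if __name__ == "__main__":
    # Placeholder candidates; the strings and scores are invented for illustration.
    cands = [(0, "cand_a", 0.2), (0, "cand_b", 0.9), (1, "cand_c", 0.5), (1, "cand_d", 0.4)]
    print(rank_within_groups(cands))           # first-rank selection
    print(rank_within_groups(cands, top_k=2))  # candidates up to the second rank
```

Raising `top_k` corresponds to the abstract's evaluation at deeper ranks (for example, up to the tenth rank): a group counts as correctly handled if the true unknown word appears anywhere in its returned list.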