Diacritics restoration in Vietnamese: letter based vs. syllable based model

  • Authors:
  • Kiem-Hieu Nguyen;Cheol-Young Ock

  • Affiliations:
  • Natural Language Processing Lab, School of Computer Engineering and Information Technology, University of Ulsan, Korea;Natural Language Processing Lab, School of Computer Engineering and Information Technology, University of Ulsan, Korea

  • Venue:
  • PRICAI'10 Proceedings of the 11th Pacific Rim international conference on Trends in artificial intelligence
  • Year:
  • 2010

Abstract

In this paper, we present several approaches to diacritics restoration in Vietnamese, based on letters and on syllables. Experiments with language-specific feature selection are conducted to evaluate the contribution of different types of features. Experimental results reveal that a combination of AdaBoost and C4.5, using a letter-based feature set, achieves 94.7% accuracy, which is competitive with other systems for diacritics restoration in Vietnamese. Test data for the diacritics restoration task in Vietnamese can be collected freely with simple preprocessing, whereas large test sets for many other natural language processing tasks in Vietnamese are lacking. Diacritics restoration could therefore be used as an application-driven evaluation framework for lexical disambiguation tasks.
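
The "simple preprocessing" mentioned above can be illustrated with a minimal sketch: stripping diacritics from ordinary Vietnamese text yields the input side of a test pair, and the original text serves as the gold answer. The function names `strip_diacritics` and `letter_examples`, the context window size, and the `_` padding symbol are assumptions made for illustration; the abstract does not describe the authors' actual feature set or pipeline.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return the text with Vietnamese diacritics removed."""
    # The letters đ/Đ have no combining-mark decomposition, so map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (Unicode category Mn) and keep everything else.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def letter_examples(text: str, window: int = 2):
    """Yield (context, target) pairs, one per letter (hypothetical feature layout).

    context: the stripped letter with `window` stripped characters on each side.
    target:  the original letter, diacritics included.
    """
    # After NFC every Vietnamese letter is a single code point, so the
    # original and stripped strings line up position by position.
    text = unicodedata.normalize("NFC", text)
    stripped = strip_diacritics(text)
    if len(stripped) != len(text):
        raise ValueError("text contains marks that do not compose under NFC")
    padded = "_" * window + stripped + "_" * window
    for i, target in enumerate(text):
        if target.isalpha():
            yield padded[i : i + 2 * window + 1], target
```

Each pair produced by `letter_examples` could serve as one training or test instance for a letter-level classifier, such as the boosted decision trees named in the abstract. A syllable-based variant would instead split on whitespace and predict the whole syllable.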