Diacritics restoration in Vietnamese: letter based vs. syllable based model

Authors:
Kiem-Hieu Nguyen;Cheol-Young Ock
Affiliations:
Natural Language Processing Lab, School of Computer Engineering and Information Technology, University of Ulsan, Korea;Natural Language Processing Lab, School of Computer Engineering and Information Technology, University of Ulsan, Korea
Venue:
PRICAI'10 Proceedings of the 11th Pacific Rim international conference on Trends in artificial intelligence
Year:
2010

Abstract

In this paper, we present several approaches to diacritics restoration in Vietnamese, based on letters and on syllables. Experiments with language-specific feature selection are conducted to evaluate the contribution of different feature types. Experimental results show that a combination of AdaBoost and C4.5, using a letter-based feature set, achieves 94.7% accuracy, which is competitive with other systems for diacritics restoration in Vietnamese. Test data for the diacritics restoration task in Vietnamese can be collected freely with simple preprocessing, whereas large test sets for many other natural language processing tasks in Vietnamese are lacking. Diacritics restoration could therefore serve as an application-driven evaluation framework for lexical disambiguation tasks.
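
The abstract does not spell out the "simple preprocessing" used to collect test data. One plausible reading is that diacritics are stripped from ordinary Vietnamese text so that the original text acts as the gold answer. The Python sketch below illustrates that reading only; the function names `strip_diacritics`, `make_pairs`, and `syllable_accuracy` are hypothetical, and scoring by whitespace-separated syllables is an assumption rather than the authors' stated evaluation protocol.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese diacritics, leaving plain base letters."""
    # "đ" and "Đ" have no canonical Unicode decomposition, so map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (tone marks, circumflex, breve, horn).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def make_pairs(sentences):
    """Build (input without diacritics, gold with diacritics) pairs."""
    return [(strip_diacritics(s), s) for s in sentences]


def syllable_accuracy(restored: str, gold: str) -> float:
    """Fraction of whitespace-separated syllables restored exactly.

    Assumption: syllables are delimited by whitespace and both strings
    contain the same number of them.
    """
    restored_syllables = restored.split()
    gold_syllables = gold.split()
    if len(restored_syllables) != len(gold_syllables):
        raise ValueError("restored and gold must have the same number of syllables")
    if not gold_syllables:
        return 0.0
    correct = sum(r == g for r, g in zip(restored_syllables, gold_syllables))
    return correct / len(gold_syllables)
```

Under these assumptions, any diacritized Vietnamese corpus yields evaluation pairs without manual annotation, which is the property the abstract points to when it proposes the task as an evaluation framework.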