Vietnamese Document Representation and Classification

Authors:
Giang-Son Nguyen;Xiaoying Gao;Peter Andreae
Affiliations:
School of Engineering and Computer Science, Victoria University of Wellington, New Zealand;School of Engineering and Computer Science, Victoria University of Wellington, New Zealand;School of Engineering and Computer Science, Victoria University of Wellington, New Zealand
Venue:
AI '09 Proceedings of the 22nd Australasian Joint Conference on Advances in Artificial Intelligence
Year:
2009

Abstract

Vietnamese differs substantially from English, yet little research has been done on Vietnamese document classification, or indeed on any kind of Vietnamese language processing, and only a few small corpora are available for research. We created a large Vietnamese text corpus of about 18,000 documents and manually classified them by different criteria, such as topic and style, yielding several classification tasks of varying difficulty. This paper introduces a new syllable-based document representation at the morphological level of the language for efficient classification. We tested the representation on our corpus across these classification tasks using six classification algorithms and two feature selection techniques. Our experiments show that the new representation is effective for Vietnamese categorization, and suggest that the best performance is achieved using a syllable-pair document representation, an SVM with a polynomial kernel as the learning algorithm, and information gain together with an external dictionary for feature selection.
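
To make the idea of a syllable-pair representation concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not define how pairs are formed, so the sketch assumes a "syllable pair" is two adjacent syllables, relying on the fact that written Vietnamese separates syllables with spaces. The function names `syllables` and `syllable_pairs`, and the use of a plain Python set as the "external dictionary", are hypothetical choices made for illustration only.

```python
import re
import unicodedata
from collections import Counter


def syllables(text):
    """Split text into lowercase syllables (runs of letters)."""
    # Normalize to NFC so letters with diacritics are single code points;
    # otherwise combining marks would break a syllable apart.
    text = unicodedata.normalize("NFC", text).lower()
    # [^\W\d_]+ matches runs of Unicode letters only.
    return re.findall(r"[^\W\d_]+", text)


def syllable_pairs(text, dictionary=None):
    """Count adjacent syllable pairs (assumed definition of a pair).

    If `dictionary` (a set of NFC-normalized, lowercase pairs) is given,
    only pairs found in it are kept. This stands in for dictionary-based
    feature selection and is an assumption, not the paper's procedure.
    """
    syl = syllables(text)
    pairs = [f"{a} {b}" for a, b in zip(syl, syl[1:])]
    if dictionary is not None:
        pairs = [p for p in pairs if p in dictionary]
    return Counter(pairs)


if __name__ == "__main__":
    # "học sinh" (student) and "sinh học" (biology) share syllables
    # but differ as pairs, which single-syllable features cannot capture.
    print(syllable_pairs("Học sinh học sinh học."))
```

The resulting counts could then be fed to any standard classifier as a bag-of-features vector. Note that this sketch also pairs syllables across sentence boundaries; whether the original representation does so is not stated in the abstract.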