Vietnamese Document Representation and Classification

  • Authors:
  • Giang-Son Nguyen;Xiaoying Gao;Peter Andreae

  • Affiliations:
  • School of Engineering and Computer Science, Victoria University of Wellington, New Zealand (all three authors)

  • Venue:
  • AI '09 Proceedings of the 22nd Australasian Joint Conference on Advances in Artificial Intelligence
  • Year:
  • 2009

Abstract

Vietnamese differs substantially from English, and little research has been done on Vietnamese document classification or, indeed, on any kind of Vietnamese language processing; only a few small corpora are available for research. We created a large Vietnamese text corpus of about 18,000 documents and manually classified them by criteria such as topic and style, yielding several classification tasks of varying difficulty. This paper introduces a new syllable-based document representation at the morphological level of the language for efficient classification. We tested the representation on our corpus across these classification tasks using six classification algorithms and two feature selection techniques. Our experiments show that the new representation is effective for Vietnamese categorization and suggest that the best performance is achieved with a syllable-pair document representation, an SVM with a polynomial kernel as the learning algorithm, and information gain combined with an external dictionary for feature selection.
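
As a rough illustration of the pipeline the abstract describes, the sketch below builds syllable-pair features and ranks them by information gain. It is not the authors' implementation: the abstract does not give the details, so the sketch assumes that syllables are separated by whitespace (as in written Vietnamese), that a "syllable pair" is two adjacent syllables, and that the external dictionary acts as a whitelist of allowed pairs. The function names, the toy texts, and the labels are all hypothetical.

```python
import math
from collections import Counter


def syllable_pairs(text):
    """Return adjacent syllable pairs, assuming whitespace-separated syllables."""
    syllables = text.lower().split()
    return [f"{a}_{b}" for a, b in zip(syllables, syllables[1:])]


def entropy(counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, feature):
    """Information gain of a feature's presence/absence w.r.t. the labels.

    `docs` is a list of sets of features, one set per document.
    """
    present = [lab for doc, lab in zip(docs, labels) if feature in doc]
    absent = [lab for doc, lab in zip(docs, labels) if feature not in doc]
    n = len(labels)
    conditional = 0.0
    for subset in (present, absent):
        if subset:
            conditional += len(subset) / n * entropy(Counter(subset).values())
    return entropy(Counter(labels).values()) - conditional


def select_features(docs, labels, k, dictionary=None):
    """Keep the k features with the highest information gain.

    Assumption: if `dictionary` (a set of pairs) is given, only pairs that
    appear in it are candidates. Ties are broken alphabetically.
    """
    vocab = set().union(*docs)
    if dictionary is not None:
        vocab &= dictionary
    ranked = sorted(vocab, key=lambda f: (-information_gain(docs, labels, f), f))
    return ranked[:k]


# Hypothetical toy corpus: two sports texts and two finance texts.
texts = [
    "bóng đá việt nam",
    "đội bóng đá thắng",
    "thị trường chứng khoán",
    "chứng khoán việt nam",
]
labels = ["sport", "sport", "finance", "finance"]
docs = [set(syllable_pairs(t)) for t in texts]
top_pairs = select_features(docs, labels, k=2)
```

In this toy corpus the pairs for "bóng đá" and "chứng khoán" each occur in exactly one class, so they rank highest, while the pair for "việt nam" occurs in both classes and has zero information gain. The selected pairs could then be used as binary or count features for a classifier such as an SVM with a polynomial kernel, the configuration the abstract reports as performing best.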