Machine learning in automated text categorization
ACM Computing Surveys (CSUR)
POS-tagger for English-Vietnamese bilingual corpus
HLT-NAACL-PARALLEL '03 Proceedings of the HLT-NAACL 2003 Workshop on Building and using parallel texts: data driven machine translation and beyond - Volume 3
Phoneme Based Representation for Vietnamese Web Page Classification
WI-IAT '11 Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Hi-index | 0.00 |
Vietnamese is very different from English and little research has been done on Vietnamese document classification, or indeed, on any kind of Vietnamese language processing, and only a few small corpora are available for research. We created a large Vietnamese text corpus with about 18000 documents, and manually classified them based on different criteria such as topics and styles, giving several classification tasks of different difficulty levels. This paper introduces a new syllable-based document representation at the morphological level of the language for efficient classification. We tested the representation on our corpus with different classification tasks using six classification algorithms and two feature selection techniques. Our experiments show that the new representation is effective for Vietnamese categorization, and suggest that best performance can be achieved using syllable-pair document representation, an SVM with a polynomial kernel as the learning algorithm, and using Information gain and an external dictionary for feature selection.