Vietnamese Document Representation and Classification
AI '09 Proceedings of the 22nd Australasian Joint Conference on Advances in Artificial Intelligence
This paper proposes a novel text representation for Web pages written in Vietnamese. The representation is based on an analysis of Vietnamese documents at the phonetic level, in which each document is represented as a bag of phonemes. It is designed to capture sound-based information in documents and to help with several non-topic text classification problems: automatic Vietnamese language identification of a document, ancient Vietnamese document detection, author identification, and poem identification. We apply several typical machine learning methods, including naive Bayes (NB), k-nearest neighbors (KNN), and support vector machines (SVMs), to build text classifiers. The experimental results show a significant improvement in both effectiveness and efficiency over the traditional syllable-based representation in most cases.
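To make the contrast between the two representations concrete, the following is a minimal sketch and not the paper's implementation. It assumes a hypothetical lookup table, `TOY_PHONEMES`, that splits a handful of syllables into onset, rhyme, and tone labels; the labels and the helper names (`to_phonemes`, `bag_of_phonemes`, `bag_of_syllables`, `cosine`) are invented for illustration, and a real system would need a full Vietnamese phonetic analysis in place of the table.

```python
import math
import unicodedata
from collections import Counter


def nfc(text):
    """Normalize to NFC so composed and decomposed diacritics compare equal."""
    return unicodedata.normalize("NFC", text)


# Hypothetical toy table: syllable -> [onset, rhyme, tone] labels.
# Illustrative only; this is NOT the phoneme inventory used in the paper.
TOY_PHONEMES = {
    nfc(k): v
    for k, v in {
        "việt": ["v", "iet", "tone_nang"],
        "nam": ["n", "am", "tone_level"],
        "văn": ["v", "an_short", "tone_level"],
        "bản": ["b", "an", "tone_hoi"],
    }.items()
}


def to_phonemes(syllable):
    """Map one syllable to phoneme-like units; unknown syllables get a placeholder."""
    return TOY_PHONEMES.get(nfc(syllable.lower()), ["<unk>"])


def bag_of_phonemes(text):
    """Count phoneme-like units over all whitespace-separated syllables."""
    bag = Counter()
    for syllable in text.split():
        bag.update(to_phonemes(syllable))
    return bag


def bag_of_syllables(text):
    """Baseline: count whole syllables, as in a syllable-based representation."""
    return Counter(nfc(s.lower()) for s in text.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# "nam" and "văn" share no syllable, but both carry the level tone,
# so only the phoneme bags overlap.
syllable_sim = cosine(bag_of_syllables("nam"), bag_of_syllables("văn"))
phoneme_sim = cosine(bag_of_phonemes("nam"), bag_of_phonemes("văn"))
```

In this toy setting the two syllables are orthogonal as syllable features yet overlap as phoneme features, which is the kind of sound-based signal the abstract describes. The resulting count vectors could then be passed to any standard classifier such as NB, KNN, or an SVM.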