On Machine Learning Methods for Chinese Document Categorization

Authors:
Ji He;Ah-Hwee Tan;Chew-Lim Tan
Affiliations:
School of Computing, National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260 heji@comp.nus.edu.sg;Nanyang Technological University, School of Computer Engineering, Blk N4, 2A-13 Nanyang Avenue, Singapore 639798. asahtan@ntu.edu.sg;School of Computing, National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260. tancl@comp.nus.edu.sg
Venue:
Applied Intelligence
Year:
2003


Abstract

This paper reports our comparative evaluation of three machine learning methods, namely k Nearest Neighbor (kNN), Support Vector Machines (SVM), and Adaptive Resonance Associative Map (ARAM), for Chinese document categorization. Based on two Chinese corpora, a series of controlled experiments evaluated their learning capabilities and efficiency in mining text classification knowledge. Benchmark experiments showed that their predictive performance was roughly comparable, especially on clean and well-organized data sets. While kNN and ARAM yielded better performance than SVM on small and clean data sets, SVM and ARAM significantly outperformed kNN on noisy data. In terms of efficiency, kNN was notably more costly in time and memory than the other two methods. SVM was highly efficient in learning from well-organized samples of moderate size, although on relatively large and noisy data the efficiency of SVM and ARAM was comparable.