In this paper, we perform Chinese text classification using an n-gram text representation on TanCorp, a new large corpus built specifically for Chinese text classification that contains more than 14,000 texts divided into 12 classes. We represent documents with different n-gram feature sets (1- and 2-grams, or 1-, 2- and 3-grams) and compare different feature weights (absolute text frequency, relative text frequency, absolute n-gram frequency and relative n-gram frequency). We analyze the sparseness of the "document by feature" matrices in each case. For classification we use C-SVC, an SVM algorithm that handles the multi-class task, and we run our experiments on the TANAGRA platform. We find that the feature selection methods based on n-gram frequency (absolute or relative) always give better results and produce denser matrices.
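As an illustration of the representation described above, the following minimal sketch builds a "document by feature" matrix from 1- and 2-grams and measures how dense it is. It is not the authors' pipeline: it assumes character-level n-grams, it reads "absolute n-gram frequency" as the raw count of an n-gram in a document and "relative n-gram frequency" as that count divided by the document's total number of n-grams, and the helper names (`char_ngrams`, `build_matrix`, `density`) and the two toy documents are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n_values=(1, 2)):
    """Return every character n-gram of the requested sizes (assumed character-level)."""
    return [text[i:i + n] for n in n_values for i in range(len(text) - n + 1)]


def build_matrix(docs, n_values=(1, 2), relative=False):
    """Build a dense document-by-feature matrix as a list of rows.

    relative=False: assumed "absolute n-gram frequency" (raw count per document).
    relative=True:  assumed "relative n-gram frequency" (count / n-grams in the document).
    """
    counts = [Counter(char_ngrams(d, n_values)) for d in docs]
    vocab = sorted(set().union(*counts))  # one column per distinct n-gram
    rows = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid division by zero for empty documents
        rows.append([c[g] / total if relative else c[g] for g in vocab])
    return vocab, rows


def density(rows):
    """Fraction of non-zero cells; sparseness is 1 minus this value."""
    cells = sum(len(r) for r in rows)
    nonzero = sum(1 for r in rows for v in r if v != 0)
    return nonzero / cells if cells else 0.0


# Hypothetical toy documents, not taken from TanCorp.
toy_docs = ["中文文本分类", "文本分类实验"]
vocab, absolute_rows = build_matrix(toy_docs, n_values=(1, 2))
_, relative_rows = build_matrix(toy_docs, n_values=(1, 2), relative=True)
print(len(vocab), density(absolute_rows))
```

Passing `n_values=(1, 2, 3)` gives the second feature set mentioned in the abstract. On a real corpus the matrix would normally be stored in a sparse format rather than as nested lists, and the rows would then be handed to an SVM classifier.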