Feature selection on Chinese text classification using character n-grams

Authors:
Zhihua Wei;Duoqian Miao;Jean-Hugues Chauchat;Caiming Zhong
Affiliations:
Tongji University, Key laboratory "Embedded System and Service Computing", Ministry of Education, Shanghai, China and Université Lumière Lyon 2, Laboratoire ERIC, Bron Cedex, France;Tongji University, Key laboratory "Embedded System and Service Computing", Ministry of Education, Shanghai, China;Université Lumière Lyon 2, Laboratoire ERIC, Bron Cedex, France;Tongji University, Key laboratory "Embedded System and Service Computing", Ministry of Education, Shanghai, China
Venue:
RSKT'08 Proceedings of the 3rd international conference on Rough sets and knowledge technology
Year:
2008

Abstract

In this paper, we perform Chinese text classification using an n-gram text representation on TanCorp, a new large corpus built specifically for Chinese text classification that contains more than 14,000 texts divided into 12 classes. We use different sets of n-gram features (1- and 2-grams, or 1-, 2- and 3-grams) to represent documents. Four feature weights are compared: absolute text frequency, relative text frequency, absolute n-gram frequency and relative n-gram frequency. The sparseness of the "document by feature" matrices is analyzed in each case. We use the C-SVC classifier, the SVM algorithm designed for the multi-class classification task, and run our experiments on the TANAGRA platform. We find that the feature selection methods based on n-gram frequency (absolute or relative) always give better results and produce denser matrices.
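The representation described in the abstract can be illustrated with a minimal sketch. The abstract does not define the four weights, so the definitions below are assumptions: text frequency is taken as the number (or share) of documents containing an n-gram, and n-gram frequency as its number (or share) of occurrences in the whole corpus. The function names `char_ngrams`, `frequency_weights` and `sparseness`, and the three toy documents, are hypothetical and are not taken from the paper or from TanCorp.

```python
from collections import Counter


def char_ngrams(text, n_values=(1, 2)):
    """Return all overlapping character n-grams of the requested lengths."""
    chars = [c for c in text if not c.isspace()]
    grams = []
    for n in n_values:
        grams.extend("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))
    return grams


def frequency_weights(docs, n_values=(1, 2)):
    """Compute four corpus-level weights per n-gram (assumed definitions)."""
    per_doc = [Counter(char_ngrams(d, n_values)) for d in docs]
    text_freq = Counter()   # number of documents containing the n-gram
    ngram_freq = Counter()  # total occurrences of the n-gram in the corpus
    for counts in per_doc:
        text_freq.update(counts.keys())
        ngram_freq.update(counts)
    n_docs = len(docs)
    total = sum(ngram_freq.values())
    weights = {
        g: {
            "abs_text": text_freq[g],
            "rel_text": text_freq[g] / n_docs,
            "abs_ngram": ngram_freq[g],
            "rel_ngram": ngram_freq[g] / total,
        }
        for g in ngram_freq
    }
    return per_doc, weights


def sparseness(per_doc, features):
    """Fraction of zero cells in the document-by-feature matrix."""
    cells = len(per_doc) * len(features)
    if cells == 0:
        return 0.0
    nonzero = sum(1 for counts in per_doc for f in features if counts[f] > 0)
    return 1 - nonzero / cells


# Toy corpus (hypothetical, for illustration only).
docs = ["中国经济", "中国足球", "足球比赛"]
per_doc, weights = frequency_weights(docs)

# Keep the n-grams with the highest absolute n-gram frequency as features.
selected = sorted(weights, key=lambda g: weights[g]["abs_ngram"], reverse=True)[:5]
matrix_sparseness = sparseness(per_doc, selected)
```

Ranking by one of the four weights and keeping the top entries is one simple way to turn a weight into a feature selection rule; the selection procedure actually used in the paper may differ. The resulting document-by-feature counts could then be passed to any multi-class SVM implementation.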