Feature selection on Chinese text classification using character n-grams

  • Authors:
  • Zhihua Wei;Duoqian Miao;Jean-Hugues Chauchat;Caiming Zhong

  • Affiliations:
  • Tongji University, Key laboratory "Embedded System and Service Computing", Ministry of Education, Shanghai, China and Université Lumière Lyon 2, Laboratoire ERIC, Bron Cedex, France;Tongji University, Key laboratory "Embedded System and Service Computing", Ministry of Education, Shanghai, China;Université Lumière Lyon 2, Laboratoire ERIC, Bron Cedex, France;Tongji University, Key laboratory "Embedded System and Service Computing", Ministry of Education, Shanghai, China

  • Venue:
  • RSKT'08 Proceedings of the 3rd international conference on Rough sets and knowledge technology
  • Year:
  • 2008

Abstract

In this paper, we perform Chinese text classification using an n-gram text representation on TanCorp, a new large corpus designed for Chinese text classification that contains more than 14,000 texts divided into 12 classes. We use different sets of n-gram features (1- and 2-grams, or 1-, 2- and 3-grams) to represent documents. We compare different feature weights: absolute text frequency, relative text frequency, absolute n-gram frequency and relative n-gram frequency. The sparseness of the "document by feature" matrices is analyzed in each case. We use the C-SVC classifier, an SVM algorithm designed for multi-class classification, and run our experiments on the TANAGRA platform. We find that feature selection methods based on n-gram frequency (absolute or relative) always give better results and produce denser matrices.
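The abstract does not define the four feature weights, so the following is only a minimal sketch under stated assumptions, not the authors' implementation (their experiments were run in TANAGRA). It assumes that "text frequency" means the number of documents containing an n-gram, that "n-gram frequency" means its total number of occurrences in the corpus, and that the relative variants divide by the number of documents and by the total number of n-gram occurrences respectively. The function names `char_ngrams`, `feature_statistics` and `sparseness` are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n_values=(1, 2)):
    """Return all overlapping character n-grams of the requested lengths.

    Whitespace is dropped first (an assumption; the abstract does not
    describe any preprocessing).
    """
    chars = [c for c in text if not c.isspace()]
    grams = []
    for n in n_values:
        grams.extend("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))
    return grams


def feature_statistics(docs, n_values=(1, 2)):
    """Compute four candidate weights per n-gram (assumed definitions)."""
    ngram_freq = Counter()  # total occurrences over the whole corpus
    text_freq = Counter()   # number of documents containing the n-gram
    for doc in docs:
        grams = char_ngrams(doc, n_values)
        ngram_freq.update(grams)
        text_freq.update(set(grams))
    total_ngrams = sum(ngram_freq.values())
    n_docs = len(docs)
    return {
        gram: {
            "abs_text_freq": text_freq[gram],
            "rel_text_freq": text_freq[gram] / n_docs,
            "abs_ngram_freq": ngram_freq[gram],
            "rel_ngram_freq": ngram_freq[gram] / total_ngrams,
        }
        for gram in ngram_freq
    }


def sparseness(docs, features, n_values=(1, 2)):
    """Fraction of zero cells in the document-by-feature count matrix."""
    if not docs or not features:
        return 0.0
    zeros = 0
    for doc in docs:
        present = set(char_ngrams(doc, n_values))
        zeros += sum(1 for f in features if f not in present)
    return zeros / (len(docs) * len(features))
```

Under these assumptions, feature selection would amount to ranking n-grams by one of the four weights and keeping the top-ranked ones; `sparseness` can then be evaluated on the retained features to compare how dense the resulting matrices are.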