A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization

  • Authors:
  • Jieming Yang;Yuanning Liu;Xiaodong Zhu;Zhen Liu;Xiaoxu Zhang

  • Affiliations:
  • College of Computer Science and Technology, Jilin University, Changchun, Jilin, China and School of Information Engineering, Northeast Dianli University, Jilin, Jilin, China;College of Computer Science and Technology, Jilin University, Changchun, Jilin, China;College of Computer Science and Technology, Jilin University, Changchun, Jilin, China;College of Computer Science and Technology, Jilin University, Changchun, Jilin, China and Graduate School of Engineering, Nagasaki Institute of Applied Science, Nagasaki-shi, Nagasaki, Japan;College of Computer Science and Technology, Jilin University, Changchun, Jilin, China

  • Venue:
  • Information Processing and Management: an International Journal
  • Year:
  • 2012

Abstract

Feature selection, which can reduce the dimensionality of the vector space without sacrificing classifier performance, is widely used in text categorization. In this paper, we propose a new feature selection algorithm, named CMFS, which comprehensively measures the significance of a term both across categories (inter-category) and within a category (intra-category). We evaluated CMFS on three benchmark document collections, 20-Newsgroups, Reuters-21578 and WebKB, using two classification algorithms, Naive Bayes (NB) and Support Vector Machines (SVMs). The experimental results, which compare CMFS with six well-known feature selection algorithms, show that CMFS is significantly superior to Information Gain (IG), Chi statistic (CHI), Document Frequency (DF), Orthogonal Centroid Feature Selection (OCFS) and DIA association factor (DIA) when the Naive Bayes classifier is used, and significantly outperforms IG, DF, OCFS and DIA when Support Vector Machines are used.
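
The abstract does not give the CMFS formula, so the following is only a minimal sketch of the general idea of scoring a term by combining an intra-category measure with an inter-category measure. It assumes, purely for illustration, that the two measures are the smoothed estimates of P(term | category) and P(category | term) and that they are combined by multiplication; the function names `combined_scores` and `select_top_k`, the add-one smoothing, and the toy corpus are all hypothetical and should not be read as the authors' definition.

```python
from collections import Counter, defaultdict


def combined_scores(docs, labels):
    """Score each (term, category) pair as P(term | category) * P(category | term).

    Assumption: the product of these two smoothed estimates is an
    illustrative choice; the abstract does not state the actual formula.
    """
    # tf[category][term] = number of occurrences of term in that category
    tf = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        tf[label].update(tokens)

    # term_total[term] = number of occurrences of term in the whole corpus
    term_total = Counter()
    for counts in tf.values():
        term_total.update(counts)

    vocab_size = len(term_total)
    n_categories = len(tf)

    scores = {}
    for category, counts in tf.items():
        category_total = sum(counts.values())
        for term in term_total:
            # Intra-category view: how prominent is the term inside this category?
            p_term_given_cat = (counts[term] + 1) / (category_total + vocab_size)
            # Inter-category view: how concentrated is the term in this category?
            p_cat_given_term = (counts[term] + 1) / (term_total[term] + n_categories)
            scores[(term, category)] = p_term_given_cat * p_cat_given_term
    return scores


def select_top_k(scores, k):
    """Keep the k terms with the highest score over all categories."""
    best = {}
    for (term, _category), score in scores.items():
        best[term] = max(best.get(term, 0.0), score)
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(best, key=lambda t: (-best[t], t))[:k]


# Toy corpus (hypothetical data, already tokenized).
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "score"],
    ["stock", "market", "price"],
    ["market", "stock", "team"],
]
labels = ["sport", "sport", "finance", "finance"]

scores = combined_scores(docs, labels)
selected = select_top_k(scores, 3)
```

In this toy corpus, "team" occurs in both categories, so its inter-category factor is lower than that of terms confined to a single category such as "goal" or "stock", and it ranks below them even though it is just as frequent within "sport".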