A General Framework of Feature Selection for Text Categorization

  • Authors:
  • Hongfang Jing
  • Bin Wang
  • Yahui Yang
  • Yan Xu

  • Affiliations:
  • Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China 100190 and Graduate University, Chinese Academy of Sciences, Beijing, China 100080
  • Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China 100190
  • School of Software & Microelectronics, Peking University, Beijing, China 102600
  • Center of Network Information and Education Technology, Beijing Language and Culture University, Beijing, China 100083

  • Venue:
  • MLDM '09 Proceedings of the 6th International Conference on Machine Learning and Data Mining in Pattern Recognition
  • Year:
  • 2009


Abstract

Many feature selection methods have been proposed for text categorization, but their performance is usually verified only empirically, so the conclusions depend on the corpora used and may not generalize. This paper proposes a novel feature selection framework, Distribution-Based Feature Selection (DBFS), built on the distribution differences of features. The framework generalizes most state-of-the-art feature selection methods, including OCFS, MI, ECE, IG, CHI and OR. Using the components of this framework, the performance of many feature selection methods can be estimated by theoretical analysis rather than by experiment alone. DBFS also sheds light on the merits and drawbacks of many existing feature selection methods, and it helps in choosing a suitable feature selection method for a specific domain. Moreover, a weighted model based on DBFS is given, from which feature selection methods suited to unbalanced datasets can be derived. Experimental results show that the derived methods are more effective than CHI, IG and OCFS on both balanced and unbalanced datasets.
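The abstract does not reproduce the scoring formulas of the methods DBFS generalizes. As a concrete illustration of one of them, the sketch below computes the standard CHI (chi-square) statistic for a term/class pair from a 2x2 contingency table; the function name and table layout are illustrative choices, not notation from this paper.

```python
def chi_square(A, B, C, D):
    """CHI statistic for term t and class c from a 2x2 contingency table.

    A: docs in class c that contain t
    B: docs outside c that contain t
    C: docs in class c that lack t
    D: docs outside c that lack t
    """
    N = A + B + C + D
    numerator = N * (A * D - B * C) ** 2
    denominator = (A + B) * (C + D) * (A + C) * (B + D)
    return numerator / denominator if denominator else 0.0

# A term perfectly correlated with the class scores high...
print(chi_square(2, 0, 0, 2))  # -> 4.0
# ...while a term distributed independently of the class scores zero.
print(chi_square(1, 1, 1, 1))  # -> 0.0
```

In a selection pipeline, such per-term scores are typically computed against each class and the top-k terms are kept; DBFS, as described in the abstract, analyzes this family of scores through the distribution differences they measure.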