OCFS: optimal orthogonal centroid feature selection for text categorization

  • Authors:
  • Jun Yan; Ning Liu; Benyu Zhang; Shuicheng Yan; Zheng Chen; Qiansheng Cheng; Weiguo Fan; Wei-Ying Ma

  • Affiliations:
  • Peking University, Beijing, P. R. China; Tsinghua University, Beijing, P. R. China; Microsoft Research Asia, Beijing, P. R. China; Chinese University of Hong Kong, Hong Kong; Microsoft Research Asia, Beijing, P. R. China; Peking University, Beijing, P. R. China; Virginia Polytechnic Institute and State University, Blacksburg, VA; Microsoft Research Asia, Beijing, P. R. China

  • Venue:
  • Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2005

Quantified Score

Hi-index 0.01

Abstract

Text categorization is an important research area in many Information Retrieval (IR) applications. To save storage space and computation time in text categorization, efficient and effective algorithms for reducing the data before analysis are highly desirable. Traditional techniques for this purpose can generally be classified into feature extraction and feature selection. Because of its efficiency, the latter is more suitable for text data such as web documents. However, many popular feature selection techniques, such as Information Gain (IG) and the χ²-test (CHI), are greedy in nature and thus may not be optimal according to some criterion. Moreover, the performance of these greedy methods may deteriorate when the retained data dimension is extremely low. In this paper, we propose an efficient optimal feature selection algorithm, called Orthogonal Centroid Feature Selection (OCFS), which optimizes the objective function of the Orthogonal Centroid (OC) subspace learning algorithm in a discrete solution space. Experiments on 20 Newsgroups (20NG), Reuters Corpus Volume 1 (RCV1) and Open Directory Project (ODP) data show that OCFS is consistently better than IG and CHI while requiring less computation time, especially when the reduced dimension is extremely small.
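
The abstract does not state the scoring rule itself, so the following is only a rough sketch of how centroid-based feature selection of this kind might look. It assumes each feature is scored by the class-size-weighted squared difference between each class centroid and the global centroid along that feature, and that the k highest-scoring features are kept. The function names `ocfs_scores` and `ocfs_select`, the dense list-of-lists input, and the toy data are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List, Sequence


def ocfs_scores(X: Sequence[Sequence[float]], y: Sequence[int]) -> List[float]:
    """Score each feature (assumed rule, see lead-in):
    sum over classes of (n_j / n) * (class_centroid[i] - global_centroid[i]) ** 2.
    """
    n = len(X)
    d = len(X[0])

    # Centroid of all training documents, one value per feature.
    global_centroid = [sum(row[i] for row in X) / n for i in range(d)]

    # Group document vectors by class label.
    by_class: Dict[int, List[Sequence[float]]] = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)

    scores = [0.0] * d
    for rows in by_class.values():
        n_j = len(rows)
        for i in range(d):
            class_centroid_i = sum(r[i] for r in rows) / n_j
            scores[i] += (n_j / n) * (class_centroid_i - global_centroid[i]) ** 2
    return scores


def ocfs_select(X: Sequence[Sequence[float]], y: Sequence[int], k: int) -> List[int]:
    """Return the indices of the k highest-scoring features, in ascending order."""
    scores = ocfs_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data (hypothetical): features 0 and 1 separate the two classes,
# feature 2 has the same mean in both classes.
X_toy = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
y_toy = [0, 0, 1, 1]
selected = ocfs_select(X_toy, y_toy, k=2)
```

Under this assumed rule, every feature is scored independently in a single pass over the data and then ranked once, which fits the abstract's claim of low computation time. A feature whose class centroids all coincide with the global centroid, such as feature 2 in the toy data, receives a score of zero.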