An efficient feature ranking measure for text categorization

  • Authors:
  • Songbo Tan; Yuefen Wang; Xueqi Cheng

  • Affiliations:
  • Chinese Academy of Sciences, China; Chinese Academy of Geological Sciences, China; Chinese Academy of Sciences, China

  • Venue:
  • Proceedings of the 2008 ACM symposium on Applied computing
  • Year:
  • 2008

Abstract

A major obstacle that degrades the performance of text classifiers is the extremely high dimensionality of text data. To reduce the dimensionality, a number of approaches based on rough-set theory have been proposed. However, these approaches often suffer from two problems: first, they cannot directly handle continuous text features; second, they often incur considerable running time. To address the first issue, we extend the discernibility matrix so that it can work with continuous features. To cut down running time, we use centroids rather than examples to construct the discernibility matrix, which reduces the time complexity from O(T^2 W) to O(K^2 W), where T denotes the number of training examples, K denotes the number of training classes, and W denotes the size of the vocabulary. The experimental results indicate that the proposed method not only yields much higher accuracy than Information Gain when fewer than 6000 features are selected, but also requires much less CPU time than Information Gain.
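The centroid idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the ranking formula, so the scoring rule below (the summed absolute difference between centroid weights over all class pairs) is an assumption chosen only for illustration, not the authors' measure. The function names `class_centroids` and `rank_features` and the toy data are likewise hypothetical. What the sketch does show is where the cost saving comes from: the pairwise comparison runs over K class centroids rather than T training examples, so that loop costs O(K^2 W), after a single pass over the examples to build the centroids.

```python
from collections import defaultdict
from itertools import combinations


def class_centroids(docs, labels):
    """Average the feature vectors of each class into one centroid."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(docs, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        for j, value in enumerate(vec):
            sums[label][j] += value
        counts[label] += 1
    return {c: [s / counts[c] for s in total] for c, total in sums.items()}


def rank_features(docs, labels):
    """Return feature indices ordered from most to least discriminative.

    ASSUMPTION: the score is the summed absolute difference between
    centroid weights over all class pairs. This is a stand-in for the
    paper's discernibility-based measure, which the abstract does not state.
    """
    centroids = class_centroids(docs, labels)
    width = len(docs[0])
    scores = [0.0] * width
    # K * (K - 1) / 2 centroid pairs, W features each: O(K^2 W) overall.
    for a, b in combinations(sorted(centroids), 2):
        for j in range(width):
            scores[j] += abs(centroids[a][j] - centroids[b][j])
    return sorted(range(width), key=lambda j: scores[j], reverse=True)


# Toy data: 4 documents, 3 continuous features (e.g. TF-IDF weights).
docs = [
    [0.9, 0.1, 0.5],
    [0.7, 0.1, 0.5],
    [0.1, 0.2, 0.5],
    [0.3, 0.2, 0.5],
]
labels = ["sports", "sports", "finance", "finance"]
ranking = rank_features(docs, labels)
print(ranking)
```

In this toy data the first feature separates the two class centroids most strongly and the third not at all, so the ranking puts feature 0 first and feature 2 last. Because the features are compared as real-valued centroid weights, no discretization step is needed, which mirrors the abstract's point about handling continuous features directly.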