Ambiguity measure feature-selection algorithm

  • Authors:
  • Saket S. R. Mengle;Nazli Goharian

  • Affiliations:
  • Information Retrieval Lab, Illinois Institute of Technology, Chicago, IL 60616;Information Retrieval Lab, Illinois Institute of Technology, Chicago, IL 60616

  • Venue:
  • Journal of the American Society for Information Science and Technology
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

With the increasing number of digital documents, the ability to automatically classify those documents both efficiently and accurately is becoming more critical and difficult. One of the major problems in text classification is the high dimensionality of feature space. We present the ambiguity measure (AM) feature-selection algorithm, which selects the most unambiguous features from the feature set. Unambiguous features are those features whose presence in a document indicate a strong degree of confidence that a document belongs to only one specific category. We apply AM feature selection on a naïve Bayes text classifier. We favorably show the effectiveness of our approach in outperforming eight existing feature-selection methods, using five benchmark datasets with a statistical significance of at least 95% confidence. The support vector machine (SVM) text classifier is shown to perform consistently better than the naïve Bayes text classifier. The drawback, however, is the time complexity in training a model. We further explore the effect of using the AM feature-selection method on an SVM text classifier. Our results indicate that the training time for the SVM algorithm can be reduced by more than 50%, while still improving the accuracy of the text classifier. We favorably show the effectiveness of our approach by demonstrating that it statistically significantly (99% confidence) outperforms eight existing feature-selection methods using four standard benchmark datasets. © 2009 Wiley Periodicals, Inc.