Integrating feature and instance selection for text classification

  • Authors:
  • Dimitris Fragoudis;Dimitris Meretakis;Spiros Likothanassis

  • Affiliations:
  • University of Patras, Rio GR-26500, Greece;Zurich Financial Services, Switzerland;University of Patras, Rio GR-26500, Greece

  • Venue:
  • Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
  • Year:
  • 2002

Quantified Score

Hi-index 0.00

Visualization

Abstract

Instance selection and feature selection are two orthogonal methods for reducing the amount and complexity of data. Feature selection aims at the reduction of redundant features in a dataset whereas instance selection aims at the reduction of the number of instances. So far, these two methods have mostly been considered in isolation. In this paper, we present a new algorithm, which we call FIS (Feature and Instance Selection) that targets both problems simultaneously in the context of text classificationOur experiments on the Reuters and 20-Newsgroups datasets show that FIS considerably reduces both the number of features and the number of instances. The accuracy of a range of classifiers including Naïve Bayes, TAN and LB considerably improves when using the FIS preprocessed datasets, matching and exceeding that of Support Vector Machines, which is currently considered to be one of the best text classification methods. In all cases the results are much better compared to Mutual Information based feature selection. The training and classification speed of all classifiers is also greatly improved.