Oscillating feature subset search algorithm for text categorization

  • Authors:
  • Jana Novovičová; Petr Somol; Pavel Pudil

  • Affiliations:
  • Dept. of Pattern Recognition, Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic; Dept. of Pattern Recognition, Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic; Dept. of Pattern Recognition, Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic

  • Venue:
  • CIARP'06 Proceedings of the 11th Iberoamerican conference on Progress in Pattern Recognition, Image Analysis and Applications
  • Year:
  • 2006

Abstract

A major characteristic of text document categorization problems is the extremely high dimensionality of text data. In this paper we explore the usability of the Oscillating Search algorithm for feature/word selection in text categorization. We propose to use the multiclass Bhattacharyya distance for the multinomial model as the global feature subset selection criterion for reducing the dimensionality of the bag-of-words document representation. This criterion takes inter-feature relationships into account. We experimentally compare three subset selection procedures: the commonly used best individual feature selection based on information gain, best individual feature selection based on the individual Bhattacharyya distance, and Oscillating Search maximizing the Bhattacharyya distance on groups of features. The obtained feature subsets are then tested on the standard Reuters data with two classifiers: multinomial Bayes and linear SVM. The experimental results illustrate that using a non-trivial feature selection algorithm is not only computationally feasible but also brings a substantial improvement in classification accuracy over traditional methods based on individual feature evaluation.
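
The abstract does not spell out the criterion or the search procedure, so the following Python sketch is only a rough illustration of the general idea, not the authors' implementation. It rests on several assumptions: words outside the candidate subset are pooled into a single "other" bin, the multiclass criterion is taken to be a prior-weighted sum of pairwise Bhattacharyya distances, and the search is a simplified oscillating scheme with greedy add and remove steps. All function names, parameters, and the toy data are hypothetical.

```python
import math
from itertools import combinations


def subset_distance(subset, p, q):
    """Bhattacharyya distance between two word distributions restricted to `subset`.

    Assumption (not from the abstract): unselected words are pooled into one
    "other" bin so both restricted distributions still sum to 1.
    Probabilities are assumed smoothed, i.e. strictly positive.
    """
    idx = sorted(subset)
    bc = sum(math.sqrt(p[i] * q[i]) for i in idx)
    rest_p = max(0.0, 1.0 - sum(p[i] for i in idx))  # guard against float noise
    rest_q = max(0.0, 1.0 - sum(q[i] for i in idx))
    bc += math.sqrt(rest_p * rest_q)
    return -math.log(bc)


def multiclass_score(subset, class_probs, priors):
    """Prior-weighted sum of pairwise distances (an assumed multiclass form)."""
    return sum(
        priors[a] * priors[b] * subset_distance(subset, class_probs[a], class_probs[b])
        for a, b in combinations(range(len(priors)), 2)
    )


def _grow(subset, k, n, score):
    """Greedily add k features, each time the one giving the highest score."""
    subset = set(subset)
    for _ in range(k):
        best = max((f for f in range(n) if f not in subset),
                   key=lambda f: score(subset | {f}))
        subset.add(best)
    return subset


def _shrink(subset, k, score):
    """Greedily remove k features, each time the one whose removal hurts least."""
    subset = set(subset)
    for _ in range(k):
        worst = max(sorted(subset), key=lambda f: score(subset - {f}))
        subset.remove(worst)
    return subset


def oscillating_search(initial, n, score, max_depth=2, tol=1e-12):
    """Simplified oscillating search around a fixed subset size len(initial)."""
    current = set(initial)
    best = score(current)
    d = len(current)
    max_depth = min(max_depth, d, n - d)
    depth = 1
    while depth <= max_depth:
        improved = False
        for swing in ("down", "up"):
            if swing == "down":   # remove `depth` features, then add `depth` back
                cand = _grow(_shrink(current, depth, score), depth, n, score)
            else:                 # add `depth` features, then remove `depth`
                cand = _shrink(_grow(current, depth, n, score), depth, score)
            value = score(cand)
            if value > best + tol:
                current, best, improved = cand, value, True
        # Reset the swing depth after an improvement, otherwise search deeper.
        depth = 1 if improved else depth + 1
    return current, best


# Toy data (hypothetical): two classes, five words, smoothed word probabilities.
class_probs = [[0.50, 0.05, 0.20, 0.15, 0.10],
               [0.05, 0.45, 0.20, 0.18, 0.12]]
priors = [0.5, 0.5]
score = lambda s: multiclass_score(s, class_probs, priors)
selected, value = oscillating_search({2, 3}, n=5, score=score)
print(sorted(selected), round(value, 4))
```

Because the criterion is evaluated on the whole subset rather than word by word, the search can trade one word for another when the pair is jointly more discriminative. That is the contrast with best individual feature ranking that the abstract draws.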