Using feature construction to avoid large feature spaces in text classification

  • Authors:
  • Elijah Mayfield;Carolyn Penstein-Rosé

  • Affiliations:
  • Carnegie Mellon University, Pittsburgh, PA, USA;Carnegie Mellon University, Pittsburgh, PA, USA

  • Venue:
  • Proceedings of the 12th annual conference on Genetic and evolutionary computation
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

Feature space design is a critical part of machine learning. This is an especially difficult challenge in the field of text classification, where an arbitrary number of features of varying complexity can be extracted from documents as a preprocessing step. A challenge for researchers has consistently been to balance expressiveness of features with the size of the corresponding feature space, due to issues with data sparsity that arise as feature spaces grow larger. Drawing on past successes utilizing genetic programming in similar problems outside of text classification, we propose and implement a technique for constructing complex features from simpler features, and adding these more complex features into a combined feature space which can then be utilized by more sophisticated machine learning classifiers. Applying this technique to a sentiment analysis problem, we show encouraging improvement in classification accuracy, with a small and constant increase in feature space size. We also show that the features we generate carry far more predictive power than any of the simple features they contain.