Using feature construction to avoid large feature spaces in text classification

Authors:
Elijah Mayfield; Carolyn Penstein-Rosé
Affiliations:
Carnegie Mellon University, Pittsburgh, PA, USA; Carnegie Mellon University, Pittsburgh, PA, USA
Venue:
Proceedings of the 12th annual conference on Genetic and evolutionary computation
Year:
2010

Abstract

Feature space design is a critical part of machine learning. It is especially difficult in text classification, where an arbitrary number of features of varying complexity can be extracted from documents as a preprocessing step. Researchers have consistently had to balance the expressiveness of features against the size of the corresponding feature space, because data sparsity worsens as feature spaces grow larger. Drawing on past successes of genetic programming in similar problems outside of text classification, we propose and implement a technique that constructs complex features from simpler ones and adds them to a combined feature space, which can then be used by more sophisticated machine learning classifiers. Applying this technique to a sentiment analysis problem, we show an encouraging improvement in classification accuracy with a small and constant increase in feature space size. We also show that the features we generate carry far more predictive power than any of the simple features they contain.
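
The general idea of building a few complex features out of simple ones and appending them to the original feature space can be sketched as follows. This is a minimal illustration and not the authors' implementation: the toy corpus, the boolean operator set (AND, OR, NOT), the agreement-based fitness function, and every function name are assumptions made for the example, and plain random search stands in for the crossover and mutation steps of a real genetic programming loop.

```python
import random

# Toy corpus invented for illustration: (tokens, label), 1 = positive, 0 = negative.
DOCS = [
    ("great acting and a great plot".split(), 1),
    ("not great and not even good".split(), 0),
    ("good fun".split(), 1),
    ("not good".split(), 0),
    ("a dull plot".split(), 0),
    ("great fun".split(), 1),
]
VOCAB = sorted({tok for tokens, _ in DOCS for tok in tokens})


def evaluate(tree, tokens):
    """Evaluate a boolean expression tree over unigram-presence features."""
    if isinstance(tree, str):  # leaf: a simple feature (is the word present?)
        return tree in tokens
    op, *args = tree
    if op == "NOT":
        return not evaluate(args[0], tokens)
    left, right = (evaluate(a, tokens) for a in args)
    return (left and right) if op == "AND" else (left or right)


def random_tree(rng, depth=2):
    """Sample a random expression tree (the operator set is an assumption)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(VOCAB)
    op = rng.choice(["AND", "OR", "NOT"])
    if op == "NOT":
        return ("NOT", random_tree(rng, depth - 1))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def fitness(tree):
    """Fraction of documents where the feature's value agrees with the label."""
    hits = sum(evaluate(tree, tokens) == bool(label) for tokens, label in DOCS)
    return hits / len(DOCS)


def construct_features(k=2, candidates=200, seed=0):
    """Keep the k fittest candidates; random search stands in for a GP loop."""
    rng = random.Random(seed)
    pool = [random_tree(rng) for _ in range(candidates)]
    return sorted(pool, key=fitness, reverse=True)[:k]


def feature_matrix(constructed):
    """One row per document: unigram columns plus one column per constructed feature."""
    return [
        [int(w in tokens) for w in VOCAB]
        + [int(evaluate(t, tokens)) for t in constructed]
        for tokens, _ in DOCS
    ]


if __name__ == "__main__":
    best = construct_features()
    for tree in best:
        print(tree, round(fitness(tree), 2))
    print(len(feature_matrix(best)[0]), "columns")
```

Because only `k` constructed columns are appended, the feature space grows by a fixed amount regardless of vocabulary size, which mirrors the "small and constant increase" described in the abstract. A hand-written tree such as `("AND", "great", ("NOT", "not"))` shows how a constructed feature can separate "great" from "not great" in a way that neither unigram does alone.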