Improving binary classification on text problems using differential word features

Authors:
Justin Martineau;Tim Finin;Anupam Joshi;Shamit Patel
Affiliations:
University of Maryland - Baltimore County, Baltimore, MD, USA;University of Maryland - Baltimore County, Baltimore, MD, USA;University of Maryland - Baltimore County, Baltimore, MD, USA;University of Maryland - Baltimore County, Baltimore, MD, USA
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Abstract

We describe an efficient technique to weight word-based features in binary classification tasks and show that it significantly improves classification accuracy on a range of problems. The most common text classification approach uses a document's ngrams (words and short phrases) as its features and assigns feature values equal to their frequency or TFIDF score relative to the training corpus. Our approach uses values computed as the product of an ngram's document frequency and the difference of its inverse document frequencies in the positive and negative training sets. While this technique is remarkably easy to implement, it gives a statistically significant improvement over the standard bag-of-words approaches using support vector machines on a range of classification tasks. Our results show that our technique is robust and broadly applicable. We provide an analysis of why the approach works and how it can generalize to other domains and problems.
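
The weighting described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not state: "document frequency" is read here as the number of times the ngram occurs in the document being featurized, the inverse document frequencies use add-one smoothing and a base-2 logarithm, and the difference is taken as positive-set IDF minus negative-set IDF. The function names `doc_freq`, `differential_weights`, and `featurize` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def doc_freq(docs):
    """Count, for each term, how many documents in `docs` contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts a term at most once
    return df


def differential_weights(pos_docs, neg_docs):
    """Per-term difference of IDFs in the positive and negative training sets.

    Assumption: add-one smoothing and log base 2; the abstract does not
    specify either choice.
    """
    df_pos, df_neg = doc_freq(pos_docs), doc_freq(neg_docs)
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    weights = {}
    for term in set(df_pos) | set(df_neg):
        idf_pos = math.log2((n_pos + 1) / (df_pos[term] + 1))
        idf_neg = math.log2((n_neg + 1) / (df_neg[term] + 1))
        weights[term] = idf_pos - idf_neg
    return weights


def featurize(tokens, weights):
    """Feature value = count of the term in this document * its weight.

    Assumption: terms unseen in training are dropped.
    """
    counts = Counter(tokens)
    return {t: c * weights[t] for t, c in counts.items() if t in weights}
```

With this sign convention, a term concentrated in the positive set gets a negative weight, a term concentrated in the negative set gets a positive weight, and a term spread evenly across both sets gets a weight near zero. The resulting feature dictionaries could then be passed to any linear classifier, such as a support vector machine.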