Compact features for sentiment analysis

Authors:
Lisa Gaudette; Nathalie Japkowicz
Affiliations:
School of Information Technology & Engineering, University of Ottawa, Ottawa, Ontario, Canada; School of Information Technology & Engineering, University of Ottawa, Ottawa, Ontario, Canada
Venue:
Canadian AI'11 Proceedings of the 24th Canadian conference on Advances in artificial intelligence
Year:
2011

Abstract

This work examines a novel method of developing features for machine learning approaches to sentiment analysis and related tasks. Sentiment analysis is frequently approached using a "bag of words" representation (one feature for each word encountered in the training data), which can easily involve thousands of features. This paper describes a set of compact features developed by learning scores for words, dividing the range of possible scores into a number of bins, and then generating features based on the distribution of scored words in the document over the bins. This allows for effective learning of sentiment and related tasks with 25 features; in fact, performance was very often slightly better with these features than with a simple bag of words baseline. This vast reduction in the number of features considerably reduces training time on large datasets and makes it possible to use much larger datasets than previously attempted with bag of words approaches, which improves performance.
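
The binning idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how word scores are learned, how many bins are used, or exactly which features are derived from the distribution, so everything here is an assumption made for illustration: `learn_word_scores` uses a hypothetical score (the normalized difference between positive and negative document counts, in [-1, 1]), `bin_features` uses equal-width bins, and each feature is simply the fraction of scored words falling into a bin.

```python
from collections import Counter


def learn_word_scores(docs, labels):
    """Assumed scoring scheme (not taken from the paper): for each word,
    (pos_docs - neg_docs) / (pos_docs + neg_docs), which lies in [-1, 1]."""
    pos, neg = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        target = pos if label == 1 else neg
        target.update(set(tokens))  # count each word once per document
    vocab = set(pos) | set(neg)
    return {w: (pos[w] - neg[w]) / (pos[w] + neg[w]) for w in vocab}


def bin_features(tokens, scores, n_bins=5, lo=-1.0, hi=1.0):
    """Split [lo, hi] into n_bins equal-width bins and return the fraction
    of the document's scored words that fall into each bin."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for w in tokens:
        if w not in scores:
            continue  # words unseen in training carry no score
        # Clamp so that a score equal to hi lands in the last bin.
        idx = min(int((scores[w] - lo) / width), n_bins - 1)
        counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * n_bins
    return [c / total for c in counts]


# Toy training data: label 1 = positive, 0 = negative.
train_docs = [["good", "great"], ["bad", "awful"], ["good", "bad"]]
train_labels = [1, 0, 1]
scores = learn_word_scores(train_docs, train_labels)
features = bin_features(["good", "bad", "unknown"], scores, n_bins=5)
```

Under these assumptions, the length of the feature vector depends only on `n_bins`, not on the vocabulary size, which is the property that keeps the representation compact compared with a bag of words.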