Baselines and bigrams: simple, good sentiment and topic classification

Authors:
Sida Wang;Christopher D. Manning
Affiliations:
Stanford University, Stanford, CA;Stanford University, Stanford, CA
Venue:
ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2
Year:
2012

Citing 10
Cited 2

Less is More: Active Learning with Support Vector Machines

ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Mining and summarizing customer reviews

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Using appraisal groups for sentiment analysis

Proceedings of the 14th ACM international conference on Information and knowledge management
A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts

ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
A unified architecture for natural language processing: deep neural networks with multitask learning

Proceedings of the 25th international conference on Machine learning
LIBLINEAR: A Library for Large Linear Classification

The Journal of Machine Learning Research
Dependency tree-based sentiment classification using CRFs with hidden variables

HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Learning word vectors for sentiment analysis

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Semi-supervised recursive autoencoders for predicting sentiment distributions

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing

Automatic requirement categorization of large natural language specifications at mercedes-benz for review improvements

REFSQ'13 Proceedings of the 19th international conference on Requirements Engineering: Foundation for Software Quality
Exploiting discourse information to identify paraphrases

Expert Systems with Applications: An International Journal

Quantified Score

Hi-index	0.00

Visualization

Abstract

Variants of Naive Bayes (NB) and Support Vector Machines (SVM) are often used as baseline methods for text classification, but their performance varies greatly depending on the model variant, features used and task/dataset. We show that: (i) the inclusion of word bigram features gives consistent gains on sentiment analysis tasks; (ii) for short snippet sentiment tasks, NB actually does better than SVMs (while for longer documents the opposite result holds); (iii) a simple but novel SVM variant using NB log-count ratios as feature values consistently performs well across tasks and datasets. Based on these observations, we identify simple NB and SVM variants which outperform most published results on sentiment analysis datasets, sometimes providing a new state-of-the-art performance level.