Data representation in machine learning-based sentiment analysis of customer reviews

Authors:
Ivan Shamshurin
Affiliations:
National Research University - Higher School of Economics, School of Applied Mathematics and Informatics, Moscow, Russia
Venue:
PReMI'11 Proceedings of the 4th international conference on Pattern recognition and machine intelligence
Year:
2011

Abstract

In this paper, we consider the problem of extracting opinions from natural language texts, which is one of the tasks of sentiment analysis. We provide an overview of existing approaches to sentiment analysis, including supervised (Naive Bayes, maximum entropy, and SVM) and unsupervised machine learning methods. We apply three supervised learning methods (Naive Bayes, KNN, and a method based on the Jaccard index) to a dataset of Internet user reviews about cars and report the results. When learning a user's opinion on a specific feature of a car, such as speed or comfort, it turns out that training on full unprocessed reviews decreases the classification accuracy. We experiment with different approaches to preprocessing reviews in order to obtain representations that are relevant to the feature one wants to learn, and we show the effect of each representation on the accuracy of classification.
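
The abstract does not say how the Jaccard-based method or the preprocessing works, so the following is only an illustrative sketch under stated assumptions, not the paper's implementation. It assumes that a review is reduced to the sentences mentioning a feature keyword, that each text is represented as a set of lowercase words, and that a review receives the label of the most Jaccard-similar training review. The keyword set, the toy reviews, and all function names are hypothetical.

```python
import re


def tokens(text):
    """Return the set of lowercase words in a piece of text."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def feature_sentences(review, keywords):
    """Keep only the sentences that mention at least one feature keyword.

    Assumption: this stands in for the feature-specific preprocessing
    mentioned in the abstract; the paper's actual procedure may differ.
    """
    sentences = re.split(r"(?<=[.!?])\s+", review.strip())
    return " ".join(s for s in sentences if tokens(s) & keywords)


def classify(review, training, keywords):
    """Return the label of the training review most similar to `review`.

    Assumption: a 1-nearest-neighbour rule over Jaccard similarity.
    """
    query = tokens(feature_sentences(review, keywords))
    best_label, best_score = None, -1.0
    for text, label in training:
        score = jaccard(query, tokens(feature_sentences(text, keywords)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical keyword set and toy training data for the "comfort" feature.
COMFORT = {"comfort", "comfortable", "seats", "ride"}
TRAINING = [
    ("The seats are soft and the ride is very comfortable. "
     "The engine is weak.", "positive"),
    ("The engine is great. "
     "The seats are hard and the ride is not comfortable at all.", "negative"),
]

if __name__ == "__main__":
    print(classify("Soft seats, comfortable ride.", TRAINING, COMFORT))
```

In this sketch, the sentences about the engine are dropped before comparison, so an opinion about a different feature cannot influence the comfort label. That is one way to read the abstract's observation that training on full unprocessed reviews lowers accuracy.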