Using corpus analysis to inform research into opinion detection in blogs

Authors:
Deanna Osman;John Yearwood;Peter Vamplew
Affiliations:
University of Ballarat, Ballarat Victoria, Australia;University of Ballarat, Ballarat Victoria, Australia;University of Ballarat, Ballarat Victoria, Australia
Venue:
AusDM '07 Proceedings of the sixth Australasian conference on Data mining and analytics - Volume 70
Year:
2007

Abstract

Opinion detection research relies on labeled documents for training data, with labels either assumed from a document's origin or assigned by human assessors who categorise the documents. In recent years, blogs have become a source for opinion identification research (TREC Blog06). This study analyses part-of-speech proportions and word usage within various corpora, determining key differences and similarities that are useful when preparing for opinion identification research. The resulting comparisons between the characteristics of the various corpora are detailed and discussed. In particular, opinion-bearing and non-opinion Blog06 documents were found to display a high level of similarity, indicating that blog documents assessed at the document level cannot be used as training data in opinion identification research.
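
As a rough illustration of the kind of comparison the abstract describes, the sketch below computes part-of-speech proportions for two small documents and compares them. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names, the toy hand-tagged sentences, the Penn Treebank style tags, and the use of cosine similarity as the comparison measure are all hypothetical choices made for this example, since the abstract does not state which measure was used.

```python
from collections import Counter
from math import sqrt


def pos_proportions(tagged_tokens):
    """Return each part-of-speech tag's share of a list of (word, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two tag-proportion dictionaries.

    Assumption: cosine similarity stands in for whatever comparison
    the study actually used.
    """
    tags = set(p) | set(q)
    dot = sum(p.get(t, 0.0) * q.get(t, 0.0) for t in tags)
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


# Hypothetical, hand-tagged toy documents (not taken from Blog06).
opinion_doc = [("I", "PRP"), ("really", "RB"), ("love", "VBP"),
               ("this", "DT"), ("phone", "NN")]
factual_doc = [("The", "DT"), ("phone", "NN"), ("weighs", "VBZ"),
               ("140", "CD"), ("grams", "NNS")]

opinion_props = pos_proportions(opinion_doc)
factual_props = pos_proportions(factual_doc)
similarity = cosine_similarity(opinion_props, factual_props)

print(opinion_props)
print(factual_props)
print(round(similarity, 3))
```

In practice the tagged tokens would come from a part-of-speech tagger run over each corpus, and the proportions would be aggregated over many documents. A similarity close to 1 between opinion-bearing and non-opinion documents would correspond to the high level of similarity the abstract reports for Blog06.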