Collaborative data cleaning for sentiment classification with noisy training corpus

Authors:
Xiaojun Wan
Affiliations:
Institute of Computer Science and Technology, The MOE Key Laboratory of Computational Linguistics, Peking University, Beijing, China
Venue:
PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part I
Year:
2011

Citing 23
Cited 0

Combining labeled and unlabeled data with co-training

COLT' 98 Proceedings of the eleventh annual conference on Computational learning theory
Data mining: concepts and techniques

Data mining: concepts and techniques
Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms

Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms
Transductive Inference for Text Classification using Support Vector Machines

ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Detecting errors within a corpus using anomaly detection

NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference
Correction of errors in a verb modality corpus for machine translation with a machine-learning method

ACM Transactions on Asian Language Information Processing (TALIP)
Detecting errors in part-of-speech annotation

EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1
Detecting errors in corpora using support vector machines

COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1
Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews

ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
Thumbs up?: sentiment classification using machine learning techniques

EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10
A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts

ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
Deeper sentiment analysis using machine translation technology

COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Correcting category errors in text classification

COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Determining the sentiment of opinions

COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Recognizing contextual polarity in phrase-level sentiment analysis

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Training Data Cleaning for Text Classification

ICTIR '09 Proceedings of the 2nd International Conference on Theory of Information Retrieval: Advances in Information Retrieval Theory
Multilingual subjectivity analysis using machine translation

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Using bilingual knowledge and ensemble techniques for unsupervised Chinese sentiment analysis

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Using emoticons to reduce dependency in machine learning techniques for sentiment classification

ACLstudent '05 Proceedings of the ACL Student Research Workshop
Co-training for cross-lingual sentiment classification

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1
A non-negative matrix tri-factorization approach to sentiment classification with lexical prior knowledge

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1
Discovering the discriminative views: measuring term weights for sentiment analysis

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1
Mine the easy, classify the hard: a semi-supervised approach to automatic sentiment classification

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2

Quantified Score

Hi-index	0.00

Visualization

Abstract

Labeled review corpus is considered as a very valuable resource for the task of sentiment classification of product reviews. Fortunately, there are a large amount of product reviews on the Web, and each review is associated with a tag assigned by users to indicate its polarity orientation. We can download such reviews with tags and use them as training corpus for sentiment classification. However, users may assign the polarity tag arbitrarily and inaccurately, and some tags are not appropriate, which results in that the automatically constructed corpus contains many noises and the noisy instances will deteriorate the classification performance. In this paper, we propose the co-cleaning and tri-cleaning algorithms to collaboratively clean the corpus and thus improve the sentiment classification performance. The proposed algorithms use multiple classifiers to iteratively select and remove the most confidently noisy instances from the corpus. Experimental results verify the effectiveness of our proposed algorithms, and the tri-cleaning algorithm is most effective and promising.