Negative training data can be harmful to text classification

  • Authors: Xiao-Li Li; Bing Liu; See-Kiong Ng

  • Affiliations: Institute for Infocomm Research, Connexis, Singapore; University of Illinois at Chicago, Chicago, IL; Institute for Infocomm Research, Connexis, Singapore

  • Venue: EMNLP '10: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
  • Year: 2010

Abstract

This paper studies the effects of training data on binary text classification and postulates that negative training data is not needed and may even be harmful to the task. Traditional binary classification builds a classifier from labeled positive and negative training examples; the classifier is then applied to classify test instances into the positive and negative classes. A fundamental assumption is that the training and test data are identically distributed, but this assumption may not hold in practice. In this paper, we study a particular setting in which the positive data is identically distributed across training and test sets while the negative data may not be. Many practical text classification and retrieval applications fit this model. We argue that in this setting negative training data should not be used, and that PU (positive and unlabeled) learning can be employed to solve the problem instead. Empirical evaluation supports this claim. The result is important, as it may fundamentally change the current binary classification paradigm.
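For context, PU learning of the kind this line of work builds on (e.g., the two-step strategies in Liu and colleagues' earlier papers, such as S-EM) is commonly realized as: (1) heuristically extract a set of "reliable negatives" from the unlabeled data, then (2) train an ordinary binary classifier on the positives versus those reliable negatives. The sketch below illustrates that generic recipe only; the TF-IDF features, multinomial naive Bayes model, and fixed reliable-negative fraction are illustrative assumptions, not the specific algorithm evaluated in the paper.

```python
# Minimal two-step PU-learning sketch: train a binary text classifier from
# positive (P) and unlabeled (U) documents only, with no labeled negatives.
# Illustrative simplification; see the lead-in above for assumptions.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def pu_learn(pos_docs, unlabeled_docs, reliable_frac=0.2):
    # Build one shared feature space for P and U.
    vec = TfidfVectorizer()
    X = vec.fit_transform(list(pos_docs) + list(unlabeled_docs))
    n_pos = len(pos_docs)
    X_pos, X_unl = X[:n_pos], X[n_pos:]

    # Step 1: fit an initial model that provisionally treats all of U as
    # negative, then keep the unlabeled documents it scores as least likely
    # to be positive as "reliable negatives" (RN).
    y_init = np.r_[np.ones(n_pos), np.zeros(X_unl.shape[0])]
    initial = MultinomialNB().fit(X, y_init)
    p_pos = initial.predict_proba(X_unl)[:, 1]
    k = max(1, int(reliable_frac * X_unl.shape[0]))
    rn = X_unl[np.argsort(p_pos)[:k]]

    # Step 2: retrain on P vs. RN only, discarding the rest of U.
    y_final = np.r_[np.ones(n_pos), np.zeros(k)]
    clf = MultinomialNB().fit(vstack([X_pos, rn]), y_final)
    return vec, clf

# Usage: classify a new document with the learned model.
# vec, clf = pu_learn(positive_texts, unlabeled_texts)
# label = clf.predict(vec.transform(["some new document"]))[0]
```

Because the provisional labels in step 1 are noisy, only the most confidently negative fraction of the unlabeled set is promoted to training data here; practical PU systems often iterate this extract-and-retrain loop (as in S-EM) rather than stopping after a single round.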