Improving Text Classification Accuracy by Training Label Cleaning

Authors:
Andrea Esuli; Fabrizio Sebastiani
Affiliations:
Consiglio Nazionale delle Ricerche, Italy; Consiglio Nazionale delle Ricerche, Italy
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2013

Abstract

In text classification (TC) and other tasks involving supervised learning, labelled data may be scarce or expensive to obtain. Semisupervised learning and active learning are two strategies that aim to maximize the effectiveness of the resulting classifiers for a given amount of training effort. Both strategies have been actively investigated for TC in recent years. Much less research has been devoted to a third such strategy, training label cleaning (TLC), which consists of devising ranking functions that sort the original training examples by how likely it is that the human annotator has mislabelled them. Such a ranking gives the human annotator a convenient means of revising the training set so as to improve its quality. Working in the context of boosting-based learning methods for multilabel classification, we present three different techniques for performing TLC and, on three widely used TC benchmarks, evaluate them by their ability to spot training documents that, for experimental reasons only, we have purposefully mislabelled. We also evaluate the degradation in classification effectiveness that these mislabelled documents bring about, and the extent to which training label cleaning can prevent this degradation.
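
The idea of a TLC ranking function can be illustrated with a minimal sketch. It is not one of the three techniques of the paper, which the abstract does not detail. It assumes that some classifier (boosting-based or otherwise) has already produced a signed confidence score for each training example with respect to one class, and it ranks examples by how strongly that score contradicts the assigned label. The function name, the score convention, and the toy data are all hypothetical.

```python
from typing import List, Tuple


def rank_by_mislabel_likelihood(
    scores: List[float], labels: List[int]
) -> List[Tuple[int, float]]:
    """Return (example index, suspicion) pairs, most suspicious first.

    Assumptions (not taken from the paper):
      * scores[i] is a signed classifier confidence for one class;
        positive means "belongs to the class", negative means "does not".
      * labels[i] is the human-assigned label, +1 or -1.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    # Suspicion is high when a confident prediction disagrees with the label.
    suspicion = [-label * score for score, label in zip(scores, labels)]
    order = sorted(range(len(scores)), key=lambda i: suspicion[i], reverse=True)
    return [(i, suspicion[i]) for i in order]


# Toy data: example 1 is labelled positive but scored confidently negative.
scores = [2.5, -1.8, 0.3, -2.2]
labels = [+1, +1, -1, -1]
ranking = rank_by_mislabel_likelihood(scores, labels)
for index, value in ranking:
    print(index, value)
```

A human annotator would then inspect the examples from the top of the ranking downwards. In practice the scores would presumably need to come from a classifier that was not trained on the example being scored (for instance via cross-validation), which is a further assumption of this sketch.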