Training Data Cleaning for Text Classification

Authors:
Andrea Esuli;Fabrizio Sebastiani
Affiliations:
Istituto di Scienza e Tecnologia dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy 56124;Istituto di Scienza e Tecnologia dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy 56124
Venue:
ICTIR '09 Proceedings of the 2nd International Conference on Theory of Information Retrieval: Advances in Information Retrieval Theory
Year:
2009

References (7)

BoosTexter: A Boosting-based System for Text Categorization
Machine Learning - Special issue on information retrieval

An Evaluation of Statistical Approaches to Text Categorization
Information Retrieval

RCV1: A New Benchmark Collection for Text Categorization Research
The Journal of Machine Learning Research

Detecting errors in part-of-speech annotation
EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1

Detecting errors in corpora using support vector machines
COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1

Correcting category errors in text classification
COLING '04 Proceedings of the 20th international conference on Computational Linguistics

MP-Boost: a multiple-pivot boosting algorithm and its application to text categorization
SPIRE'06 Proceedings of the 13th international conference on String Processing and Information Retrieval

Cited by (4)

Collaborative data cleaning for sentiment classification with noisy training corpus
PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part I

An experimental study of constrained clustering effectiveness in presence of erroneous constraints
Information Processing and Management: an International Journal

A utility-theoretic ranking method for semi-automated text classification
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

Improving Text Classification Accuracy by Training Label Cleaning
ACM Transactions on Information Systems (TOIS)


Abstract

In text classification (TC) and other tasks involving supervised learning, labelled data may be scarce or expensive to obtain; strategies are thus needed for maximizing the effectiveness of the resulting classifiers while minimizing the required amount of training effort. Training data cleaning (TDC) consists of devising ranking functions that sort the original training examples by how likely it is that the human annotator has misclassified them, thereby giving the annotator a convenient means of revising the training set to improve its quality. Working in the context of boosting-based learning methods, we present three different techniques for performing TDC and evaluate them, on two widely used TC benchmarks, by their ability to spot misclassified texts purposefully inserted in the training set.
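The ranking-and-evaluation idea described in the abstract can be illustrated with a minimal sketch. It is not one of the paper's three techniques, which the abstract does not detail; it is a generic confidence-based ranking for a single binary category. The function names `rank_by_suspicion` and `precision_at_k`, and all scores and labels below, are hypothetical. In practice the scores would come from a classifier (for example a boosted one) applied to each training example without having been trained on that example.

```python
from typing import List, Sequence, Set


def rank_by_suspicion(scores: Sequence[float], labels: Sequence[int]) -> List[int]:
    """Return example indices sorted from most to least suspicious.

    scores[i] is a signed classifier confidence (positive means "belongs to
    the category"); labels[i] is the annotator's label, +1 or -1.
    """
    # label * score is strongly negative when the classifier confidently
    # disagrees with the annotator, so sorting ascending puts the most
    # suspicious examples first.
    agreement = [y * s for s, y in zip(scores, labels)]
    return sorted(range(len(labels)), key=lambda i: agreement[i])


def precision_at_k(ranking: Sequence[int], corrupted: Set[int], k: int) -> float:
    """Fraction of the top-k ranked examples whose label was deliberately flipped."""
    return sum(1 for i in ranking[:k] if i in corrupted) / k


# Hypothetical classifier scores and clean labels for eight training examples.
scores = [2.1, 1.7, -1.9, -2.4, 0.3, 1.2, -0.8, -1.5]
clean_labels = [1, 1, -1, -1, 1, 1, -1, -1]

# Simulate annotator errors by flipping the labels of examples 1 and 3.
corrupted = {1, 3}
noisy_labels = [-y if i in corrupted else y for i, y in enumerate(clean_labels)]

ranking = rank_by_suspicion(scores, noisy_labels)
print(ranking)
print(precision_at_k(ranking, corrupted, 2))
```

An annotator would then inspect the training examples from the top of `ranking` downwards. The number of flipped labels found near the top gives a simple measure of how well the ranking function spots inserted errors.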