Improving Text Classification Accuracy by Training Label Cleaning
ACM Transactions on Information Systems (TOIS)
In text classification (TC) and other supervised learning tasks, labelled data may be scarce or expensive to obtain; strategies are thus needed for maximizing the effectiveness of the resulting classifiers while minimizing the required amount of training effort. Training data cleaning (TDC) consists in devising ranking functions that sort the original training examples in terms of how likely it is that the human annotator has misclassified them, thereby providing a convenient means for the annotator to revise the training set and improve its quality. Working in the context of boosting-based learning methods, we present three different techniques for performing TDC and evaluate them, on two widely used TC benchmarks, by their ability to spot misclassified texts purposely inserted into the training set.
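To illustrate the general idea of TDC as described above, the following is a minimal sketch, not the paper's boosting-based ranking functions: it trains a toy centroid classifier on the (possibly noisy) training set and ranks each training example by how confidently the classifier disagrees with the example's given label, so that likely mislabeled examples surface at the top of the ranking for human review. All function names here (`tdc_ranking`, `centroid`) are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tdc_ranking(examples):
    """examples: list of (feature_vector, label) pairs with labels in {0, 1}.

    Returns the example indices sorted so that the examples most likely
    to be mislabeled (classifier confidently disagrees with the given
    label) come first.  Toy stand-in for a TDC ranking function.
    """
    pos = [v for v, y in examples if y == 1]
    neg = [v for v, y in examples if y == 0]
    c_pos, c_neg = centroid(pos), centroid(neg)
    scores = []
    for i, (v, y) in enumerate(examples):
        margin = dot(v, c_pos) - dot(v, c_neg)   # > 0 suggests label 1
        signed = margin if y == 1 else -margin   # agreement with given label
        scores.append((signed, i))               # low/negative = suspicious
    return [i for _, i in sorted(scores)]

# Toy data: the last example looks positive but is labelled 0,
# mimicking a misclassified text inserted into the training set.
examples = [
    ([1.0, 0.0], 1),
    ([0.9, 0.1], 1),
    ([0.0, 1.0], 0),
    ([0.1, 0.9], 0),
    ([0.95, 0.05], 0),  # deliberately mislabeled
]
print(tdc_ranking(examples)[0])  # index of the most suspicious example
```

A real TDC system would replace the centroid classifier with the confidence scores of a boosting-based learner, but the ranking-and-review loop is the same: the annotator inspects the top-ranked examples and corrects any genuine labeling errors.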