Sequential testing in classifier evaluation yields biased estimates of effectiveness

  • Authors:
  • William Webber (University of Maryland, College Park, MD, USA)
  • Mossaab Bagdouri (University of Maryland, College Park, MD, USA)
  • David D. Lewis (David D. Lewis Consulting, Chicago, IL, USA)
  • Douglas W. Oard (University of Maryland, College Park, MD, USA)

  • Venue:
  • Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval
  • Year:
  • 2013

Abstract

It is common to develop and validate classifiers through a process of repeated testing, with nested training and/or test sets of increasing size. We demonstrate in this paper that such repeated testing leads to biased estimates of classifier effectiveness. Experiments on a range of text classification tasks under three sequential testing frameworks show all three lead to optimistic estimates of effectiveness. We calculate empirical adjustments to unbias estimates on our data set, and identify directions for research that could lead to general techniques for avoiding bias while reducing labeling costs.
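The abstract describes the phenomenon rather than a specific algorithm, but the source of the bias can be illustrated with a small simulation. The Python sketch below is a hypothetical example (not the paper's experimental setup): it compares estimating a classifier's accuracy on a test set that is grown in batches and stopped as soon as the estimate reaches a target, against a single fixed-size test set. The parameters `TRUE_ACCURACY`, `BATCH`, `MAX_ROUNDS`, and `TARGET` are made up for the illustration.

```python
import random

random.seed(0)

TRUE_ACCURACY = 0.70   # assumed true accuracy of the classifier (hypothetical)
BATCH = 50             # test labels added per sequential round
MAX_ROUNDS = 10        # maximum number of rounds
TARGET = 0.72          # stop early once the estimate reaches this value

def sequential_estimate():
    """Grow the test set in batches; stop as soon as the estimate looks good enough."""
    correct = total = 0
    for _ in range(MAX_ROUNDS):
        correct += sum(random.random() < TRUE_ACCURACY for _ in range(BATCH))
        total += BATCH
        if correct / total >= TARGET:  # data-dependent stopping rule
            break
    return correct / total

def fixed_estimate():
    """Label the full test set once, with no early stopping."""
    n = BATCH * MAX_ROUNDS
    correct = sum(random.random() < TRUE_ACCURACY for _ in range(n))
    return correct / n

runs = 10_000
seq = sum(sequential_estimate() for _ in range(runs)) / runs
fix = sum(fixed_estimate() for _ in range(runs)) / runs
print(f"true accuracy          : {TRUE_ACCURACY:.3f}")
print(f"sequential (early stop): {seq:.3f}")   # systematically above the true value
print(f"fixed-size test set    : {fix:.3f}")   # close to the true value
```

Because the stopping decision depends on the same labels used to compute the estimate, runs that happen to look good by chance are frozen early, pulling the average reported effectiveness above the true value; the fixed-size evaluation has no such selection effect.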