Active Evaluation of Classifiers on Large Datasets

  • Authors:
  • Namit Katariya, Arun Iyer, Sunita Sarawagi

  • Venue:
  • ICDM '12: Proceedings of the 2012 IEEE 12th International Conference on Data Mining
  • Year:
  • 2012

Abstract

The goal of this work is to estimate the accuracy of a classifier on a large unlabeled dataset, given a small labeled set and a human labeler. We estimate accuracy and select instances for labeling in a loop via a continuously refined stratified sampling strategy. To stratify the data, we develop a novel strategy of learning r-bit hash functions that preserve similarity in accuracy values. We show that our algorithm provides better accuracy estimates than existing methods for learning distance-preserving hash functions. Experiments on a wide spectrum of real datasets show that our estimates achieve between 15% and 62% relative reduction in error compared to existing approaches. We also show how to perform stratified sampling on unlabeled data so large that, in an interactive setting, even a single sequential scan is impractical. We present an optimal algorithm for performing importance sampling over a static index on the data that achieves close to exact estimates while reading three orders of magnitude less data.
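
To make the estimator described in the abstract concrete, the sketch below shows stratified accuracy estimation in miniature: instances are hashed into 2^r strata, a small sample in each stratum is "labeled" for correctness, and the per-stratum accuracies are combined weighted by stratum size. Note the hedges: the paper *learns* hash functions that preserve similarity in accuracy values, whereas this sketch substitutes random hyperplane hashing (a standard LSH scheme) as a stand-in; the function names (`random_hyperplane_hash`, `stratified_accuracy_estimate`) and the precomputed `is_correct` oracle standing in for the human labeler are illustrative assumptions, not the authors' code.

```python
import numpy as np

def random_hyperplane_hash(X, r, rng):
    """Map each row of X to an r-bit bucket id via random hyperplanes.

    Stand-in for the paper's *learned* accuracy-preserving hash
    functions: bit i is the sign of a projection onto a random direction.
    """
    W = rng.standard_normal((X.shape[1], r))            # r random hyperplanes
    bits = (X @ W) > 0                                  # (n, r) boolean matrix
    return bits.astype(np.int64) @ (1 << np.arange(r))  # pack bits into an int id

def stratified_accuracy_estimate(X, is_correct, r=4, per_stratum=20, seed=0):
    """Estimate classifier accuracy on X via stratified sampling.

    `is_correct[i]` plays the role of the human labeler's verdict on
    instance i (1 if the classifier was right); in the paper's interactive
    setting these labels would be acquired on demand inside the loop.
    """
    rng = np.random.default_rng(seed)
    buckets = random_hyperplane_hash(X, r, rng)
    n = len(X)
    estimate = 0.0
    for b in np.unique(buckets):
        idx = np.flatnonzero(buckets == b)
        sample = rng.choice(idx, size=min(per_stratum, len(idx)), replace=False)
        stratum_acc = is_correct[sample].mean()         # accuracy within stratum
        estimate += (len(idx) / n) * stratum_acc        # weight by stratum size
    return estimate

if __name__ == "__main__":
    # Synthetic demo: correctness loosely correlated with feature space,
    # so strata capture some of the accuracy variation.
    rng = np.random.default_rng(1)
    X = rng.standard_normal((10_000, 16))
    is_correct = (rng.random(10_000) < (0.5 + 0.4 * (X[:, 0] > 0))).astype(int)
    print("true accuracy:      ", is_correct.mean())
    print("stratified estimate:", stratified_accuracy_estimate(X, is_correct))
```

With r=4 and 20 labels per stratum, the demo spends at most 320 labels against 10,000 instances; stratification pays off exactly when accuracy varies more across strata than within them, which is why the paper learns the hash functions rather than choosing them at random, and why it adds importance sampling over a static index to avoid even one full scan of the data.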