Reverse testing: an efficient framework to select amongst classifiers under sample selection bias

  • Authors:
  • Wei Fan; Ian Davidson

  • Affiliations:
  • IBM T. J. Watson Research, Hawthorne, NY; University of Albany, State University of New York, Albany, NY

  • Venue:
  • Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
  • Year:
  • 2006

Abstract

One of the most important assumptions made by many classification algorithms is that the training and test sets are drawn from the same distribution, i.e., the so-called "stationary distribution assumption" that future and past data are identical from a probabilistic standpoint. In many real-world application domains, such as marketing solicitation, fraud detection, drug testing, loan approval, sub-population surveys, and school enrollment, among others, this is rarely the case, because the only labeled sample available for training is biased in various ways due to practical reasons and limitations. In these circumstances, traditional methods for estimating the expected generalization error of classification algorithms, such as structural risk minimization, ten-fold cross-validation, and leave-one-out validation, usually return poor estimates of which classification algorithm, trained on the biased dataset, will be the most accurate on a future unbiased dataset, among a number of competing candidates. Sometimes the estimated ranking of the learning algorithms' accuracy is so poor that it is no better than random guessing. A method for determining the most accurate learner is therefore needed for data mining under sample selection bias in many real-world applications. We present such an approach: it determines which learner will perform best on an unbiased test set, given a possibly biased training set, at a fraction of the computational cost of cross-validation-based approaches.
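To make the failure mode concrete, the sketch below (illustrative only, not taken from the paper) builds a synthetic dataset with scikit-learn (an assumed dependency), biases the training sample so that an example's inclusion probability depends on one of its feature values, and then compares the ranking suggested by ten-fold cross-validation on the biased sample against the accuracy each classifier actually achieves on an unbiased test set. The candidate models, the feature-dependent keep probability, and the random seeds are all hypothetical choices made for the demonstration.

```python
# Illustrative sketch (not from the paper): cross-validation on a biased
# training sample vs. accuracy on an unbiased test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=20000, n_features=10,
                           n_informative=5, random_state=0)

# Unbiased split: the first half is the pool we may train on, the second
# half plays the role of the "future" unbiased test set.
X_pool, y_pool = X[:10000], y[:10000]
X_test, y_test = X[10000:], y[10000:]

# Introduce sample selection bias: an example enters the training set with
# probability P(s=1|x) that depends on a feature value (independent of y
# given x), so the labeled sample no longer matches the test distribution.
keep_prob = 1.0 / (1.0 + np.exp(-3.0 * X_pool[:, 0]))
mask = rng.random(len(X_pool)) < keep_prob
X_train, y_train = X_pool[mask], y_pool[mask]

candidates = {
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "naive Bayes": GaussianNB(),
    "logistic regression": LogisticRegression(max_iter=1000),
}

for name, clf in candidates.items():
    # Ranking suggested by ten-fold CV on the biased training set...
    cv_acc = cross_val_score(clf, X_train, y_train, cv=10).mean()
    # ...versus the accuracy that actually matters, on the unbiased test set.
    true_acc = clf.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name:20s} CV-on-biased={cv_acc:.3f}  unbiased-test={true_acc:.3f}")
```

Depending on the strength of the bias, the model that cross-validation scores highest need not be the one that performs best on the unbiased test set; the paper's reverse-testing framework is aimed at recovering the correct ranking in exactly this situation without the cost of repeated cross-validation runs.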