A Bayesian network framework for reject inference

Authors:
Andrew Smith; Charles Elkan
Affiliations:
University of California - San Diego, La Jolla, CA; University of California - San Diego, La Jolla, CA
Venue:
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2004


Abstract

Most learning methods assume that the training set is drawn randomly from the population to which the learned model is to be applied. However, in many applications this assumption is invalid. For example, lending institutions create models of who is likely to repay a loan from training sets consisting of people in their records to whom loans were given in the past; however, the institution approved those past applications based on who was thought unlikely to default. Learning from only approved loans yields an incorrect model because the training set is a biased sample of the general population of applicants. The issue of including rejected samples in the learning process, or alternatively using rejected samples to adjust a model learned from accepted samples only, is called reject inference.

The main contribution of this paper is a systematic analysis of different cases that arise in reject inference, with explanations of which cases arise in various real-world situations. We use Bayesian networks to formalize each case as a set of conditional independence relationships and identify eight cases, including the familiar missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) cases. For each case we present an overview of available learning algorithms. These algorithms have been published in separate fields of research, including epidemiology, econometrics, clinical trial evaluation, sociology, and credit scoring; our second major contribution is to describe these algorithms in a common framework.
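The distinction between the MCAR, MAR, and MNAR cases named in the abstract can be illustrated with a small exact calculation. The sketch below is not taken from the paper: the joint distribution, the acceptance rules, and the names `P_XY`, `SELECTION`, and `p_repay_given_x` are all hypothetical, chosen only to show how the repayment probability estimated from accepted applicants compares with the true one under each kind of selection. It covers three textbook cases, not the eight cases the paper identifies.

```python
# Hypothetical joint distribution over (x, y), for illustration only:
# x = 1 means "high income", y = 1 means "repays the loan".
P_XY = {(0, 0): 0.2, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.5}


def p_repay_given_x(x, accept_prob):
    """Return P(y = 1 | x, accepted), where an applicant (x, y) is
    accepted with probability accept_prob(x, y)."""
    numerator = P_XY[(x, 1)] * accept_prob(x, 1)
    denominator = sum(P_XY[(x, y)] * accept_prob(x, y) for y in (0, 1))
    return numerator / denominator


# Assumed acceptance rules, one per missingness mechanism.
SELECTION = {
    "none": lambda x, y: 1.0,                    # everyone is observed
    "MCAR": lambda x, y: 0.5,                    # ignores both x and y
    "MAR": lambda x, y: 0.9 if x == 1 else 0.2,  # depends on x only
    "MNAR": lambda x, y: 0.9 if y == 1 else 0.2, # depends on the label y
}

for name, rule in SELECTION.items():
    estimates = [round(p_repay_given_x(x, rule), 3) for x in (0, 1)]
    print(name, estimates)
```

In this toy setting the conditional probability of repayment given x is the same under no selection, MCAR, and MAR (0.5 for x = 0 and about 0.833 for x = 1), whereas the MNAR rule inflates it to about 0.818 and 0.957. The marginal distribution of x among accepted applicants still changes under MAR, so this invariance should not be read as saying that any model fitted to accepted samples is unaffected.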