Making generative classifiers robust to selection bias

  • Authors:
  • Andrew T. Smith; Charles Elkan

  • Affiliations:
  • University of California, San Diego; University of California, San Diego

  • Venue:
  • Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
  • Year:
  • 2007

Abstract

This paper presents approaches to semi-supervised learning when the labeled training data and test data are differently distributed. Specifically, the samples selected for labeling are a biased subset of some general distribution, and the test set consists of samples drawn from either that general distribution or the distribution of the unlabeled samples. An example of the former appears in loan application approval, where samples with repay/default labels exist only for approved applicants and the goal is to model the repay/default behavior of all applicants. An example of the latter appears in spam filtering, in which the labeled samples can be outdated due to the cost of labeling email by hand, but an unlabeled set of up-to-date emails exists and the goal is to build a filter to sort new incoming email.

Most approaches to overcoming such bias in the literature rely on the assumption that samples are selected for labeling depending only on the features, not the labels, a case in which provably correct methods exist. The missing labels are then said to be "missing at random" (MAR). In real applications, however, the selection bias can be more severe. When the MAR conditional independence assumption is not satisfied, the missing labels are said to be "missing not at random" (MNAR), and no learning method is provably always correct.

We present a generative classifier, the shifted mixture model (SMM), with separate representations of the distributions of the labeled samples and the unlabeled samples. The SMM makes no conditional independence assumptions and can model distributions of semi-labeled data sets with arbitrary bias in the labeling. We present a learning method based on the expectation maximization (EM) algorithm that, while not always able to overcome arbitrary labeling bias, learns SMMs with higher test-set accuracy on real-world data sets (with MNAR bias) than existing learning methods that are proven to overcome MAR bias.
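To make the MNAR setting concrete, the sketch below simulates label selection whose probability depends on the class label itself, then runs a generic semi-supervised EM for a two-component Gaussian mixture: labeled points are clamped to their classes, unlabeled points receive soft responsibilities. This is only an illustrative baseline under simplified assumptions (1-D Gaussian classes; the selection rates in `p_label` are hypothetical); it is not the paper's SMM, which additionally maintains separate representations for the labeled and unlabeled distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class 1-D Gaussian data.
n = 2000
y = rng.integers(0, 2, size=n)                          # true class labels
x = rng.normal(loc=np.where(y == 0, -1.0, 1.5), scale=1.0)

# MNAR selection: the chance of being labeled depends on the label itself
# (e.g. defaulters are under-represented among approved loan applicants).
# These rates are illustrative, not taken from the paper.
p_label = np.where(y == 0, 0.40, 0.05)
labeled = rng.random(n) < p_label

# Semi-supervised EM for a two-component Gaussian mixture.
pi = np.array([0.5, 0.5])                               # mixing weights
mu = np.array([x.min(), x.max()])                       # component means
sigma = np.array([1.0, 1.0])                            # component std devs

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: responsibilities r[i, k] = P(component k | x_i).
    r = np.stack([pi[k] * gauss(x, mu[k], sigma[k]) for k in range(2)], axis=1)
    r /= r.sum(axis=1, keepdims=True)
    r[labeled] = np.eye(2)[y[labeled]]                  # clamp labeled samples

    # M-step: weighted maximum-likelihood updates.
    nk = r.sum(axis=0)
    pi = nk / n
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

# Classify every sample (labeled or not) by posterior component probability.
post = np.stack([pi[k] * gauss(x, mu[k], sigma[k]) for k in range(2)], axis=1)
pred = post.argmax(axis=1)
print("accuracy on the general distribution:", (pred == y).mean())
```

Under MAR, `p_label` would instead be a function of x alone; the abstract's point is that methods provably correct in that case carry no guarantee once selection also depends on y, which is the regime the SMM targets.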