Generating balanced classifier-independent training samples from unlabeled data

  • Authors: Youngja Park, Zijie Qi, Suresh N. Chari, Ian M. Molloy

  • Affiliations: IBM T.J. Watson Research Center, Yorktown Heights, NY (Park, Chari, Molloy); University of California Davis, Davis, CA (Qi)

  • Venue: PAKDD'12 Proceedings of the 16th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining - Volume Part I
  • Year: 2012

Abstract

We consider the problem of generating balanced training samples from an unlabeled data set with an unknown class distribution. While random sampling works well when the data is balanced, it is very ineffective for unbalanced data. Other approaches, such as active learning and cost-sensitive learning, are also suboptimal: they are classifier-dependent and require misclassification costs and labeled samples. We propose a new strategy for generating training samples that is independent of both the underlying class distribution of the data and the classifier that will be trained on the labeled data. Our methods are iterative and can be seen as variants of active learning, where we use semi-supervised clustering at each iteration to perform biased sampling from the clusters. Several strategies are provided to estimate the underlying class distributions in the clusters and to improve the balance of the training samples. Experiments with both highly skewed and balanced data from the UCI repository and a private data set show that our algorithm produces much more balanced samples than random sampling or uncertainty sampling. Further, our sampling strategy is substantially more efficient than active learning methods. The experiments also validate that, with more balanced training data, classifiers trained with our samples outperform classifiers trained with random sampling or active learning.
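The iterative scheme the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: it substitutes plain k-means for the semi-supervised clustering step, uses a label oracle in place of a human annotator, and all function names, parameters, and the cluster-weighting rule are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

def balanced_sample(X, oracle, n_rounds=5, per_round=20, n_clusters=8, seed=0):
    """Iteratively label points, biasing sampling toward clusters that
    appear (from labels seen so far) to contain the minority class.
    A simplified sketch of the cluster-guided sampling idea, not the
    paper's method."""
    rng = np.random.default_rng(seed)
    # Round 1: no information yet, so sample uniformly at random.
    labeled_idx = [int(i) for i in
                   rng.choice(len(X), size=per_round, replace=False)]
    labels = {i: oracle(i) for i in labeled_idx}
    for _ in range(n_rounds - 1):
        # Plain k-means as a stand-in for semi-supervised clustering.
        assign = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed).fit(X).labels_
        y_seen = np.array([labels[i] for i in labeled_idx])
        # Estimate which class is the minority; assume class 1 if only
        # one class has been observed so far (an arbitrary assumption).
        minority = int(np.argmin(np.bincount(y_seen))) \
            if len(set(y_seen)) > 1 else 1
        # Weight each cluster by its observed minority-class fraction.
        weights = np.ones(n_clusters)
        for c in range(n_clusters):
            seen = [labels[i] for i in labeled_idx if assign[i] == c]
            if seen:
                weights[c] = 0.1 + np.mean(np.array(seen) == minority)
        # Biased sampling: draw new unlabeled points with cluster weights.
        p = weights[assign].astype(float)
        p[labeled_idx] = 0.0          # never re-sample labeled points
        p /= p.sum()
        for i in rng.choice(len(X), size=per_round, replace=False, p=p):
            labels[int(i)] = oracle(int(i))
            labeled_idx.append(int(i))
    return labeled_idx, labels

# Demo on synthetic 95/5 skewed data; the oracle simply reveals y[i].
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
sample_idx, sample_labels = balanced_sample(X, oracle=lambda i: int(y[i]))
minority_frac = np.mean(list(sample_labels.values()))
```

With skewed data, the intent is that `minority_frac` exceeds the ~5% base rate a uniform random sample would yield, since later rounds concentrate queries in minority-rich clusters.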