Analyzing PETs on imbalanced datasets when training and testing class distributions differ

  • Authors:
  • David Cieslak; Nitesh Chawla

  • Affiliations:
  • University of Notre Dame, Notre Dame, IN (both authors)

  • Venue:
  • PAKDD'08: Proceedings of the 12th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
  • Year:
  • 2008

Abstract

Many machine learning applications, such as those in finance, medicine, and risk management, suffer from class imbalance: the cases of interest occur rarely. These applications are further complicated when the training and testing samples differ significantly in their respective class distributions. Sampling has been shown to be a strong remedy for imbalance and additionally offers a rich parameter space from which to select classifiers. This paper is concerned with the interaction between Probability Estimation Trees (PETs) [1], sampling, and performance metrics as testing distributions fluctuate substantially. A comprehensive set of analyses is presented that anticipates classifier performance across widely varying testing distributions.
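
The setup the abstract describes can be illustrated with a short sketch (not the paper's experimental protocol): an unpruned decision tree used as a probability estimator stands in for a PET, random oversampling stands in for the paper's sampling methods, and the model is scored on test sets whose class priors are shifted away from the training distribution. The scikit-learn library, the synthetic dataset, and the imbalance ratios below are illustrative assumptions.

```python
# Sketch: train a tree-based probability estimator on an oversampled imbalanced
# training set, then evaluate it under several shifted testing class distributions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Imbalanced synthetic data: roughly 5% positives (illustrative, not from the paper).
X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          stratify=y, random_state=0)

# Random oversampling of the minority class (a simple stand-in for the
# sampling strategies studied in the paper).
pos, neg = np.where(y_tr == 1)[0], np.where(y_tr == 0)[0]
pos_up = rng.choice(pos, size=len(neg), replace=True)
idx = np.concatenate([neg, pos_up])

# A fully grown (unpruned) decision tree whose leaf frequencies give
# probability estimates, used here as a rough PET analogue.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])

def resample_to_prior(X, y, pos_rate, n, rng):
    """Draw a test set of size n with the requested positive-class rate."""
    p, q = np.where(y == 1)[0], np.where(y == 0)[0]
    n_pos = int(round(pos_rate * n))
    take = np.concatenate([rng.choice(p, n_pos, replace=True),
                           rng.choice(q, n - n_pos, replace=True)])
    return X[take], y[take]

# Evaluate under widely varying testing class distributions.
for pos_rate in (0.01, 0.05, 0.20, 0.50):
    Xs, ys = resample_to_prior(X_te, y_te, pos_rate, 5_000, rng)
    auc = roc_auc_score(ys, tree.predict_proba(Xs)[:, 1])
    print(f"test positive rate {pos_rate:.2f}: AUC = {auc:.3f}")
```

Sweeping the test-time positive rate in this way mimics the paper's concern: a classifier selected under one class distribution may be deployed under a quite different one, so its probability estimates and metric scores should be examined across that range rather than at a single operating point.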