Learning Decision Trees for Unbalanced Data

Authors:
David A. Cieslak;Nitesh V. Chawla
Affiliations:
University of Notre Dame, Notre Dame, IN 46556, USA;University of Notre Dame, Notre Dame, IN 46556, USA
Venue:
ECML PKDD '08 Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases - Part I
Year:
2008

Abstract

Learning from unbalanced datasets presents a convoluted problem in which traditional learning algorithms may perform poorly. The objective functions used for learning the classifiers typically tend to favor the larger, less important classes in such problems. This paper compares the performance of several popular decision tree splitting criteria (information gain, Gini measure, and DKM) and identifies a new skew-insensitive measure in Hellinger distance. We outline the strengths of Hellinger distance under class imbalance, propose its application in forming decision trees, and perform a comprehensive comparative analysis of the decision tree construction methods. In addition, we consider the performance of each tree within a powerful sampling wrapper framework to capture the interaction between the splitting metric and sampling. We evaluate over a wide range of datasets and determine which methods operate best under class imbalance.
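
The abstract does not state the splitting formulas, so the following is only an illustrative sketch of what skew insensitivity can mean for a binary split on two-class data. It assumes the form commonly given for Hellinger distance as a split criterion, which compares the class-conditional distributions of positives and negatives across the two branches, and contrasts it with a standard information gain computation. The function names and the example counts are hypothetical and are not taken from the paper.

```python
import math


def hellinger_split(pos_left, pos_right, neg_left, neg_right):
    """Hellinger distance between the class-conditional branch distributions.

    Assumed form: sqrt(sum_j (sqrt(pos_j / pos) - sqrt(neg_j / neg)) ** 2),
    where j ranges over the two branches of the split.
    """
    pos = pos_left + pos_right
    neg = neg_left + neg_right
    if pos == 0 or neg == 0:
        return 0.0  # one class is absent, so there is nothing to separate
    total = 0.0
    for p, n in ((pos_left, neg_left), (pos_right, neg_right)):
        total += (math.sqrt(p / pos) - math.sqrt(n / neg)) ** 2
    return math.sqrt(total)


def entropy(pos, neg):
    """Shannon entropy (in bits) of a two-class node."""
    n = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count > 0:
            q = count / n
            h -= q * math.log2(q)
    return h


def information_gain(pos_left, pos_right, neg_left, neg_right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n_left = pos_left + neg_left
    n_right = pos_right + neg_right
    n = n_left + n_right
    parent = entropy(pos_left + pos_right, neg_left + neg_right)
    children = (n_left / n) * entropy(pos_left, neg_left) \
        + (n_right / n) * entropy(pos_right, neg_right)
    return parent - children


# Hypothetical split: 80% of positives and 20% of negatives go left.
balanced = (40, 10, 10, 40)    # 50 positives, 50 negatives
skewed = (40, 10, 100, 400)    # same rates, ten times as many negatives

print("Hellinger:", hellinger_split(*balanced), hellinger_split(*skewed))
print("Info gain:", information_gain(*balanced), information_gain(*skewed))
```

In this sketch the Hellinger value depends only on the fraction of each class sent to each branch, so multiplying the negative counts by ten leaves it unchanged, while information gain depends on the class prior and shifts with the skew.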