Hellinger distance decision trees are robust and skew-insensitive

  • Authors:
  • David A. Cieslak; T. Ryan Hoens; Nitesh V. Chawla; W. Philip Kegelmeyer

  • Affiliations:
  • University of Notre Dame, Notre Dame, USA (all authors)

  • Venue:
  • Data Mining and Knowledge Discovery
  • Year:
  • 2012

Abstract

Learning from imbalanced data is an important and common problem. Decision trees, supplemented with sampling techniques, have proven to be an effective way to address it. Despite their effectiveness, however, sampling methods add complexity and require parameter selection. To bypass these difficulties, we propose a new decision tree technique called Hellinger Distance Decision Trees (HDDT), which uses Hellinger distance as the splitting criterion. We analytically and empirically demonstrate the strong skew insensitivity of Hellinger distance and its advantages over popular alternatives such as entropy (gain ratio). We apply a comprehensive empirical evaluation framework, testing against commonly used sampling and ensemble methods across 58 varied datasets. Using robust tests of statistical significance, we demonstrate the superiority of HDDT on imbalanced data, as well as its competitive performance on balanced datasets. We thereby arrive at the particularly practical conclusion that for imbalanced data it is sufficient to use Hellinger trees with bagging (BG), without any sampling methods. All datasets and software for this paper are available online (http://www.nd.edu/~dial/hddt).
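
To make the splitting criterion concrete: for a candidate binary split that partitions the data into branches j, HDDT scores the split by the Hellinger distance between the within-class branch distributions, d_H = sqrt( sum over j of ( sqrt(|X+,j| / |X+|) - sqrt(|X-,j| / |X-|) )^2 ). The sketch below is a minimal Python illustration of this formula for a threshold split on a single numeric feature, not the authors' released implementation; the function name and NumPy-based interface are assumptions made for the example.

```python
import numpy as np

def hellinger_split_value(feature_values, labels, threshold, positive_label=1):
    """Hellinger distance between the within-class distributions induced by
    the binary split feature <= threshold vs. feature > threshold.

    Illustrative sketch only; name and interface are assumptions, not the
    authors' released HDDT code.
    """
    feature_values = np.asarray(feature_values)
    labels = np.asarray(labels)
    pos = labels == positive_label
    neg = ~pos
    n_pos, n_neg = pos.sum(), neg.sum()
    if n_pos == 0 or n_neg == 0:
        return 0.0  # split quality is undefined when only one class is present
    total = 0.0
    for branch in (feature_values <= threshold, feature_values > threshold):
        p = (branch & pos).sum() / n_pos  # fraction of positives in this branch
        q = (branch & neg).sum() / n_neg  # fraction of negatives in this branch
        total += (np.sqrt(p) - np.sqrt(q)) ** 2
    return float(np.sqrt(total))

# Example: choose the candidate threshold that maximizes the Hellinger distance.
X = np.array([0.2, 0.5, 0.9, 1.4, 2.0, 2.2])
y = np.array([0, 0, 0, 0, 1, 1])
best_threshold = max(X[:-1], key=lambda t: hellinger_split_value(X, y, t))
```

Because each branch count is normalized by its own class total, the score depends only on P(branch | class) and is unaffected (in expectation) by how skewed the class prior is or by uniform under- or over-sampling of a class; an entropy-based gain, by contrast, shifts with the prior. This is the skew insensitivity the abstract refers to.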