Building decision trees for the multi-class imbalance problem

  • Authors:
  • T. Ryan Hoens, Qi Qian, Nitesh V. Chawla, Zhi-Hua Zhou

  • Affiliations:
  • Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN (Hoens, Chawla); National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China (Qian, Zhou)

  • Venue:
  • PAKDD'12: Proceedings of the 16th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining - Volume Part I
  • Year:
  • 2012

Abstract

Learning from imbalanced datasets is a pervasive problem in a wide variety of real-world applications. In imbalanced datasets, the class of interest generally constitutes a small fraction of the total instances, yet misclassifying such instances is often expensive. While there is a significant body of research on the class imbalance problem for binary-class datasets, multi-class datasets have received considerably less attention. This is partially because the multi-class imbalance problem is often much harder than the related binary-class problem, as the relative frequency and cost of each class can vary widely from dataset to dataset. In this paper we study the multi-class imbalance problem as it relates to decision trees (specifically C4.4 and HDDT) and develop a new multi-class splitting criterion. Our experiments show that multi-class Hellinger distance decision trees, when combined with decomposition techniques, outperform C4.4.
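For readers unfamiliar with HDDT, the sketch below illustrates the binary-class Hellinger distance splitting criterion that Hellinger distance decision trees build on: a candidate split is scored by the Hellinger distance between the class-conditional distributions of instances across the split's branches. This is only an illustrative approximation under assumed conventions (the function name `hellinger_split_value` and its arguments are hypothetical); the paper's contribution is a multi-class generalization of this criterion, which the binary form below does not capture.

```python
import numpy as np

def hellinger_split_value(feature_values, labels, pos_label=1):
    """Score a categorical split with the binary-class Hellinger distance.

    Illustrative sketch only: sums, over the branches induced by the
    feature's values, the squared difference between the square roots of
    P(branch | positive class) and P(branch | negative class).
    """
    feature_values = np.asarray(feature_values)
    labels = np.asarray(labels)

    pos = labels == pos_label
    neg = ~pos
    n_pos, n_neg = pos.sum(), neg.sum()

    total = 0.0
    for v in np.unique(feature_values):
        in_branch = feature_values == v
        p = (in_branch & pos).sum() / n_pos  # P(branch | positive class)
        q = (in_branch & neg).sum() / n_neg  # P(branch | negative class)
        total += (np.sqrt(p) - np.sqrt(q)) ** 2
    return np.sqrt(total)

# Example: score a three-valued feature on a small imbalanced sample.
x = np.array(["a", "a", "b", "b", "b", "c", "c", "c", "c", "c"])
y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
print(hellinger_split_value(x, y))
```

Because the criterion compares class-conditional branch distributions rather than raw branch purities, it is insensitive to the class priors, which is the property that makes Hellinger-based splits attractive under class imbalance.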