C4.5: Programs for Machine Learning.
Machine Learning, Neural and Statistical Classification.
MetaCost: a general method for making classifiers cost-sensitive. KDD '99: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Machine Learning.
Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, special issue on learning from imbalanced data sets.
Decision trees with minimal costs. ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning.
The class imbalance problem: a systematic study. Intelligent Data Analysis.
On multi-class cost-sensitive learning. AAAI '06: Proceedings of the 21st National Conference on Artificial Intelligence, Volume 1.
Learning when training data are costly: the effect of class distribution on tree induction. Journal of Artificial Intelligence Research.
Multi-agent based classification using argumentation from experience. PAKDD '11: Proceedings of the 15th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, Part II.
Large scale visual classification with many classes. MLDM '13: Proceedings of the 9th International Conference on Machine Learning and Data Mining in Pattern Recognition.
Editorial: parameter-free classification in multi-class imbalanced data sets. Data & Knowledge Engineering.
In data mining, large differences in prior class probabilities, known as the class imbalance problem, have been reported to hinder the performance of classifiers such as decision trees. Dealing with imbalanced and cost-sensitive data has been recognized as one of the ten most challenging problems in data mining research. In decision tree learning, many splitting measures are based on the concept of Shannon's entropy. A major characteristic of these entropies is that they take their maximal value when the distribution of the modalities of the class variable is uniform. To deal with the class imbalance problem, we proposed an off-centered entropy, which takes its maximal value for a distribution fixed by the user. This distribution can be the a priori distribution of the class modalities or a distribution that takes misclassification costs into account. Other authors have proposed an asymmetric entropy. In this paper we present the concepts behind the three entropies and compare their effectiveness on 20 imbalanced data sets. All our experiments are based on the C4.5 decision tree algorithm, in which only the entropy function is modified. The results are promising and show the value of off-centered entropies for dealing with the class imbalance problem.
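To make the off-centering idea concrete, here is a minimal sketch in Python for the two-class case. It remaps the class frequency p so that the entropy peaks at a user-chosen value theta (e.g. the minority-class prior) instead of 0.5. The piecewise-linear remapping used below is an illustrative assumption, not necessarily the paper's exact formula.

```python
import math

def shannon_entropy(p):
    """Binary Shannon entropy in bits; maximal (= 1) at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def off_centered_entropy(p, theta):
    """Off-centered binary entropy (illustrative sketch).

    Remaps p in [0, 1] onto pi in [0, 1] so that p = theta lands on
    pi = 0.5, then applies Shannon entropy. The result is 0 for pure
    nodes and maximal (= 1) when the class frequency equals theta.
    """
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    if p <= theta:
        pi = p / (2 * theta)
    else:
        pi = (p + 1 - 2 * theta) / (2 * (1 - theta))
    return shannon_entropy(pi)

# With theta = 0.1, a node whose minority-class frequency is 10%
# now scores as maximally impure, unlike standard Shannon entropy.
print(off_centered_entropy(0.1, theta=0.1))  # 1.0
print(shannon_entropy(0.1))                  # ~0.469
```

In a C4.5-style learner, one would substitute this function for Shannon's entropy inside the gain computation while leaving the rest of the tree-induction algorithm untouched, which is exactly the kind of single-point modification the experiments describe.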