In real-world problems addressed with data mining techniques, it is common to find data in which one class has far fewer examples than the rest. A large body of work deals with these so-called class imbalance problems. Most of it focuses on resampling techniques that improve the training data, usually by balancing the classes, before a classical learning algorithm is applied. Another option is to modify the learning algorithm itself. As a combination of these two options, we previously proposed the consolidation process, which is based on resampling the training data and modifying the learning algorithm, in this study C4.5. In this work we experimented with 14 databases and compared the effectiveness of each strategy by the AUC values achieved. The results show that consolidation obtains the best performance compared with five well-known resampling methods, including SMOTE and some of its variants. Thus, the consolidation process, combined with subsamples that balance the class distribution, is appropriate for class imbalance problems that require both explanatory power and high discriminating capacity.
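The resampling step described above, drawing several class-balanced subsamples from an imbalanced training set before tree construction, can be sketched as follows. This is a minimal illustration only: it assumes simple random undersampling of the majority class to the minority-class size, and the function name and parameters are our own; the consolidation (tree-building and merging) step of the actual algorithm is not shown.

```python
import random

def balanced_subsamples(X, y, n_subsamples, minority_label, seed=0):
    """Draw n_subsamples class-balanced subsamples by undersampling
    the majority class down to the size of the minority class.

    Illustrative sketch only; the paper's consolidation process would
    then build one tree per subsample and merge them into a single tree.
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    subsamples = []
    for _ in range(n_subsamples):
        # Keep every minority example; sample an equal number of majority ones.
        picked = rng.sample(majority, len(minority))
        idx = minority + picked
        subsamples.append(([X[i] for i in idx], [y[i] for i in idx]))
    return subsamples
```

Each returned subsample has a 50/50 class distribution, which is the balanced setting the abstract refers to; other target distributions could be obtained by sampling a different number of majority examples.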