Obtaining optimal class distribution for decision trees: comparative analysis of CTC and C4.5

Authors:
Iñaki Albisua;Olatz Arbelaitz;Ibai Gurrutxaga;José I. Martín;Javier Muguerza
Affiliations:
Dept. of Computer Architecture and Technology, University of the Basque Country, Donostia, Spain;Dept. of Computer Architecture and Technology, University of the Basque Country, Donostia, Spain;Dept. of Computer Architecture and Technology, University of the Basque Country, Donostia, Spain;Dept. of Computer Architecture and Technology, University of the Basque Country, Donostia, Spain;Dept. of Computer Architecture and Technology, University of the Basque Country, Donostia, Spain
Venue:
CAEPIA'09 Proceedings of the Current topics in artificial intelligence, and 13th conference on Spanish association for artificial intelligence
Year:
2009

Citing 10

C4.5: programs for machine learning

Exploiting the Cost (In)sensitivity of Decision Tree Splitting Criteria
ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning

Combining multiple class distribution modified subsamples in a single tree
Pattern Recognition Letters

Statistical Comparisons of Classifiers over Multiple Data Sets
The Journal of Machine Learning Research

The class imbalance problem: A systematic study
Intelligent Data Analysis

Maximizing the area under the ROC curve by pairwise feature combination
Pattern Recognition

Evolutionary rule-based systems for imbalanced data sets
Soft Computing - A Fusion of Foundations, Methodologies and Applications - Special Issue on Evolutionary and Metaheuristics based Data Mining (EMBDM); Guest Editors: José A. Gámez, María J. del Jesús, José M. Puerta

SMOTE: synthetic minority over-sampling technique
Journal of Artificial Intelligence Research

Learning when training data are costly: the effect of class distribution on tree induction
Journal of Artificial Intelligence Research

AUC: a better measure than accuracy in comparing learning algorithms
AI'03 Proceedings of the 16th Canadian society for computational studies of intelligence conference on Advances in artificial intelligence

Cited 1

C4.5 consolidation process: an alternative to intelligent oversampling methods in class imbalance problems
CAEPIA'11 Proceedings of the 14th international conference on Advances in artificial intelligence: spanish association for artificial intelligence

Abstract

When machine learning is used to solve real-world problems, the class distribution of the training set matters, not only in highly unbalanced data sets but in every data set. Weiss and Provost suggested that each domain has an optimal class distribution to be used for training. The aim of this work was to test the validity of this hypothesis in the context of decision tree learners. To this end, we found the optimal class distribution for 30 databases and two decision tree learners, C4.5 and the Consolidated Tree Construction (CTC) algorithm, considering both pruned and unpruned trees and using two measures of discriminating capacity: AUC and error. The results confirmed that changing the class distribution of the training samples improves the performance (AUC and error) of the classifiers. The experiments thus showed that there is an optimal class distribution for each database, and that this distribution depends on the learning algorithm used, on whether the trees are pruned, and on the evaluation criterion used. In addition, the results showed that the CTC algorithm combined with samples of optimal class distribution yields more accurate learners than any of the C4.5 and CTC options trained with the original distribution, with statistically significant differences.
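The search for a training class distribution described in the abstract can be illustrated with a small sketch. It is not the authors' experimental code: the helper names (`subsample_with_distribution`, `auc`, `best_distribution`), the two-class setting, and sampling without replacement are all assumptions made for illustration, and the tree learner itself (C4.5 or CTC) is left as a user-supplied `evaluate` callable.

```python
import random


def subsample_with_distribution(labels, minority_label, minority_fraction,
                                size, seed=0):
    """Return indices of a subsample of `size` examples in which roughly
    `minority_fraction` of the examples carry `minority_label`.
    (Hypothetical helper, two-class setting assumed.)"""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    n_min = round(size * minority_fraction)
    n_maj = size - n_min
    if n_min > len(minority) or n_maj > len(majority):
        raise ValueError("not enough examples for the requested distribution")
    # Sample without replacement so no example is duplicated.
    return rng.sample(minority, n_min) + rng.sample(majority, n_maj)


def auc(scores, labels, positive_label):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == positive_label]
    neg = [s for s, y in zip(scores, labels) if y != positive_label]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_distribution(candidate_fractions, evaluate):
    """Pick the candidate minority fraction with the highest score.
    `evaluate` is a user-supplied callable (for example: train a tree on
    a subsample with that fraction and return its validation AUC)."""
    return max(candidate_fractions, key=evaluate)
```

With error as the criterion instead of AUC, `evaluate` would return the negated error rate so that `max` still selects the best candidate. The set of candidate fractions to try is left to the caller, since the abstract does not list the values used.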