An experimental comparison of classification algorithms for imbalanced credit scoring data sets

Authors:
Iain Brown;Christophe Mues
Affiliations:
School of Management, University of Southampton, Highfield, Southampton SO17 1BJ, UK;School of Management, University of Southampton, Highfield, Southampton SO17 1BJ, UK
Venue:
Expert Systems with Applications: An International Journal
Year:
2012



Abstract

In this paper, we compare several techniques that can be used in the analysis of imbalanced credit scoring data sets. In a credit scoring context, imbalanced data sets occur frequently because the number of defaulting loans in a portfolio is usually much lower than the number of loans that do not default. In addition to traditional classification techniques such as logistic regression, neural networks and decision trees, this paper also explores the suitability of gradient boosting, least squares support vector machines and random forests for loan default prediction. Five real-world credit scoring data sets are used to build classifiers and test their performance. In our experiments, we progressively increase class imbalance in each of these data sets by randomly under-sampling the minority class of defaulters, so as to identify to what extent the predictive power of the respective techniques is adversely affected. The performance criterion chosen to measure this effect is the area under the receiver operating characteristic curve (AUC); Friedman's statistic and Nemenyi post hoc tests are used to test for significance of AUC differences between techniques. The results from this empirical study indicate that the random forest and gradient boosting classifiers perform very well in a credit scoring context and cope comparatively well with pronounced class imbalances in these data sets. We also found that, when faced with a large class imbalance, the C4.5 decision tree algorithm, quadratic discriminant analysis and k-nearest neighbours perform significantly worse than the best performing classifiers.
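The evaluation procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names (`undersample_minority`, `auc`, `average_ranks`, `friedman_statistic`), the 0/1 label encoding with 1 marking a defaulter, and the choice of pure standard-library Python are all assumptions made for illustration. The sketch shows three steps: thinning the minority class to a target share, computing AUC as the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter, and computing the Friedman chi-square statistic from average ranks of AUC values across data sets (without a tie correction).

```python
import random


def undersample_minority(labels, minority_fraction, seed=0):
    """Return sorted row indices that keep every majority row (label 0) and a
    random subset of minority rows (label 1), so that minority rows make up
    roughly `minority_fraction` of the result. Hypothetical helper."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    # Solve n_min / (n_min + n_maj) = f for n_min.
    target = round(minority_fraction * len(majority) / (1.0 - minority_fraction))
    target = min(target, len(minority))  # cannot keep more than we have
    return sorted(majority + rng.sample(minority, target))


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the positive
    has the higher score; tied pairs count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_ranks(auc_table):
    """auc_table has one row per data set and one column per classifier.
    Rank 1 is the highest AUC in a row; tied values share their mean rank."""
    k = len(auc_table[0])
    totals = [0.0] * k
    for row in auc_table:
        for j, value in enumerate(row):
            better = sum(1 for other in row if other > value)
            tied = sum(1 for other in row if other == value)  # includes itself
            totals[j] += better + (tied + 1) / 2.0
    return [t / len(auc_table) for t in totals]


def friedman_statistic(auc_table):
    """Friedman chi-square statistic computed from average ranks
    (no tie correction): 12N / (k(k+1)) * (sum R_j^2 - k(k+1)^2 / 4)."""
    n, k = len(auc_table), len(auc_table[0])
    ranks = average_ranks(auc_table)
    return 12.0 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4.0)
```

In use, one would call `undersample_minority` with progressively smaller values of `minority_fraction`, train each classifier on the resulting rows, score a held-out set with `auc`, and collect the AUC values into a table for `friedman_statistic`. A significant Friedman result would then be followed by a Nemenyi post hoc comparison of the average ranks, which is not implemented here.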