C4.5: programs for machine learning
C4.5: programs for machine learning
Efficient progressive sampling
KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Neural network credit scoring models
Computers and Operations Research - Neural networks in business
Machine Learning
Computational Statistics & Data Analysis - Nonlinear methods and data mining
A study of the behavior of several methods for balancing machine learning training data
ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
Statistical Comparisons of Classifiers over Multiple Data Sets
The Journal of Machine Learning Research
IEEE Transactions on Software Engineering
SMOTE: synthetic minority over-sampling technique
Journal of Artificial Intelligence Research
Learning when training data are costly: the effect of class distribution on tree induction
Journal of Artificial Intelligence Research
Data Mining: Practical Machine Learning Tools and Techniques
Data Mining: Practical Machine Learning Tools and Techniques
Improving risk predictions by preprocessing imbalanced credit data
ICONIP'12 Proceedings of the 19th international conference on Neural Information Processing - Volume Part II
Pattern Recognition Letters
Expert Systems with Applications: An International Journal
Hi-index | 12.05 |
In this paper, we set out to compare several techniques that can be used in the analysis of imbalanced credit scoring data sets. In a credit scoring context, imbalanced data sets frequently occur as the number of defaulting loans in a portfolio is usually much lower than the number of observations that do not default. As well as using traditional classification techniques such as logistic regression, neural networks and decision trees, this paper will also explore the suitability of gradient boosting, least square support vector machines and random forests for loan default prediction. Five real-world credit scoring data sets are used to build classifiers and test their performance. In our experiments, we progressively increase class imbalance in each of these data sets by randomly under-sampling the minority class of defaulters, so as to identify to what extent the predictive power of the respective techniques is adversely affected. The performance criterion chosen to measure this effect is the area under the receiver operating characteristic curve (AUC); Friedman's statistic and Nemenyi post hoc tests are used to test for significance of AUC differences between techniques. The results from this empirical study indicate that the random forest and gradient boosting classifiers perform very well in a credit scoring context and are able to cope comparatively well with pronounced class imbalances in these data sets. We also found that, when faced with a large class imbalance, the C4.5 decision tree algorithm, quadratic discriminant analysis and k-nearest neighbours perform significantly worse than the best performing classifiers.