A critical assessment of imbalanced class distribution problem: The case of predicting freshmen student attrition

Authors:
Dech Thammasiri;Dursun Delen;Phayung Meesad;Nihat Kasap
Affiliations:
Faculty of Information Technology, King Mongkut's University of Technology North Bangkok Bangsue, Bangkok 10800, Thailand;Spears School of Business, Department of Management Science and Information Systems, Oklahoma State University, Tulsa, OK 74106, USA;Faculty of Information Technology, King Mongkut's University of Technology North Bangkok Bangsue, Bangkok 10800, Thailand;School of Management, Sabanci University, Istanbul 34956, Turkey
Venue:
Expert Systems with Applications: An International Journal
Year:
2014

Quantified Score

Hi-index	12.05

Abstract

Predicting student attrition is an intriguing yet challenging problem for any academic institution. Class-imbalanced data is common in the field of student retention, mainly because many students register but far fewer drop out. Classification techniques applied to imbalanced datasets can yield deceptively high prediction accuracy: the overall accuracy is usually driven by the majority class, at the expense of very poor performance on the crucial minority class. In this study, we compared different data-balancing techniques to improve predictive accuracy on the minority class while maintaining satisfactory overall classification performance. Specifically, we tested three balancing techniques (over-sampling, under-sampling and synthetic minority over-sampling, SMOTE) along with four popular classification methods (logistic regression, decision trees, neural networks and support vector machines). We used a large, feature-rich institutional student data set (covering the years 2005 to 2011) to assess the efficacy of both the balancing techniques and the prediction methods. The results indicated that the support vector machine combined with the SMOTE data-balancing technique achieved the best classification performance, with a 90.24% overall accuracy on the 10-fold holdout sample. All three data-balancing techniques improved the prediction accuracy for the minority class. By applying sensitivity analyses to the developed models, we also identified the most important variables for accurate prediction of student attrition. Application of these models has the potential to accurately identify at-risk students and help reduce student dropout rates.
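The accuracy problem and the three balancing techniques named in the abstract can be illustrated with a minimal Python sketch. This is not the study's implementation or data: the toy class sizes (95 retained, 5 dropped), the function names, and the parameter `k` are all assumptions made for illustration, and `smote_sketch` is a simplified version of SMOTE that interpolates between a minority point and one of its nearest minority neighbours.

```python
import random
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of the `positive` class that was correctly predicted."""
    hits = sum(t == p == positive for t, p in zip(y_true, y_pred))
    return hits / sum(t == positive for t in y_true)


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


def smote_sketch(minority, n_new, k=3, seed=0):
    """Simplified SMOTE: create points on segments between minority neighbours.

    Assumes numeric feature tuples and at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


if __name__ == "__main__":
    # Toy data (hypothetical): 95 retained students, 5 who dropped out.
    rows = [(float(i), 0.0) for i in range(95)] + [(float(i), 1.0) for i in range(5)]
    labels = ["retained"] * 95 + ["dropped"] * 5

    # A model that always predicts the majority class looks accurate
    # but never identifies a single student who dropped out.
    always_majority = ["retained"] * len(labels)
    print("accuracy:", accuracy(labels, always_majority))
    print("minority recall:", recall(labels, always_majority, "dropped"))

    print("over-sampled:", Counter(random_oversample(rows, labels)[1]))
    print("under-sampled:", Counter(random_undersample(rows, labels)[1]))
    print("synthetic minority rows:", len(smote_sketch(rows[95:], 90)))
```

In practice, any balancing step of this kind would normally be applied to the training folds only, so that the held-out data keeps the original class distribution.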