Predicting student failure at school using genetic programming and different data mining approaches with high dimensional and imbalanced data

Authors:
Carlos Márquez-Vera;Alberto Cano;Cristóbal Romero;Sebastián Ventura
Affiliations:
Autonomous University of Zacatecas, Zacatecas, México;Department of Computer Science, University of Córdoba, Córdoba, Spain;Department of Computer Science, University of Córdoba, Córdoba, Spain;Department of Computer Science, University of Córdoba, Córdoba, Spain
Venue:
Applied Intelligence
Year:
2013

Abstract

Predicting student failure at school is difficult for two reasons: many factors can contribute to poor student performance, and the datasets involved are typically imbalanced. In this paper, a genetic programming algorithm and different data mining approaches are proposed for solving these problems using real data about 670 high school students from Zacatecas, Mexico. First, we select the best attributes to address the problem of high dimensionality. Then we apply data rebalancing and cost-sensitive classification to address the problem of classifying imbalanced data. We also propose using a genetic programming model, compared against several white-box techniques, to obtain classification rules that are both more comprehensible and more accurate. The results of each approach are reported and compared in order to select the one that most improves classification accuracy, particularly in identifying which students might fail.
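
The two remedies for class imbalance named in the abstract, data rebalancing and cost-sensitive classification, can be illustrated with a minimal Python sketch. This is not the authors' implementation. Plain random oversampling stands in for whatever rebalancing method the paper uses, and the function names, toy data, and cost values are all hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many examples as the largest class.
    (Plain random oversampling, used here only as a stand-in.)"""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def expected_cost(y_true, y_pred, cost):
    """Average misclassification cost per example.
    cost maps (true_label, predicted_label) to a penalty;
    pairs that are absent (e.g. correct predictions) cost 0."""
    total = sum(cost.get((t, p), 0.0) for t, p in zip(y_true, y_pred))
    return total / len(y_true)


# Hypothetical toy data: 9 students who pass, 1 who fails.
labels = ["pass"] * 9 + ["fail"]
rows = [[i] for i in range(10)]

# Rebalancing: both classes end up with 9 examples.
bal_rows, bal_labels = random_oversample(rows, labels)
balanced_counts = Counter(bal_labels)

# A trivial classifier that always predicts "pass" looks accurate
# on the original data but never identifies a failing student.
always_pass = ["pass"] * len(labels)
accuracy = sum(t == p for t, p in zip(labels, always_pass)) / len(labels)

# Assumed costs: missing a failing student is 5x worse than a false alarm.
cost = {("fail", "pass"): 5.0, ("pass", "fail"): 1.0}
avg_cost = expected_cost(labels, always_pass, cost)

print(balanced_counts, accuracy, avg_cost)
```

On this toy data the always-"pass" classifier reaches 90% accuracy while incurring an average cost of 0.5 per student, which shows why plain accuracy can hide poor performance on the minority class of students who fail.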