Improving software-quality predictions with data sampling and boosting

  • Authors:
  • Chris Seiffert;Taghi M. Khoshgoftaar;Jason Van Hulse

  • Affiliations:
  • Florida Atlantic University, Boca Raton, FL;Florida Atlantic University, Boca Raton, FL;Florida Atlantic University, Boca Raton, FL

  • Venue:
  • IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans
  • Year:
  • 2009

Abstract

Software-quality data sets tend to fall victim to the class-imbalance problem that plagues so many other application domains. The majority of faults in a software system, particularly high-assurance systems, usually lie in a very small percentage of the software modules. This imbalance between the number of fault-prone (fp) and non-fp (nfp) modules can have a severely negative impact on a data-mining technique's ability to differentiate between the two. This paper addresses the class-imbalance problem as it pertains to the domain of software-quality prediction. We present a comprehensive empirical study examining two different methodologies, data sampling and boosting, for improving the performance of decision-tree models designed to identify fp software modules. This paper applies five data-sampling techniques and boosting to 15 software-quality data sets of different sizes and levels of imbalance. Nearly 50 000 models were built for the experiments contained in this paper. Our results show that while data-sampling techniques are very effective in improving the performance of such models, boosting almost always outperforms even the best data-sampling techniques. This significant result, which, to our knowledge, has not been previously reported, has important consequences for practitioners developing software-quality classification models.
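The abstract contrasts two remedies for class imbalance: rebalancing the training data before fitting a decision tree, and boosting the tree learner directly. The sketch below is not the paper's experimental setup; it is a minimal illustration assuming Python with scikit-learn and imbalanced-learn, where synthetic data and SMOTE stand in for the software-quality data sets and the five sampling techniques studied in the paper.

```python
# Minimal sketch: data sampling vs. boosting on an imbalanced binary task.
# Synthetic data and SMOTE are stand-ins, not the authors' data or methods.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE

# Imbalanced data: roughly 5% minority ("fault-prone") examples.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Data sampling: oversample the minority class, then fit a plain decision tree.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
tree = DecisionTreeClassifier(random_state=0).fit(X_bal, y_bal)

# Boosting: AdaBoost over shallow decision trees on the original imbalanced data.
boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                           n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Compare with a threshold-independent metric suited to imbalanced data.
for name, model in [("sampling + tree", tree), ("boosting", boost)]:
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```

Evaluating with AUC rather than accuracy matters here: with only a few percent fault-prone modules, a model that predicts every module as non-fault-prone scores high accuracy while being useless for the task the paper addresses.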