Predicting high-risk program modules by selecting the right software measurements

Authors:
Kehan Gao;Taghi M. Khoshgoftaar;Naeem Seliya
Affiliations:
Eastern Connecticut State University, Willimantic, USA 06226;Florida Atlantic University, Boca Raton, USA 33431;University of Michigan-Dearborn, Dearborn, USA 48128
Venue:
Software Quality Control
Year:
2012

Abstract

Timely detection of high-risk program modules in high-assurance software is critical for avoiding the severe consequences of operational failures. While software risk can originate from external sources, such as management or outsourcing, software quality is adversely affected when internal software risks, such as improper practice of standard software processes or the lack of a defined software quality infrastructure, are realized. Practitioners employ various techniques to identify and rectify high-risk or low-quality program modules. The effectiveness of detecting such modules depends on the software measurements used, which makes feature selection an important step in software quality prediction. We use a wrapper-based feature ranking technique to select the optimal set of software metrics for building defect prediction models. We also address the adverse effects of class imbalance (very few low-quality modules compared to high-quality modules), a practical problem observed in high-assurance systems. Applying a data sampling technique followed by feature selection is a relatively novel contribution of our work. We present a comprehensive investigation of the impact of data sampling followed by attribute selection on defect predictors built with imbalanced data. The case study data are obtained from several real-world high-assurance software projects. The key results are that attribute selection is more efficient when applied after data sampling, and that defect prediction performance generally improves after applying data sampling and feature selection.
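
The pipeline described in the abstract (data sampling first, then wrapper-based feature ranking) can be illustrated with a minimal Python sketch. The abstract does not name the sampling technique, the learner, or the performance metric, so everything concrete here is an assumption chosen for illustration: random undersampling, a toy nearest-centroid learner trained on one metric at a time, cross-validated AUC as the ranking score, and a synthetic two-metric data set. The function names (`undersample`, `wrapper_rank`, and so on) are hypothetical and are not taken from the paper.

```python
import random

def undersample(rows, labels, seed=0):
    """Random undersampling (assumed technique): drop majority rows until balanced."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in keep], [labels[i] for i in keep]

def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def centroid_scores(train_x, train_y, test_x):
    """Toy one-metric learner (assumption): higher score = closer to the fault-prone mean."""
    mu1 = sum(x for x, y in zip(train_x, train_y) if y == 1) / train_y.count(1)
    mu0 = sum(x for x, y in zip(train_x, train_y) if y == 0) / train_y.count(0)
    return [abs(x - mu0) - abs(x - mu1) for x in test_x]

def wrapper_rank(rows, labels, folds=5):
    """Rank each metric by the cross-validated AUC of a learner built on it alone."""
    fold_of = {}
    for cls in (1, 0):  # stratified folds: every fold contains both classes
        members = [i for i, y in enumerate(labels) if y == cls]
        for rank, i in enumerate(members):
            fold_of[i] = rank % folds
    ranking = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        fold_aucs = []
        for k in range(folds):
            train = [i for i in range(len(col)) if fold_of[i] != k]
            test = [i for i in range(len(col)) if fold_of[i] == k]
            scores = centroid_scores([col[i] for i in train],
                                     [labels[i] for i in train],
                                     [col[i] for i in test])
            fold_aucs.append(auc(scores, [labels[i] for i in test]))
        ranking.append((j, sum(fold_aucs) / folds))
    return sorted(ranking, key=lambda t: t[1], reverse=True)

# Synthetic data (assumption): 100 modules, 10% fault-prone.
# Metric 0 separates the classes; metric 1 is unrelated to the label.
labels = [1 if i % 10 == 0 else 0 for i in range(100)]
rows = [[10 * y + (i // 10) % 5, (i * 7) % 11] for i, y in enumerate(labels)]

xb, yb = undersample(rows, labels)   # step 1: data sampling
ranking = wrapper_rank(xb, yb)       # step 2: feature ranking on the sampled data
print(ranking)                       # list of (metric index, mean AUC), best first
```

Running the ranking on the sampled data rather than the original data mirrors the order of operations studied in the paper. In practice the toy learner would be replaced by the classifier actually used for defect prediction, and the top-ranked metrics would then be used to train the final model.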