Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem

Authors:
Cagatay Catal;Banu Diri
Affiliations:
TUBITAK-Marmara Research Center, Information Technologies Institute, Gebze, Kocaeli 41470, Turkey;TUBITAK-Marmara Research Center, Information Technologies Institute, Gebze, Kocaeli 41470, Turkey
Venue:
Information Sciences: an International Journal
Year:
2009


Abstract

Software quality engineering comprises several quality assurance activities, such as testing, formal verification, inspection, fault tolerance, and software fault prediction. Many researchers have developed and validated fault prediction models using machine learning and statistical techniques. Different kinds of software metrics and diverse feature reduction techniques have been used to improve the models' performance. However, these studies did not investigate the effect of dataset size, metrics set, and feature selection techniques on software fault prediction. This study focuses on high-performance fault predictors based on machine learning, such as Random Forests, and on algorithms based on a new computational intelligence approach called Artificial Immune Systems. We used public NASA datasets from the PROMISE repository to make our predictive models repeatable, refutable, and verifiable. The research questions concerned the effects of dataset size, metrics set, and feature selection techniques. To answer these questions, seven test groups were defined. Additionally, nine classifiers were examined for each of the five public NASA datasets. According to this study, Random Forests provides the best prediction performance for large datasets, and Naive Bayes is the best prediction algorithm for small datasets, in terms of the Area Under the Receiver Operating Characteristic Curve (AUC) evaluation parameter. The parallel implementation of the Artificial Immune Recognition System algorithm (AIRS2Parallel) is the best Artificial Immune Systems paradigm-based algorithm when method-level metrics are used.
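
The abstract compares classifiers by AUC. As a minimal sketch of what that evaluation parameter measures, and not the authors' implementation, AUC can be computed as the fraction of (faulty, non-faulty) module pairs in which the faulty module receives the higher predicted score, with ties counted as half. The function name `auc` and the sample labels and scores below are hypothetical and chosen only for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for a faulty module, 0 for a non-faulty module (assumed encoding).
    scores: predicted fault-proneness, higher meaning more likely faulty.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one faulty and one non-faulty module")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # faulty module correctly ranked above non-faulty
            elif p == n:
                wins += 0.5   # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical predictions for five modules (illustrative data only).
example_labels = [1, 0, 1, 0, 0]
example_scores = [0.9, 0.4, 0.35, 0.2, 0.8]
example_auc = auc(example_labels, example_scores)  # 4 of 6 pairs ranked correctly
```

An AUC of 1.0 means every faulty module is ranked above every non-faulty one, while 0.5 corresponds to random ranking. The pairwise loop is quadratic in the number of modules, which is acceptable for a sketch but would normally be replaced by a rank-based computation on larger datasets.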