Software quality estimation with limited fault data: a semi-supervised learning perspective

Authors:
Naeem Seliya; Taghi M. Khoshgoftaar
Affiliations:
Computer and Information Science, University of Michigan-Dearborn, Dearborn, USA 48128; Computer Science and Engineering, Florida Atlantic University, Boca Raton, USA 33431
Venue:
Software Quality Control
Year:
2007

Abstract

We address the important problem of software quality analysis when there is limited software fault or fault-proneness data. A software quality model is typically trained using software measurement and fault data obtained from a previous release or a similar project. Such an approach assumes that fault data is available for all the training modules. In practice, various issues in software development may limit the availability of fault-proneness data to only some of the training modules. Consequently, a software quality model trained on the available labeled dataset may not provide reliable predictions. More specifically, the small set of modules with known fault-proneness labels is not sufficient for capturing the software quality trends of the project. We investigate semi-supervised learning with the Expectation Maximization (EM) algorithm for software quality estimation with limited fault-proneness data. The hypothesis is that knowledge stored in the software attributes of the unlabeled program modules will aid in improving software quality estimation. Software data collected from a large NASA software project is used during the semi-supervised learning process. The software quality model is evaluated with multiple test datasets collected from other NASA software projects. Compared to software quality models trained only with the available set of labeled program modules, the EM-based semi-supervised learning scheme improves the generalization performance of the software quality models.
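
To make the general idea concrete, the following is a minimal sketch of EM-based semi-supervised classification. It is not the authors' implementation: the abstract does not name the base classifier, so a Gaussian naive Bayes model is assumed here, and the function names (`fit_gnb`, `posteriors`, `em_semi_supervised`, `predict`), the label encoding (0 = not fault-prone, 1 = fault-prone), and the variance floor are all hypothetical choices made for illustration. Labeled modules keep their known labels, while unlabeled modules contribute through soft class memberships that are re-estimated in each iteration.

```python
import numpy as np


def fit_gnb(X, R, var_floor=1e-3, eps=1e-9):
    """M-step: weighted Gaussian naive Bayes parameters.

    X is (n_modules, n_metrics); R is (n_modules, n_classes) and holds each
    module's class membership (one-hot for labeled, soft for unlabeled).
    """
    Nk = R.sum(axis=0) + eps                     # effective module count per class
    priors = Nk / Nk.sum()
    means = (R.T @ X) / Nk[:, None]
    second_moment = (R.T @ (X ** 2)) / Nk[:, None]
    # Floor the variance so a feature never collapses to zero spread (assumed value).
    var = np.maximum(second_moment - means ** 2, var_floor)
    return priors, means, var


def posteriors(X, priors, means, var):
    """E-step: class posterior probabilities under the current parameters."""
    diff = X[:, None, :] - means[None]           # (n, k, d)
    log_lik = -0.5 * (np.log(2 * np.pi * var)[None] + diff ** 2 / var[None]).sum(axis=2)
    log_post = log_lik + np.log(priors)[None]
    log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(log_post)
    return P / P.sum(axis=1, keepdims=True)


def em_semi_supervised(X_lab, y_lab, X_unl, n_classes=2, n_iter=20):
    """Start from the labeled modules only, then alternate E- and M-steps."""
    R_lab = np.eye(n_classes)[y_lab]             # known labels stay fixed
    params = fit_gnb(X_lab, R_lab)
    X_all = np.vstack([X_lab, X_unl])
    for _ in range(n_iter):
        R_unl = posteriors(X_unl, *params)                   # E-step on unlabeled data
        params = fit_gnb(X_all, np.vstack([R_lab, R_unl]))   # M-step on all data
    return params


def predict(X, params):
    """Assign each module to its most probable class."""
    return posteriors(X, *params).argmax(axis=1)
```

In this sketch the model fitted before the loop corresponds to the "labeled data only" baseline mentioned in the abstract, so comparing its predictions on held-out modules with those of the final parameters gives a rough analogue of the comparison the authors describe.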