An iterative semi-supervised approach to software fault prediction

Authors:
Huihua Lu; Bojan Cukic; Mark Culp
Affiliations:
West Virginia University, Morgantown, WV (all authors)
Venue:
Proceedings of the 7th International Conference on Predictive Models in Software Engineering
Year:
2011

Abstract

Background: Many statistical and machine learning techniques have been used to build predictive fault models. Traditional methods are based on supervised learning: software metrics for a module and the corresponding fault information, available from previous projects, are used to train a fault prediction model. This approach calls for a large training data set, which enables the development of effective fault prediction models. In practice, data collection costs or the lack of data from earlier projects or product versions may make a large training data set unattainable. The small training set that may be available from the current project is known to degrade the performance of the fault prediction model. In semi-supervised learning approaches, software modules with both known and unknown fault content can be used for training.
Aims: To implement and evaluate a semi-supervised learning approach to software fault prediction.
Methods: We investigate an iterative semi-supervised approach to software quality prediction in which a base supervised learner is used within a semi-supervised application.
Results: We varied the proportion of labeled software modules from 2% to 50% of all the modules in the project. After tracking the performance of each iteration of the semi-supervised algorithm, we observe that semi-supervised learning improves fault prediction if the proportion of initially labeled software modules exceeds 5%.
Conclusion: The semi-supervised approach outperforms the corresponding supervised learning approach when both use random forest as the base classification algorithm.
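To make the idea of wrapping a base supervised learner in an iterative semi-supervised loop more concrete, here is a minimal sketch. It is not the authors' algorithm: the abstract does not give its details, so this uses a generic pseudo-labeling loop that refits until the labels assigned to unlabeled modules stop changing. A nearest-centroid classifier stands in for random forest so that the example needs only the standard library, and all function names, metrics, and data values are hypothetical.

```python
from statistics import mean


def fit_centroids(X, y):
    """Stand-in base learner (NOT a random forest): one centroid per class."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, X):
    """Assign each module to the class whose centroid is nearest."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [min(centroids, key=lambda c: dist2(x, centroids[c])) for x in X]


def iterative_semi_supervised(X_lab, y_lab, X_unl, max_iter=20):
    """Generic pseudo-labeling loop (an assumption, not the paper's method)."""
    # Start from a purely supervised fit on the labeled modules.
    model = fit_centroids(X_lab, y_lab)
    pseudo = predict(model, X_unl)
    iterations = 0
    for _ in range(max_iter):
        iterations += 1
        # Refit on labeled modules plus the current pseudo-labeled modules.
        model = fit_centroids(X_lab + X_unl, y_lab + pseudo)
        new_pseudo = predict(model, X_unl)
        if new_pseudo == pseudo:  # pseudo-labels are stable: stop
            break
        pseudo = new_pseudo
    return model, pseudo, iterations


# Hypothetical modules described by two metrics: [lines of code, complexity].
# Label 1 means fault-prone, 0 means not fault-prone.
X_labeled = [[10, 1], [12, 2], [200, 15], [220, 18]]
y_labeled = [0, 0, 1, 1]
X_unlabeled = [[15, 2], [180, 14], [20, 3], [210, 16]]

model, pseudo_labels, n_iter = iterative_semi_supervised(
    X_labeled, y_labeled, X_unlabeled
)
```

Swapping `fit_centroids` and `predict` for a random forest implementation would bring the sketch closer to the setting described in the abstract, where the supervised and semi-supervised variants share the same base classifier.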