Ensemble of missing data techniques to improve software prediction accuracy

Authors:
Bhekisipho Twala;Michelle Cartwright;Martin Shepperd
Affiliations:
Brunel University, United Kingdom;Brunel University, United Kingdom;Brunel University, United Kingdom
Venue:
Proceedings of the 28th international conference on Software engineering
Year:
2006

Works cited by this paper (10)

C4.5: programs for machine learning
Bagging predictors. Machine Learning
Validating the ISO/IEC 15504 measures of software development process capability. Journal of Systems and Software
Software Cost Estimation with Incomplete Data. IEEE Transactions on Software Engineering
Analyzing Data Sets with Missing Data: An Empirical Evaluation of Imputation Methods and Likelihood-Based Methods. IEEE Transactions on Software Engineering - Special section on the seventh international software metrics symposium
Imputation of Missing Data in Industrial Databases. Applied Intelligence
Dealing with Missing Software Project Data. METRICS '03 Proceedings of the 9th International Symposium on Software Metrics
An Evaluation of k-Nearest Neighbour Imputation Using Likert Data. METRICS '04 Proceedings of the Software Metrics, 10th International Symposium
A Short Note on Safest Default Missingness Mechanism Assumptions. Empirical Software Engineering
Ensemble Imputation Methods for Missing Software Engineering Data. METRICS '05 Proceedings of the 11th IEEE International Software Metrics Symposium

Works citing this paper (3)

Sensitivity of results to different data quality meta-data criteria in the sample selection of projects from the ISBSG dataset. Proceedings of the 6th International Conference on Predictive Models in Software Engineering
An industrial case study of classifier ensembles for locating software defects. Software Quality Control
An algorithmic approach to missing data problem in modeling human aspects in software development. Proceedings of the 9th International Conference on Predictive Models in Software Engineering

Abstract

Software engineers are commonly faced with the problem of incomplete data. Incomplete data can reduce system performance in terms of predictive accuracy. Unfortunately, little research has been conducted to systematically explore the impact of missing values, especially from the missing data handling point of view. As a result, the various missing data techniques (MDTs) have received less attention than they merit. This paper describes a systematic comparison of seven MDTs using eight industrial datasets. Our findings from an empirical evaluation suggest that listwise deletion is the least effective technique for handling incomplete data, while multiple imputation achieves the highest accuracy rates. We further propose, and show how, a combination of MDTs built by randomizing a decision tree building algorithm leads to a significant improvement in prediction performance when up to 50% of values are missing.