A Short Note on Safest Default Missingness Mechanism Assumptions

Authors:
Qinbao Song;Martin Shepperd;Michelle Cartwright
Affiliations:
Empirical Software Engineering Research Group, School of Design, Engineering and Computing, Bournemouth University, UK (all authors)
Venue:
Empirical Software Engineering
Year:
2005

Abstract

A very common problem when building software engineering models is dealing with missing data. A range of imputation techniques exists to address this, but selecting an appropriate one can itself be difficult. One reason is that these techniques make assumptions about the underlying missingness mechanism, that is, how the missing values are distributed within the data set. The difficulty is compounded for small data sets, where it may be very hard to determine what the missingness mechanism is, so there is a danger of using an inappropriate imputation technique. It is therefore necessary to determine the safest default assumption about the missingness mechanism when applying imputation techniques to small data sets. We experimentally examine two simple and commonly used techniques, Class Mean Imputation (CMI) and k-Nearest Neighbors (k-NN), coupled with two missingness mechanisms: missing completely at random (MCAR) and missing at random (MAR). We draw two conclusions. First, for our analysis CMI is the preferred technique since it is more accurate. Second, and more importantly, the impact of the missingness mechanism on imputation accuracy is not statistically significant. This is a useful finding since it suggests that even for small data sets we can reasonably make the weaker assumption that the missingness mechanism is MAR. Thus both imputation techniques have practical application for small software engineering data sets with missing values.
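
As a rough illustration of the two mechanisms and two techniques named in the abstract, the following Python sketch blanks out one column under MCAR or MAR and then fills it in with CMI or k-NN. It is not the authors' experimental setup: the function names, the `driver` column, the median-based rule used to simulate MAR, and the unscaled Euclidean distance are all assumptions made for illustration, and only a single column is allowed to contain missing values.

```python
import math
import random
from statistics import mean


def mask_mcar(rows, col, rate, rng):
    """MCAR: every row loses rows[i][col] with the same probability `rate`."""
    return [[None if j == col and rng.random() < rate else v
             for j, v in enumerate(row)] for row in rows]


def mask_mar(rows, col, driver, rate, rng):
    """MAR (illustrative rule): only rows whose observed `driver` value is at
    or above the median can lose rows[i][col], with probability 2 * rate."""
    cutoff = sorted(r[driver] for r in rows)[len(rows) // 2]
    out = []
    for row in rows:
        p = min(1.0, 2 * rate) if row[driver] >= cutoff else 0.0
        out.append([None if j == col and rng.random() < p else v
                    for j, v in enumerate(row)])
    return out


def class_mean_impute(rows, labels, col):
    """CMI: fill a missing value with the mean of the observed values in the
    same class, falling back to the overall mean if the class has none."""
    observed = [(lab, r[col]) for r, lab in zip(rows, labels)
                if r[col] is not None]
    overall = mean(v for _, v in observed)
    by_class = {}
    for lab, v in observed:
        by_class.setdefault(lab, []).append(v)
    return [[(mean(by_class[lab]) if lab in by_class else overall)
             if j == col and v is None else v
             for j, v in enumerate(r)] for r, lab in zip(rows, labels)]


def knn_impute(rows, col, k):
    """k-NN: fill a missing value with the mean of `col` over the k complete
    rows nearest in Euclidean distance on the remaining columns."""
    donors = [r for r in rows if r[col] is not None]

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2
                             for j, (x, y) in enumerate(zip(a, b)) if j != col))

    out = []
    for r in rows:
        if r[col] is None:
            nearest = sorted(donors, key=lambda d: dist(r, d))[:k]
            r = r[:col] + [mean(d[col] for d in nearest)] + r[col + 1:]
        out.append(list(r))
    return out


def mean_abs_error(truth, masked, imputed, col):
    """Mean absolute error over the cells that were actually blanked."""
    errs = [abs(t[col] - i[col])
            for t, m, i in zip(truth, masked, imputed) if m[col] is None]
    return mean(errs) if errs else 0.0
```

One way to use the sketch is to mask a complete data set with `mask_mcar` and with `mask_mar` at the same rate, impute each result with both techniques, and compare the values returned by `mean_abs_error`. That mirrors the kind of comparison the abstract describes, without reproducing its data or its statistical tests.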