A very common problem when building software engineering models is dealing with missing data. A range of imputation techniques exists to address it. However, selecting an appropriate imputation technique can itself be difficult. One reason is that these techniques make assumptions about the underlying missingness mechanism, that is, how the missing values are distributed within the data set. The difficulty is compounded by the fact that, for small data sets, it may be very hard to determine what the missingness mechanism is, so there is a danger of using an inappropriate imputation technique. It is therefore necessary to determine the safest default assumption about the missingness mechanism when applying imputation techniques to small data sets. We examine experimentally two simple and commonly used techniques, Class Mean Imputation (CMI) and k Nearest Neighbors (k-NN), coupled with two missingness mechanisms: missing completely at random (MCAR) and missing at random (MAR). We draw two conclusions. First, for our analysis CMI is the preferred technique since it is more accurate. Second, and more importantly, the impact of the missingness mechanism on imputation accuracy is not statistically significant. This is a useful finding since it suggests that even for small data sets we can reasonably make the weaker assumption that the missingness mechanism is MAR. Thus both imputation techniques have practical application for small software engineering data sets with missing values.
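To make the two techniques and the two mechanisms concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract gives no algorithmic details, so the sketch assumes a single numeric column with missing values (marked None), complete remaining columns, Euclidean distance for k-NN, and a simple median-based rule for MAR. The function names, the size/effort columns, and the toy data are all hypothetical.

```python
import math
import random
import statistics


def class_mean_impute(rows, labels, col):
    """Fill None in column `col` with the mean of the observed values
    from rows that share the same class label."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        if row[col] is not None:
            sums[label] = sums.get(label, 0.0) + row[col]
            counts[label] = counts.get(label, 0) + 1
    filled = []
    for row, label in zip(rows, labels):
        new_row = list(row)
        # Leave the gap in place if the class has no observed value at all.
        if new_row[col] is None and label in counts:
            new_row[col] = sums[label] / counts[label]
        filled.append(new_row)
    return filled


def knn_impute(rows, col, k=2):
    """Fill None in column `col` with the mean of that column over the
    k nearest complete rows (Euclidean distance on the other columns)."""
    donors = [r for r in rows if r[col] is not None]

    def dist(a, b):
        return math.sqrt(
            sum((a[j] - b[j]) ** 2 for j in range(len(a)) if j != col)
        )

    filled = []
    for row in rows:
        new_row = list(row)
        if new_row[col] is None and donors:
            nearest = sorted(donors, key=lambda d: dist(row, d))[:k]
            new_row[col] = sum(d[col] for d in nearest) / len(nearest)
        filled.append(new_row)
    return filled


def mcar_mask(n, rate, rng):
    """MCAR: every row is equally likely to be missing, regardless of its values."""
    return [rng.random() < rate for _ in range(n)]


def mar_mask(rows, observed_col, rate, rng):
    """MAR (assumed rule): missingness depends only on a fully observed column.
    Rows above the median of `observed_col` may be missing; the rest never are."""
    median = statistics.median(r[observed_col] for r in rows)
    p = min(1.0, 2 * rate)  # roughly half the rows are eligible
    return [r[observed_col] > median and rng.random() < p for r in rows]


# Hypothetical toy data: [size, effort], with effort missing in two projects.
rows = [
    [10.0, 100.0], [12.0, None], [11.0, 120.0],
    [50.0, 500.0], [52.0, None], [55.0, 540.0],
]
labels = ["small", "small", "small", "large", "large", "large"]

cmi_filled = class_mean_impute(rows, labels, col=1)
knn_filled = knn_impute(rows, col=1, k=2)
mar = mar_mask(rows, observed_col=0, rate=0.5, rng=random.Random(0))
```

On this toy data both techniques fill the first gap with 110.0 and the second with 520.0, because each incomplete project's two nearest neighbours happen to be the observed members of its own class. On real data the two estimates generally differ, which is what makes an accuracy comparison between them meaningful.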