Benchmarking k-nearest neighbour imputation with homogeneous Likert data

Authors:
Per Jönsson;Claes Wohlin
Affiliations:
School of Engineering, Blekinge Institute of Technology, Ronneby, Sweden SE-372 25;School of Engineering, Blekinge Institute of Technology, Ronneby, Sweden SE-372 25
Venue:
Empirical Software Engineering
Year:
2006

Abstract

Missing data are common in surveys regardless of research field, undermining statistical analyses and biasing results. One solution is to use an imputation method, which recovers missing data by estimating replacement values. Previously, we evaluated the hot-deck k-Nearest Neighbour (k-NN) method with Likert data in a software engineering context. In this paper, we extend the evaluation by benchmarking the method against four other imputation methods: Random Draw Substitution, Random Imputation, Median Imputation and Mode Imputation. By simulating both non-response and imputation, we obtain comparable performance measures for all methods. We discuss the performance of k-NN in light of the other methods, and also for different values of k, different proportions of missing data, different neighbour selection strategies and different numbers of data attributes. Our results show that the k-NN method performs well, even when a large proportion of the data are missing, but faces strong competition from both Median Imputation and Mode Imputation for our particular data. However, unlike these two methods, k-NN performs better as the number of data attributes increases. We suggest that a suitable value of k is approximately the square root of the number of complete cases, and that letting certain incomplete cases qualify as neighbours boosts the imputation ability of the method.
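
To make the idea concrete, the following is a minimal sketch of k-NN imputation for Likert-style answers. It is not the authors' implementation. The function names (`knn_impute`, `distance`, `is_complete`), the `allow_incomplete` flag, the use of Euclidean distance over the attributes the incomplete case has answered, and the choice of the lower median of the neighbours' answers as the replacement value are all assumptions made for illustration. The default `k` follows the abstract's suggestion of roughly the square root of the number of complete cases. The flag is one possible reading of "letting certain incomplete cases qualify as neighbours": a case may donate if it has a value for the attribute being imputed and for every attribute the target case has answered.

```python
import math
from statistics import median_low

MISSING = None  # hypothetical marker for an unanswered Likert item


def is_complete(case):
    """True if the case has an answer for every attribute."""
    return all(v is not MISSING for v in case)


def distance(a, b, attrs):
    """Euclidean distance restricted to the attribute indices in attrs."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in attrs))


def knn_impute(cases, k=None, allow_incomplete=False):
    """Return a copy of cases with missing answers filled from k neighbours.

    Assumption: the replacement is the lower median of the neighbours'
    answers, so it is always a value that some neighbour actually gave.
    """
    complete = [c for c in cases if is_complete(c)]
    if k is None:
        # Rule of thumb from the abstract: k ~ sqrt(number of complete cases).
        k = max(1, round(math.sqrt(len(complete))))

    imputed = []
    for case in cases:
        new_case = list(case)
        observed = [i for i, v in enumerate(case) if v is not MISSING]
        for j, v in enumerate(case):
            if v is not MISSING:
                continue
            if allow_incomplete:
                # Donors need the target attribute plus everything the
                # incomplete case has answered (illustrative interpretation).
                donors = [
                    c for c in cases
                    if c[j] is not MISSING
                    and all(c[i] is not MISSING for i in observed)
                ]
            else:
                donors = complete
            nearest = sorted(donors, key=lambda c: distance(case, c, observed))[:k]
            if nearest:  # leave the value missing if no donor exists
                new_case[j] = median_low([c[j] for c in nearest])
        imputed.append(new_case)
    return imputed


# Toy data: six respondents, four 5-point Likert items, two missing answers.
cases = [
    [1, 2, 1, 2],
    [2, 2, 1, 1],
    [5, 4, 5, 4],
    [4, 5, 5, 5],
    [5, 5, 4, MISSING],
    [1, 1, MISSING, 2],
]
result = knn_impute(cases)
print(result[4], result[5])
```

With four complete cases the default is k = 2, so each missing answer is filled from the two most similar complete respondents. Passing `allow_incomplete=True` widens the donor pool, which matters mostly when complete cases are scarce.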