Benchmarking k-nearest neighbour imputation with homogeneous Likert data

  • Authors:
  • Per Jönsson;Claes Wohlin

  • Affiliations:
  • School of Engineering, Blekinge Institute of Technology, Ronneby, Sweden SE-372 25;School of Engineering, Blekinge Institute of Technology, Ronneby, Sweden SE-372 25

  • Venue:
  • Empirical Software Engineering
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

Missing data are common in surveys regardless of research field, undermining statistical analyses and biasing results. One solution is to use an imputation method, which recovers missing data by estimating replacement values. Previously, we have evaluated the hot-deck k-Nearest Neighbour (k-NN) method with Likert data in a software engineering context. In this paper, we extend the evaluation by benchmarking the method against four other imputation methods: Random Draw Substitution, Random Imputation, Median Imputation and Mode Imputation. By simulating both non-response and imputation, we obtain comparable performance measures for all methods. We discuss the performance of k-NN in the light of the other methods, but also for different values of k, different proportions of missing data, different neighbour selection strategies and different numbers of data attributes. Our results show that the k-NN method performs well, even when much data are missing, but has strong competition from both Median Imputation and Mode Imputation for our particular data. However, unlike these methods, k-NN has better performance with more data attributes. We suggest that a suitable value of k is approximately the square root of the number of complete cases, and that letting certain incomplete cases qualify as neighbours boosts the imputation ability of the method.