A Short Note on Safest Default Missingness Mechanism Assumptions

  • Authors:
  • Qinbao Song; Martin Shepperd; Michelle Cartwright

  • Affiliations:
  • Empirical Software Engineering Research Group, School of Design, Engineering and Computing, Bournemouth University, UK (all three authors)

  • Venue:
  • Empirical Software Engineering
  • Year:
  • 2005

Abstract

A very common problem when building software engineering models is dealing with missing data. A range of imputation techniques exists to address it, but selecting an appropriate technique can itself be difficult. One reason is that these techniques make assumptions about the underlying missingness mechanism, that is, how the missing values are distributed within the data set. The difficulty is compounded by the fact that, for small data sets, it may be very hard to determine what the missingness mechanism is, so there is a danger of using an inappropriate imputation technique. It is therefore necessary to determine the safest default assumption about the missingness mechanism for imputation techniques when dealing with small data sets. We experimentally examine two simple and commonly used techniques, Class Mean Imputation (CMI) and k Nearest Neighbors (k-NN), coupled with two missingness mechanisms: missing completely at random (MCAR) and missing at random (MAR). We draw two conclusions. First, for our analysis CMI is the preferred technique, since it is more accurate. Second, and more importantly, the impact of the missingness mechanism on imputation accuracy is not statistically significant. This is a useful finding, since it suggests that even for small data sets we can reasonably make the weaker assumption that the missingness mechanism is MAR. Thus both imputation techniques have practical application for small software engineering data sets with missing values.
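For readers unfamiliar with the two techniques named in the abstract, the following is a minimal Python sketch of how CMI and k-NN imputation are commonly formulated. It is not the authors' implementation: the abstract does not specify the distance measure, the value of k, or how classes are defined, so the Euclidean-style distance over shared observed columns, the default `k=2`, the function names `class_mean_impute` and `knn_impute`, and the toy data are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean


def class_mean_impute(rows, labels):
    """Fill each None with the mean of the observed values in the same
    column among records of the same class (hypothetical helper).
    Assumes every class has at least one observed value per column."""
    observed = {}
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            if value is not None:
                observed.setdefault((label, j), []).append(value)
    return [
        [v if v is not None else mean(observed[(label, j)])
         for j, v in enumerate(row)]
        for row, label in zip(rows, labels)
    ]


def knn_impute(rows, k=2):
    """Fill each None with the mean of that column over the k nearest
    records that have the column observed (hypothetical helper).
    Distance is a root mean squared difference over the columns both
    records have observed; this choice is an assumption."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_i, other in enumerate(rows):
                if other_i == i or other[j] is None:
                    continue
                shared = [c for c in range(len(row))
                          if c != j and row[c] is not None
                          and other[c] is not None]
                if not shared:
                    continue
                dist = sqrt(sum((row[c] - other[c]) ** 2 for c in shared)
                            / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            neighbours = [v for _, v in candidates[:k]]
            if neighbours:  # leave the gap if no usable neighbour exists
                filled[i][j] = mean(neighbours)
    return filled


# Toy data, invented for illustration: two columns, two classes.
rows = [
    [1.0, 10.0],
    [3.0, None],
    [2.0, 14.0],
    [10.0, 100.0],
    [12.0, None],
    [11.0, 120.0],
]
labels = ["small", "small", "small", "large", "large", "large"]

cmi_result = class_mean_impute(rows, labels)
knn_result = knn_impute(rows, k=2)
```

On this toy data both functions fill the two gaps with the same values, because the nearest neighbours of each incomplete record happen to lie in its own class. On real data the two techniques can disagree, which is the kind of accuracy difference the study compares.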