Statistical analysis with missing data
Information and Software Technology
Computational Statistics & Data Analysis
Clustering Algorithms
Clustering incomplete relational data using the non-Euclidean relational fuzzy c-means algorithm
Pattern Recognition Letters
Modeling with real-world data is often hampered by missing values, which limit the applicability and validity of the resulting model. Several algorithms in the literature support the analysis of incomplete data by imputing the missing values. However, their imputation accuracy and practical applicability have not been systematically compared, which makes it difficult to choose an appropriate imputation method. This paper presents an exploratory analysis of popular missing-data imputation algorithms. A new clustering-based imputation algorithm is also developed and shown to improve the efficiency of imputing missing values. The algorithms are benchmarked on datasets with widely varying statistical properties. Based on the empirical results and theoretical analysis, a set of guidelines is proposed to assist in selecting an appropriate imputation algorithm for a specific application. Finally, these guidelines are applied in a process modeling case study that analyzes the design of an atomizer. The imputed values were observed to be qualitatively valid, providing evidence for the appropriateness of the proposed guidelines.
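The abstract does not specify how imputation accuracy is measured. A common way to benchmark an imputation method, sketched here only as an illustration, is to hide entries of a complete dataset, impute them, and score the imputed values against the hidden ground truth. The sketch below assumes column-mean imputation as a simple stand-in (it is not the clustering-based algorithm proposed in the paper) and root-mean-square error as the metric; the function names `mean_impute`, `mask_entries`, and `rmse_on_hidden` are hypothetical.

```python
import math
import random


def mean_impute(rows):
    """Replace None entries with the mean of the observed values in that column.

    Assumption: column-mean imputation is a baseline stand-in, not the
    paper's clustering-based method.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        # Fall back to 0.0 if a column has no observed values at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def mask_entries(rows, fraction, seed=0):
    """Hide a random fraction of entries; return the masked copy and hidden positions."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    hidden = rng.sample(cells, int(len(cells) * fraction))
    masked = [list(r) for r in rows]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def rmse_on_hidden(truth, imputed, hidden):
    """Root-mean-square error between true and imputed values at the hidden positions."""
    squared = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in hidden]
    return math.sqrt(sum(squared) / len(squared))


# Toy complete dataset (illustrative values only, not from the paper).
complete = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
masked, hidden = mask_entries(complete, fraction=0.25)
imputed = mean_impute(masked)
score = rmse_on_hidden(complete, imputed, hidden)
print(f"hidden cells: {hidden}, RMSE: {score:.3f}")
```

Swapping `mean_impute` for another method while keeping the same mask and metric gives a like-for-like comparison, which is the general idea behind benchmarking several algorithms on datasets with different statistical properties.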