An experimental study on the use of nearest neighbor-based imputation algorithms for classification tasks

Authors:
Jonathan De Andrade Silva; Eduardo Raul Hruschka
Venue:
Data & Knowledge Engineering
Year:
2013

Abstract

The substitution of missing values, also called imputation, is an important data preparation task for data mining applications. Imputation algorithms have traditionally been compared in terms of the similarity between imputed and original values. However, this traditional approach, sometimes referred to as prediction ability, does not reveal how imputed values influence the ultimate modeling task (e.g., classification). Based on extensive experimental work, we study the influence of five nearest-neighbor-based imputation algorithms (KNNImpute, SKNN, IKNNImpute, KMI and EACImpute) and two simple algorithms widely used in practice (Mean Imputation and Majority Method) on classification problems. To assess these algorithms experimentally, missing values were simulated on six datasets by means of two missingness mechanisms: Missing Completely at Random (MCAR) and Missing at Random (MAR). Under MAR, the probability of missingness may depend on observed data but not on missing data; under MCAR, it does not depend on the observed data either. The quality of the imputed values is assessed by two measures: prediction ability and classification bias. Experimental results show that IKNNImpute outperforms the other algorithms under the MCAR mechanism, whereas KNNImpute, SKNN and EACImpute provided the best results under the MAR mechanism. Finally, our experiments also show that the best prediction results (in terms of mean squared error) do not necessarily lead to lower classification bias.
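
To make the two missingness mechanisms and the prediction-ability measure concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`mcar_mask`, `mar_mask`, `knn_impute`, `mean_impute`, `imputation_mse`) are hypothetical, the MAR rule (missingness in one column driven by whether another, observed column is above its median) is only one of many possible MAR designs, and `knn_impute` is a simplified nearest-neighbor imputer rather than any of the five algorithms named above. Classification bias is not sketched because the abstract does not define it.

```python
import numpy as np

def mcar_mask(X, rate, rng):
    """MCAR: each entry is hidden with the same probability, regardless of the data."""
    return rng.random(X.shape) < rate

def mar_mask(X, target_col, driver_col, rate, rng):
    """MAR (assumed design): entries of target_col are hidden with a probability
    that depends only on driver_col, which itself stays fully observed."""
    above = X[:, driver_col] > np.median(X[:, driver_col])
    prob = np.where(above, min(1.0, 2.0 * rate), 0.0)
    mask = np.zeros(X.shape, dtype=bool)
    mask[:, target_col] = rng.random(X.shape[0]) < prob
    return mask

def mean_impute(X_missing):
    """Baseline: replace each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(X_missing, axis=0)
    return np.where(np.isnan(X_missing), col_means, X_missing)

def knn_impute(X_missing, k=5):
    """Simplified nearest-neighbor imputation (illustrative, not KNNImpute itself).
    Each missing entry becomes the mean of that feature over the k rows closest
    to the incomplete row, measured on the features both rows have observed."""
    missing = np.isnan(X_missing)
    X = X_missing.copy()
    for i in np.where(missing.any(axis=1))[0]:
        for j in np.where(missing[i])[0]:
            # Candidate neighbors: rows where feature j is observed.
            cand = np.where(~missing[:, j])[0]
            shared = ~missing[i] & ~missing[cand]
            diff = np.where(shared, X_missing[cand] - X_missing[i], 0.0)
            n_shared = shared.sum(axis=1)
            # Rows sharing no observed feature get infinite distance.
            dist = np.full(len(cand), np.inf)
            ok = n_shared > 0
            dist[ok] = np.sqrt((diff[ok] ** 2).sum(axis=1) / n_shared[ok])
            nearest = cand[np.argsort(dist)[:k]]
            X[i, j] = X_missing[nearest, j].mean()
    return X

def imputation_mse(X_true, X_imputed, mask):
    """Prediction ability: mean squared error over the entries that were hidden."""
    return float(np.mean((X_true[mask] - X_imputed[mask]) ** 2))
```

A typical use would hide entries of a complete numeric dataset with one of the mask functions, set those entries to `np.nan`, impute, and compare `imputation_mse` across imputers. Distances are always computed on the original incomplete data, so earlier imputations do not influence later ones.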