Impact of imputation of missing values on classification error for discrete data

Authors:
Alireza Farhangfar;Lukasz Kurgan;Jennifer Dy
Affiliations:
Department of Computing Sciences, University of Alberta, Edmonton, Alberta, Canada;Department of Electrical and Computer Engineering, ECERF, 9107-116 Street, University of Alberta, Edmonton, Alberta, Canada T6G 2V4;Department of Electrical and Computer Engineering, Northeastern University, Boston, MA 02115, USA
Venue:
Pattern Recognition
Year:
2008

Abstract

Numerous industrial and research databases include missing values. It is not uncommon to encounter databases in which up to half of the entries are missing, which makes them very difficult to mine with data analysis methods that work only with complete data. A common way of dealing with this problem is to impute (fill in) the missing values. This paper evaluates how the choice of imputation method affects the performance of classifiers that are subsequently applied to the imputed data. The experiments focus on discrete data. The paper studies the effect of missing data imputation using five single imputation methods (a mean method, a hot deck method, a Naïve-Bayes method, and the latter two methods combined with a recently proposed imputation framework) and one multiple imputation method (a polytomous regression based method) on the classification accuracy of six popular classifiers (RIPPER, C4.5, K-nearest-neighbor, support vector machine with polynomial and RBF kernels, and Naïve-Bayes) on 15 datasets. The experimental study shows that imputation with the tested methods on average improves classification accuracy compared to classification without imputation. Although the results show that there is no universally best imputation method, Naïve-Bayes imputation gives the best results for the RIPPER classifier on datasets with a high amount (i.e., 40% and 50%) of missing data, polytomous regression imputation is the best for the support vector machine classifier with polynomial kernel, and the application of the imputation framework is superior for the support vector machine with RBF kernel and K-nearest-neighbor. The analysis of the quality of imputation with respect to varying amounts of missing data (i.e., between 5% and 50%) shows that all imputation methods, except for mean imputation, reduce classification error for data with more than 10% of missing data. Finally, some classifiers, such as C4.5 and Naïve-Bayes, were found to be missing data resistant, i.e., they can produce accurate classification in the presence of missing data, while other classifiers, such as K-nearest-neighbor, SVMs and RIPPER, benefit from imputation.
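
To make two of the single imputation methods named in the abstract concrete, the following is a minimal Python sketch for discrete data. It is not the authors' implementation. It rests on several assumptions: missing entries are marked with `None`, the "mean" of a discrete attribute is taken to be its most frequent observed value, and the hot deck donor is the complete record that agrees with the incomplete record on the largest number of observed attributes. The function names `mode_impute` and `hot_deck_impute` and the toy dataset are hypothetical.

```python
from collections import Counter

MISSING = None  # assumed marker for a missing entry


def mode_impute(rows):
    """Fill each missing entry with the most frequent observed value of its column."""
    n_cols = len(rows[0])
    modes = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not MISSING]
        # A column with no observed values cannot be imputed and stays missing.
        modes.append(Counter(observed).most_common(1)[0][0] if observed else MISSING)
    return [[modes[j] if v is MISSING else v for j, v in enumerate(r)] for r in rows]


def hot_deck_impute(rows):
    """Fill missing entries from the most similar complete row (the donor)."""
    donors = [r for r in rows if MISSING not in r]
    if not donors:
        return [list(r) for r in rows]

    def matches(row, donor):
        # Similarity = number of observed attributes on which the two rows agree.
        return sum(1 for a, b in zip(row, donor) if a is not MISSING and a == b)

    result = []
    for r in rows:
        if MISSING in r:
            # max() keeps the first donor among equally similar ones.
            donor = max(donors, key=lambda d: matches(r, d))
            r = [d if v is MISSING else v for v, d in zip(r, donor)]
        result.append(list(r))
    return result


# Hypothetical toy dataset: two discrete attributes followed by a class label.
data = [
    ["sunny", "hot", "no"],
    ["sunny", None, "no"],
    ["rainy", "mild", "yes"],
    ["rainy", "mild", None],
    ["sunny", "mild", "yes"],
    ["rainy", "hot", "yes"],
]

mode_filled = mode_impute(data)
hot_deck_filled = hot_deck_impute(data)
print(mode_filled[1], hot_deck_filled[1])
```

On this toy data the two methods disagree on the second record: the mode fills the gap with the globally most frequent value, while the hot deck copies the value from the most similar complete record. Differences of this kind are what a downstream classifier sees after imputation.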