On the influence of imputation in classification: practical issues

  • Authors:
  • Eduardo R. Hruschka
  • Antonio J. T. Garcia
  • Estevam R. Hruschka, Jr
  • Nelson F. F. Ebecken

  • Affiliations:
  • Computer Science Department, University of Sao Paulo (USP), Sao Carlos, Brazil
  • IBM Software Group, Sao Paulo, Brazil
  • Computer Science Department, Federal University of Sao Carlos (UFSCAR), Sao Carlos, Brazil
  • COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil

  • Venue:
  • Journal of Experimental & Theoretical Artificial Intelligence
  • Year:
  • 2009

Abstract

The substitution of missing values, also called imputation, is an important data preparation task in many domains. Ideally, imputation should not insert biases into the dataset. This aspect has usually been assessed with measures of the prediction capability of imputation methods. Such measures rely on simulating missing entries for attributes whose values are actually known; the artificially missing values are imputed and then compared with the original values. Although this evaluation is useful, it does not allow the influence of the imputed values on the ultimate modelling task (e.g. classification) to be inferred. We argue that imputation cannot be properly evaluated apart from the modelling task, so alternative approaches are needed. This article elaborates on the influence of imputed values in classification. In particular, a practical procedure for estimating the inserted bias is described. As an additional contribution, we have used this procedure to empirically illustrate the performance of three imputation methods (majority, naive Bayes and Bayesian networks) on three datasets. Three classifiers (decision tree, naive Bayes and nearest neighbours) have been used as modelling tools in our experiments. The achieved results illustrate a variety of situations that can arise in data preparation practice.
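The evaluation procedure the abstract outlines can be illustrated with a minimal sketch: simulate missing entries for an attribute whose values are known, impute them, then measure both the prediction capability of the imputation (imputed vs. original values) and its influence on classification (classifier accuracy with original vs. imputed data). The sketch below uses majority (mode) imputation and a 1-nearest-neighbour classifier; the toy dataset, the 20% missingness rate, and the resubstitution accuracy estimate are illustrative assumptions, not the paper's actual experimental setup.

```python
import random
from collections import Counter

def impute_majority(rows, col):
    """Majority imputation: replace None in `col` with the most frequent observed value."""
    observed = [r[col] for r in rows if r[col] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [[mode if (j == col and v is None) else v for j, v in enumerate(r)]
            for r in rows]

def nn_accuracy(train, test):
    """1-nearest-neighbour accuracy; the last element of each row is the class label."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a[:-1], b[:-1]))
    hits = sum(min(train, key=lambda r: dist(r, t))[-1] == t[-1] for t in test)
    return hits / len(test)

random.seed(0)
# Toy two-class dataset: a discrete attribute correlated with the class,
# a continuous attribute, and the class label (hypothetical data).
data = []
for c in (0, 1):
    for _ in range(30):
        f0 = c if random.random() < 0.8 else 1 - c
        f1 = random.gauss(c * 3.0, 1.0)
        data.append([f0, f1, c])

# Step 1: simulate missingness on attribute 0 for 20% of rows (true values are known).
idx = random.sample(range(len(data)), k=len(data) // 5)
original = [row[0] for row in data]
damaged = [([None] + row[1:]) if i in idx else list(row)
           for i, row in enumerate(data)]

# Step 2: impute the artificially missing entries.
imputed = impute_majority(damaged, col=0)

# Step 3a: prediction capability -- error between imputed and original values.
mae = sum(abs(imputed[i][0] - original[i]) for i in idx) / len(idx)

# Step 3b: influence on classification -- accuracy with original vs. imputed data.
# (Resubstitution is used here for brevity; a real study would use held-out folds.)
acc_original = nn_accuracy(data, data)
acc_imputed = nn_accuracy(imputed, data)
print(f"imputation MAE: {mae:.2f}, "
      f"accuracy original: {acc_original:.2f}, imputed: {acc_imputed:.2f}")
```

A small gap between the two accuracies despite a non-zero imputation error (or vice versa) is exactly the kind of situation the article argues cannot be detected by prediction-capability measures alone.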