Numerous industrial and research databases include missing values. It is not uncommon to encounter databases with up to half of the entries missing, which makes it very difficult to mine them with data analysis methods that require complete data. A common way of dealing with this problem is to impute (fill in) the missing values. This paper evaluates how the choice of imputation method affects the performance of classifiers that are subsequently trained on the imputed data. The experiments focus on discrete data. The paper studies the effect of missing-data imputation using five single imputation methods (a mean method, a hot-deck method, a Naïve Bayes method, and the latter two methods combined with a recently proposed imputation framework) and one multiple imputation method (a polytomous-regression-based method) on the classification accuracy of six popular classifiers (RIPPER, C4.5, K-nearest neighbor, support vector machines with polynomial and RBF kernels, and Naïve Bayes) on 15 datasets. This experimental study shows that imputation with the tested methods on average improves classification accuracy compared to classification without imputation. Although the results show that there is no universally best imputation method, Naïve Bayes imputation gives the best results for the RIPPER classifier on datasets with a high proportion (i.e., 40% and 50%) of missing data, polytomous regression imputation is best for the support vector machine with a polynomial kernel, and the application of the imputation framework is superior for the support vector machine with an RBF kernel and for K-nearest neighbor. The analysis of imputation quality with respect to varying amounts of missing data (i.e., between 5% and 50%) shows that all imputation methods, except for mean imputation, reduce classification error for data with more than 10% of missing values.
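Two of the single imputation methods studied, mean and hot-deck imputation, can be sketched briefly. The NumPy snippet below is a minimal illustration under our own assumptions, not the paper's implementation; the function names and the toy matrix are ours, and real hot-deck schemes often select donors by record similarity rather than uniformly at random.

```python
import numpy as np

def mean_impute(X):
    """Replace each NaN with the mean of the observed values in its column."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def hot_deck_impute(X, rng):
    """Replace each NaN with a value drawn at random from the observed
    entries of the same column (a simple random hot deck)."""
    X = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]          # view into the copy, modified in place
        missing = np.isnan(col)
        col[missing] = rng.choice(col[~missing], size=missing.sum())
    return X

# Toy matrix with two missing entries
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])
rng = np.random.default_rng(0)
print(mean_impute(X))  # NaNs replaced by the column means 2.0 and 3.0
```

Mean imputation preserves the column mean but shrinks its variance, whereas hot-deck imputation draws from the observed marginal distribution, which is one reason the two can affect downstream classifiers differently.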
Finally, some classifiers, such as C4.5 and Naïve Bayes, were found to be missing-data resistant, i.e., they can produce accurate classifications in the presence of missing data, while other classifiers, such as K-nearest neighbor, SVMs, and RIPPER, benefit from the imputation.
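The missingness-rate analysis (5% to 50%) can be reproduced in outline: delete values completely at random at a chosen rate, impute, and train a classifier on the result. The sketch below is a simplified stand-in for the paper's protocol; it uses a nearest-centroid classifier and a synthetic two-blob dataset of our own devising rather than the six classifiers and 15 datasets studied.

```python
import numpy as np

def inject_mcar(X, rate, rng):
    """Set each entry to NaN independently with probability `rate` (MCAR)."""
    X = X.astype(float).copy()
    X[rng.random(X.shape) < rate] = np.nan
    return X

def mean_impute(X):
    """Column-mean imputation of the NaN entries."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def nearest_centroid_acc(X, y):
    """Training accuracy of a nearest-centroid classifier (illustrative only)."""
    classes = np.unique(y)
    centroids = np.stack([X[y == k].mean(axis=0) for k in classes])
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = classes[np.argmin(d, axis=1)]
    return (pred == y).mean()

# Synthetic dataset: two well-separated Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(3, 1, (50, 4))])
y = np.repeat([0, 1], 50)

X_miss = inject_mcar(X, 0.3, rng)   # 30% of entries missing
X_imp = mean_impute(X_miss)
acc_full = nearest_centroid_acc(X, y)
acc_imp = nearest_centroid_acc(X_imp, y)
```

Sweeping `rate` over 0.05 to 0.50 and swapping in different imputers and classifiers gives the kind of accuracy-versus-missingness curves the study reports.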