Missing values: how many can they be to preserve classification reliability?

Authors:
Martti Juhola; Jorma Laurikkala
Affiliations:
Department of Computer Sciences, University of Tampere, Tampere, Finland 33014; Department of Computer Sciences, University of Tampere, Tampere, Finland 33014
Venue:
Artificial Intelligence Review
Year:
2013

Abstract

Using five medical datasets, we examined the influence of missing values on true positive rates and classification accuracy. We randomly marked an increasing proportion of values as missing and measured the effect on classification accuracy. The classifications were performed with nearest neighbour searching when none, 10, 20, 30% or more of the values were missing. We also used discriminant analysis and the naïve Bayes method for classification. We found that for a two-class dataset, results almost as good as those with no missing values could still be produced even with 20-30% of values missing. If there are more than two classes, more than 10-20% missing values is probably too many, at least for small classes with relatively few cases. The more classes there are, and the more their sizes differ, the more sensitive a classification task is to missing values. On the other hand, when values are missing not fully at random but according to actual distributions shaped by some selection or other non-random cause, classification can tolerate even high numbers of missing values for some datasets.