Evaluating the Impact of Missing Data Imputation

Authors:
Adam Pantanowitz; Tshilidzi Marwala
Affiliations:
School of Electrical & Information Engineering, University of the Witwatersrand, Johannesburg, Wits, South Africa 2050 (both authors)
Venue:
ADMA '09 Proceedings of the 5th International Conference on Advanced Data Mining and Applications
Year:
2009

Abstract

This paper assesses the impact of missing data imputation. The impact is measured on the statistical nature of the data, on a classifier, and on a logistic regression system. The data set is HIV seroprevalence data from an antenatal clinic survey conducted in 2001. Imputation is performed with Random Forests, selected because they outperformed five other imputation techniques. The test sets consist of the original data and of imputed data in which varying numbers of specifically selected variables are treated as missing and imputed. Results indicate that, for this data set, the evaluated properties and tested paradigms are fairly robust to missing data imputation. The impact is not highly significant; for example, under the logistic regression analysis, the HIV status probability predictions obtained from the full set and from a set with two imputed variables have a linear correlation of 96%.
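The comparison described in the abstract (predictions from the full set versus predictions from a set with two imputed variables, summarised by a linear correlation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic rather than the HIV seroprevalence survey, the logistic regression is a small gradient-descent fit, and column-mean imputation stands in for the Random Forest imputer used in the paper. The names `fit_logistic`, `predict_proba`, `pearson`, and `impute_with_mean` are hypothetical helpers, not the authors' code.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, x):
    # w holds one weight per feature followed by the bias term.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])


def fit_logistic(X, y, lr=0.1, epochs=300):
    # Plain batch gradient descent on the logistic loss.
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j in range(d):
                grad[j] += err * xi[j]
            grad[d] += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def pearson(a, b):
    # Linear (Pearson) correlation coefficient between two sequences.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def impute_with_mean(X, columns):
    # Stand-in imputer (NOT Random Forests): treat the given columns as
    # missing for every record and fill them with the column mean.
    means = {c: sum(row[c] for row in X) / len(X) for c in columns}
    return [[means[j] if j in means else v for j, v in enumerate(row)]
            for row in X]


# Synthetic data: four features and a binary outcome (assumed, illustrative).
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]
true_w = [1.5, -1.0, 0.5, 0.8]
y = [1 if rng.random() < sigmoid(sum(w * v for w, v in zip(true_w, row))) else 0
     for row in X]

# Train on complete data, then predict on the full and the imputed sets.
w = fit_logistic(X, y)
X_imputed = impute_with_mean(X, columns=[2, 3])  # two imputed variables
p_full = [predict_proba(w, row) for row in X]
p_imp = [predict_proba(w, row) for row in X_imputed]

r = pearson(p_full, p_imp)
print(f"Correlation between full-set and imputed-set predictions: {r:.3f}")
```

The value printed here depends entirely on the synthetic data and the stand-in imputer, so it is not comparable to the 96% figure reported for the survey data. A stronger imputer that exploits relationships between variables, as Random Forests do, would generally be expected to preserve more of the information in the missing variables than a column mean, although that is not demonstrated by this sketch.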