The problem of disguised missing data

Authors:
Ronald K. Pearson
Affiliations:
ProSanos Corporation, Harrisburg, PA
Venue:
ACM SIGKDD Explorations Newsletter
Year:
2006

Abstract

Missing data is a well-recognized problem in large datasets, widely discussed in the statistics and data analysis literature. Many programming environments provide explicit codes for missing data, but these are not standardized and are not always used. This lack of standardization is one of the leading causes of the subtle problem of disguised missing data, in which unknown, inapplicable, or otherwise nonspecified responses are encoded as valid data values. Following a brief overview of the problem of explicitly coded missing data, this paper discusses sources, consequences, and detection of disguised missing data, including two real-world examples. As the first of these examples illustrates, the consequences of disguised missing data can be quite serious. The key to its detection lies in, first, recognizing disguised missing data as a possibility and, second, finding a sufficiently informative view of the data to reveal its presence.
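
To make the idea of a missing response "encoded as a valid data value" concrete, the following is a minimal Python sketch of one possible check: flagging values that occur far more often than is typical for the column. It is not the detection procedure described in the paper. The function name `suspicious_values`, the thresholds `min_share` and `ratio`, and the sample ages are all assumptions made for illustration.

```python
from collections import Counter


def suspicious_values(values, min_share=0.05, ratio=5.0):
    """Return (value, count) pairs that occur unusually often.

    Hypothetical heuristic (an assumption, not the paper's method):
    a value is flagged when it makes up at least `min_share` of the
    column AND its count is at least `ratio` times the median count
    of all distinct values. A spike like this can be a placeholder
    such as 0, 99, or -1 standing in for "unknown".
    """
    counts = Counter(values)
    n = len(values)
    if n == 0 or len(counts) < 2:
        return []  # nothing to compare against

    # Median count over the distinct values (upper median for even sizes).
    freqs = sorted(counts.values())
    median = freqs[len(freqs) // 2]

    return [
        (value, count)
        for value, count in counts.most_common()
        if count / n >= min_share and count >= ratio * median
    ]


# Illustrative, made-up ages in which 99 is used as an "unknown" placeholder.
ages = [34, 41, 29, 99, 52, 99, 38, 99, 45, 99, 61, 99, 27, 33, 99, 48]
print(suspicious_values(ages))  # [(99, 6)]
```

A flagged value is only a candidate. Whether 99 is a placeholder or a genuine observation still has to be judged from knowledge of how the data were collected, which is why the abstract stresses recognizing the possibility first.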