DiMaC: a system for cleaning disguised missing data

Authors:
Ming Hua;Jian Pei
Affiliations:
Simon Fraser University, Burnaby, BC, Canada;Simon Fraser University, Burnaby, BC, Canada
Venue:
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Year:
2008

Citing 5
Cited 1

Statistical analysis with missing data.
The EM algorithm for graphical association models with missing data. Computational Statistics & Data Analysis - Special issue dedicated to Tomáš Havránek.
The problem of disguised missing data. ACM SIGKDD Explorations Newsletter.
Cleaning disguised missing data: a heuristic approach. Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining.
Further experimental evidence against the utility of Occam's razor. Journal of Artificial Intelligence Research.

Information enhancement for data mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.

Abstract

In some applications, such as filling in a customer information form on the web, missing values may not be explicitly represented as such but instead appear as potentially valid data values. Such missing values are known as disguised missing data, and they may severely impair the quality of data analysis. The few previous studies on cleaning disguised missing data rely heavily on domain background knowledge in specific applications and may not work well when the disguise values are inliers. Recently, we studied the problem of cleaning disguised missing data systematically and proposed an effective heuristic approach [2]. In this paper, we describe a demonstration of DiMaC, a Disguised Missing Data Cleaning system that can find the frequently used disguise values in data sets without requiring any domain background knowledge. In this demo, we will show (1) the critical techniques for finding suspicious disguise values; (2) the architecture and user interface of the DiMaC system; (3) an empirical case study on both real and synthetic data sets, which verifies the effectiveness and efficiency of the techniques; and (4) some challenges arising from real applications and several directions for future work.
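
To make the notion of a disguise value concrete, the following is a minimal Python sketch of a naive frequency screen. It is not the heuristic of [2] and not DiMaC's implementation, neither of which is specified in this abstract. The function name `candidate_disguise_values`, the `min_share` threshold, and the sample records are all hypothetical, chosen only for illustration. The sketch assumes that a form default (here, the first entry of a state drop-down list) is over-represented because users who skip the field leave it unchanged.

```python
from collections import Counter


def candidate_disguise_values(records, min_share=0.5):
    """Flag, per attribute, the most frequent value if it dominates the column.

    Illustrative assumption only: a value covering at least `min_share`
    of the records is treated as a *candidate* disguise value.
    """
    candidates = {}
    if not records:
        return candidates
    for attr in records[0]:
        # Count how often each value occurs in this attribute.
        counts = Counter(record[attr] for record in records)
        value, count = counts.most_common(1)[0]
        share = count / len(records)
        if share >= min_share:
            candidates[attr] = (value, share)
    return candidates


# Hypothetical web-form records: "Alabama" is the drop-down default.
forms = [
    {"state": "Alabama", "birth_month": "Jan"},
    {"state": "Alabama", "birth_month": "Mar"},
    {"state": "Ohio", "birth_month": "Jul"},
    {"state": "Alabama", "birth_month": "Sep"},
    {"state": "Texas", "birth_month": "Nov"},
    {"state": "Alabama", "birth_month": "Feb"},
]

result = candidate_disguise_values(forms)
print(result)
```

Note that "Alabama" is a perfectly valid value, which is why range or outlier checks alone would not catch it. A frequency screen like this can only nominate candidates; it cannot tell a disguise value apart from a value that is simply common in the population.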